Dataset: neulab/conala
Multilinguality: monolingual
Source datasets: original
ArXiv: arxiv:1805.08949
Other: code-generation
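CoNaLa is a benchmark dataset of aligned code and natural language pairs for evaluating code generation. The data was crawled from Stack Overflow, filtered automatically, and then curated by annotators; the curated set is split into 2,379 training examples and 500 test examples. An automatically mined dataset with almost 600k examples is also provided.
The dataset is used to evaluate code generation tasks.
English - Python code.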
from datasets import load_dataset

dataset_curated = load_dataset("neulab/conala")
DatasetDict({
    train: Dataset({
        features: ['question_id', 'intent', 'rewritten_intent', 'snippet'],
        num_rows: 2379
    })
    test: Dataset({
        features: ['question_id', 'intent', 'rewritten_intent', 'snippet'],
        num_rows: 500
    })
})
dataset_mined = load_dataset("neulab/conala", "mined")
DatasetDict({
    train: Dataset({
        features: ['question_id', 'parent_answer_post_id', 'prob', 'snippet', 'intent', 'id'],
        num_rows: 593891
    })
})
CoNaLa - curated
This is the dataset curated by annotators.
{
'question_id': 41067960,
'intent': 'How to convert a list of multiple integers into a single integer?',
'rewritten_intent': "Concatenate elements of a list 'x' of multiple integers to a single integer",
'snippet': 'sum(d * 10 ** i for i, d in enumerate(x[::-1]))'
}
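To make the curated record above concrete, its snippet can be run on a small input. This is only a sketch: the wrapper name `concat_digits` and the input list are illustrative assumptions and are not part of the dataset.

```python
# Snippet from the curated record above, wrapped in a hypothetical helper.
def concat_digits(x):
    # Reverse the list so that position i is the power of ten of each element.
    return sum(d * 10 ** i for i, d in enumerate(x[::-1]))

x = [1, 2, 3]  # illustrative input (assumption)
result = concat_digits(x)
print(result)  # 123
```

The snippet matches plain digit concatenation when every element is a single digit; for example, `[1, 23]` yields 33 rather than 123.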
CoNaLa - mined
This is the automatically mined dataset, before curation.
{
'question_id': 34705205,
'parent_answer_post_id': 34705233,
'prob': 0.8690001442846342,
'snippet': 'sorted(l, key=lambda x: (-int(x[1]), x[0]))',
'intent': 'Sort a nested list by two elements',
'id': '34705205_34705233_0'
}
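The mined snippet above can be exercised in the same way. The helper name `sort_nested` and the sample rows below are assumptions chosen for illustration; they do not come from the dataset.

```python
# Snippet from the mined record above, wrapped in a hypothetical helper.
def sort_nested(l):
    # Sort by the second element (read as an integer) in descending order,
    # then by the first element in ascending order to break ties.
    return sorted(l, key=lambda x: (-int(x[1]), x[0]))

rows = [["b", "2"], ["a", "2"], ["c", "10"]]  # illustrative input (assumption)
ordered = sort_nested(rows)
print(ordered)  # [['c', '10'], ['a', '2'], ['b', '2']]
```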
Curated data:
| Field | Type | Description |
|---|---|---|
| question_id | int64 | Id of the Stack Overflow question |
| intent | string | Natural Language intent (i.e., the title of a Stack Overflow question) |
| rewritten_intent | string | Crowdsourced revised intents that try to better reflect the full meaning of the code |
| snippet | string | Code snippet that implements the intent |
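One possible way to turn a curated record into an (input, target) pair for code generation is sketched below. The helper `to_pair` is hypothetical, and the fallback to `intent` assumes that `rewritten_intent` may be empty or missing for some records, which this card does not state.

```python
def to_pair(record):
    """Return (natural-language input, target code) for one curated record."""
    # Assumption: prefer the rewritten intent and fall back to the original
    # question title when no rewritten intent is available.
    text = record.get("rewritten_intent") or record["intent"]
    return text, record["snippet"]

# Curated record shown earlier in this card.
example = {
    "question_id": 41067960,
    "intent": "How to convert a list of multiple integers into a single integer?",
    "rewritten_intent": "Concatenate elements of a list 'x' of multiple integers to a single integer",
    "snippet": "sum(d * 10 ** i for i, d in enumerate(x[::-1]))",
}

source, target = to_pair(example)
print(source)
print(target)
```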
Mined data:
| Field | Type | Description |
|---|---|---|
| question_id | int64 | Id of the Stack Overflow question |
| parent_answer_post_id | int64 | Id of the answer post from which the candidate snippet is extracted |
| intent | string | Natural Language intent (i.e., the title of a Stack Overflow question) |
| snippet | string | Code snippet that implements the intent |
| id | string | Unique id for this intent/snippet pair |
| prob | float64 | Probability given by the mining model |
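Because each mined pair carries a `prob` score, a common use is to keep only the pairs above some cutoff. The sketch below works on plain dictionaries; the helper `filter_by_prob`, the 0.5 threshold, and the second record are assumptions made for illustration, and the card does not recommend any particular cutoff.

```python
def filter_by_prob(records, min_prob=0.5):
    """Keep mined records whose mining-model probability is at least min_prob."""
    # min_prob=0.5 is an arbitrary illustrative default (assumption).
    return [r for r in records if r["prob"] >= min_prob]

records = [
    # Mined record shown earlier in this card.
    {
        "question_id": 34705205,
        "parent_answer_post_id": 34705233,
        "prob": 0.8690001442846342,
        "snippet": "sorted(l, key=lambda x: (-int(x[1]), x[0]))",
        "intent": "Sort a nested list by two elements",
        "id": "34705205_34705233_0",
    },
    # Made-up placeholder with a low score, included only to show the filter.
    {
        "question_id": 0,
        "parent_answer_post_id": 0,
        "prob": 0.1,
        "snippet": "pass",
        "intent": "placeholder intent",
        "id": "placeholder",
    },
]

kept = filter_by_prob(records)
print([r["id"] for r in kept])  # ['34705205_34705233_0']
```

With a dataset loaded through the `datasets` library, the same predicate could be passed to `Dataset.filter` instead of a list comprehension.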
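The dataset comes in two versions, curated and mined. The mined version has only a train split, while the curated version has two splits: train and test.
The dataset was crawled from Stack Overflow, filtered automatically, and then curated by annotators. For more details, see the original paper.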
@inproceedings{yin2018learning,
title={Learning to mine aligned code and natural language pairs from stack overflow},
author={Yin, Pengcheng and Deng, Bowen and Chen, Edgar and Vasilescu, Bogdan and Neubig, Graham},
booktitle={2018 IEEE/ACM 15th international conference on mining software repositories (MSR)},
pages={476--486},
year={2018},
organization={IEEE}
}