neulab/conala

Dataset:

neulab/conala

Tasks:

text2text-generation

Languages:

code

Multilinguality:

monolingual

Size:

unknown

Language creators:

crowdsourced, expert-generated

Source datasets:

original

Preprint:

arxiv:1805.08949

Other:

code-generation

License:

mit

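Dataset Summary

CoNaLa is a benchmark of aligned code and natural language pairs for evaluating code generation. The dataset was crawled from Stack Overflow, automatically filtered, and then curated by annotators. The curated data is split into 2,379 training examples and 500 test examples. An automatically mined dataset with almost 600k examples is also provided.

Supported Tasks and Leaderboards

This dataset is used to evaluate code generation.

Languages

English - Python code.

Dataset Structure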

from datasets import load_dataset

# Curated version (default configuration)
dataset_curated = load_dataset("neulab/conala")
print(dataset_curated)
# DatasetDict({
#     train: Dataset({
#         features: ['question_id', 'intent', 'rewritten_intent', 'snippet'],
#         num_rows: 2379
#     })
#     test: Dataset({
#         features: ['question_id', 'intent', 'rewritten_intent', 'snippet'],
#         num_rows: 500
#     })
# })

# Mined version ("mined" configuration)
dataset_mined = load_dataset("neulab/conala", "mined")
print(dataset_mined)
# DatasetDict({
#     train: Dataset({
#         features: ['question_id', 'parent_answer_post_id', 'prob', 'snippet', 'intent', 'id'],
#         num_rows: 593891
#     })
# })

Data Instances

CoNaLa - Curated

This is the dataset curated by annotators.

{
    'question_id': 41067960,
    'intent': 'How to convert a list of multiple integers into a single integer?',
    'rewritten_intent': "Concatenate elements of a list 'x' of multiple integers to a single integer",
    'snippet': 'sum(d * 10 ** i for i, d in enumerate(x[::-1]))'
}
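To see how the snippet relates to the rewritten intent, the curated record above can be run directly. This is a minimal sketch: the list `x` is an illustrative input chosen here, not part of the dataset.

```python
# Illustrative input; the dataset record only supplies the snippet itself.
x = [1, 2, 3]

# Snippet from the curated record above: each element is weighted by a power
# of ten according to its position counted from the right.
result = sum(d * 10 ** i for i, d in enumerate(x[::-1]))

print(result)  # 123
```

Note that the snippet matches the intent only when every element of `x` is a single digit. For `[1, 23]` it yields 33 rather than 123, because each element is shifted by exactly one decimal place.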

CoNaLa - Mined

This is the automatically mined dataset, before curation.

{
    'question_id': 34705205,
    'parent_answer_post_id': 34705233,
    'prob': 0.8690001442846342,
    'snippet': 'sorted(l, key=lambda x: (-int(x[1]), x[0]))',
    'intent': 'Sort a nested list by two elements',
    'id': '34705205_34705233_0'
}
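The mined snippet above can be tried on a small input to see what "sort by two elements" means in practice. This is a sketch under assumptions: the nested list `l` is hypothetical and was chosen so that both sort keys matter.

```python
# Hypothetical nested list: each item is [name, numeric string].
l = [["b", "2"], ["a", "2"], ["c", "10"]]

# Snippet from the mined record above: sort by the second element as an
# integer in descending order, then by the first element in ascending order.
ordered = sorted(l, key=lambda x: (-int(x[1]), x[0]))

print(ordered)  # [['c', '10'], ['a', '2'], ['b', '2']]
```

Converting with `int` matters here: compared as strings, `'10'` would sort before `'2'`.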

Data Fields

Curated:

Field	Type	Description
question_id	int64	Id of the Stack Overflow question
intent	string	Natural Language intent (i.e., the title of a Stack Overflow question)
rewritten_intent	string	Crowdsourced revised intents that try to better reflect the full meaning of the code
snippet	string	Code snippet that implements the intent
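When building model inputs from the curated fields, one option is to prefer `rewritten_intent` and fall back to `intent`. The sketch below assumes that some records may have an empty or missing `rewritten_intent`, which this page does not state. The helper name `model_input` and the second record are hypothetical.

```python
def model_input(record):
    """Prefer the rewritten intent; fall back to the original question title."""
    rewritten = record.get("rewritten_intent")
    return rewritten if rewritten else record["intent"]


# The curated example shown earlier on this page.
curated = {
    "question_id": 41067960,
    "intent": "How to convert a list of multiple integers into a single integer?",
    "rewritten_intent": "Concatenate elements of a list 'x' of multiple integers to a single integer",
    "snippet": "sum(d * 10 ** i for i, d in enumerate(x[::-1]))",
}

# Hypothetical record without a rewritten intent (assumed case).
no_rewrite = {
    "question_id": 0,
    "intent": "Hypothetical question title",
    "rewritten_intent": None,
    "snippet": "pass",
}

print(model_input(curated))
print(model_input(no_rewrite))
```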

Mined:

Field	Type	Description
question_id	int64	Id of the Stack Overflow question
parent_answer_post_id	int64	Id of the answer post from which the candidate snippet is extracted
intent	string	Natural Language intent (i.e., the title of a Stack Overflow question)
snippet	string	Code snippet that implements the intent
id	string	Unique id for this intent/snippet pair
prob	float64	Probability given by the mining model
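Because each mined pair carries a `prob` score, a common way to use the mined data is to keep only pairs above a confidence cutoff. The following is a minimal sketch on plain dictionaries. The cutoff `MIN_PROB`, the helper `keep_confident`, and the second record are assumptions made for illustration, not values from the source.

```python
# Mined-style records. The first is the example shown earlier on this page;
# the second is a made-up low-confidence pair for illustration only.
records = [
    {
        "question_id": 34705205,
        "parent_answer_post_id": 34705233,
        "prob": 0.8690001442846342,
        "snippet": "sorted(l, key=lambda x: (-int(x[1]), x[0]))",
        "intent": "Sort a nested list by two elements",
        "id": "34705205_34705233_0",
    },
    {
        "question_id": 1,
        "parent_answer_post_id": 2,
        "prob": 0.05,
        "snippet": "pass",
        "intent": "Hypothetical low-confidence pair",
        "id": "1_2_0",
    },
]

MIN_PROB = 0.5  # assumed cutoff, not a value recommended by the source


def keep_confident(rows, min_prob=MIN_PROB):
    """Return the rows whose mining probability is at least min_prob."""
    return [row for row in rows if row["prob"] >= min_prob]


confident = keep_confident(records)
print([row["id"] for row in confident])  # ['34705205_34705233_0']
```

With the `datasets` library, the same predicate could be passed to `Dataset.filter` on the mined train split, for example `lambda row: row["prob"] >= MIN_PROB`.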

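Data Splits

The dataset comes in two versions, curated and mined. The mined version has only a train split; the curated version has two splits, train and test.

Dataset Creation

The dataset was crawled from Stack Overflow, automatically filtered, and then curated by annotators. For more details, see the original paper.

Citation Information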

@inproceedings{yin2018learning,
  title={Learning to mine aligned code and natural language pairs from stack overflow},
  author={Yin, Pengcheng and Deng, Bowen and Chen, Edgar and Vasilescu, Bogdan and Neubig, Graham},
  booktitle={2018 IEEE/ACM 15th international conference on mining software repositories (MSR)},
  pages={476--486},
  year={2018},
  organization={IEEE}
}

Author:

neulab

Dataset size:

153.51 MB