neulab/docprompting-conala

Dataset:

neulab/docprompting-conala

Tasks:

Text-to-text generation

Languages:

code

Multilinguality:

monolingual

Size:

size_categories:unknown

Language creators:

crowdsourced, expert-generated

Source datasets:

original

Preprints:

arxiv:2207.05987, arxiv:1805.08949

Other:

code-generation, doc retrieval, retrieval augmented generation

License:

mit

Dataset Summary

This is a re-split of the CoNaLa dataset. For each code snippet in the dev and test sets, at least one function is held out from the training set. This split aims to test a code generation model's capacity to generate unseen functions. We further make sure that examples from the same StackOverflow post (the same question_id prefix, i.e. the part before the "-") are in the same split.
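
The two split constraints can be checked on any in-memory copy of the data. The following is a minimal sketch, not the authors' split script. The helper names `post_id`, `leaked_posts`, and `has_held_out_function` are hypothetical. It assumes `question_id` is a string of the form `x-y` and that `oracle_man` lists the doc ids of the functions a snippet uses. The sample records are made up for illustration.

```python
def post_id(question_id):
    """Return the StackOverflow post part of an 'x-y' question_id."""
    return question_id.split("-", 1)[0]


def leaked_posts(split_a, split_b):
    """Post ids that occur in both splits (expected to be empty)."""
    posts_a = {post_id(ex["question_id"]) for ex in split_a}
    posts_b = {post_id(ex["question_id"]) for ex in split_b}
    return posts_a & posts_b


def has_held_out_function(example, train):
    """True if the example uses at least one function unseen in train."""
    seen = {doc for ex in train for doc in ex["oracle_man"]}
    return any(doc not in seen for doc in example["oracle_man"])


# Made-up toy records (hypothetical doc ids), only to exercise the helpers.
train = [
    {"question_id": "100-1", "oracle_man": ["doc_listdir"]},
    {"question_id": "100-2", "oracle_man": ["doc_getcwd"]},
]
test = [
    {"question_id": "200-1", "oracle_man": ["doc_listdir", "doc_copy"]},
]

print(leaked_posts(train, test))
print(all(has_held_out_function(ex, train) for ex in test))
```

On real data, the same helpers could be applied to the rows of the train, validation, and test splits.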

Supported Tasks and Leaderboards

This dataset is used to evaluate code generation models.

Languages

English - Python code.

Dataset Structure

from datasets import load_dataset

dataset = load_dataset("neulab/docprompting-conala")
DatasetDict({
    train: Dataset({
        features: ['nl', 'cmd', 'question_id', 'cmd_name', 'oracle_man', 'canonical_cmd'],
        num_rows: 2135
    })
    test: Dataset({
        features: ['nl', 'cmd', 'question_id', 'cmd_name', 'oracle_man', 'canonical_cmd'],
        num_rows: 543
    })
    validation: Dataset({
        features: ['nl', 'cmd', 'question_id', 'cmd_name', 'oracle_man', 'canonical_cmd'],
        num_rows: 201
    })
})

code_docs = load_dataset("neulab/docprompting-conala", "docs")
DatasetDict({
    train: Dataset({
        features: ['doc_id', 'doc_content'],
        num_rows: 34003
    })
})

Data Fields

train/dev/test:

nl: The natural language intent
cmd: The reference code snippet
question_id: x-y where x is the StackOverflow post ID
oracle_man: The doc_ids of the functions used in the reference code snippet. The corresponding contents are in the docs split
canonical_cmd: The canonical version of the reference code snippet

docs:

doc_id: the id of a doc
doc_content: the content of the doc
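
Because `oracle_man` stores doc ids while the text lives in the `docs` configuration, pairing an example with its documentation amounts to a dictionary lookup. The following is a minimal sketch under assumptions: `build_doc_index` and `oracle_docs` are hypothetical helper names, and the records are made up. Real rows would come from the two `load_dataset` calls shown under Dataset Structure.

```python
def build_doc_index(docs):
    """Map doc_id -> doc_content for fast lookup."""
    return {row["doc_id"]: row["doc_content"] for row in docs}


def oracle_docs(example, doc_index):
    """Return the doc contents for an example's oracle_man ids.

    Ids missing from the index are skipped rather than raising.
    """
    return [doc_index[d] for d in example["oracle_man"] if d in doc_index]


# Made-up toy records that mirror the documented field names.
docs = [
    {"doc_id": "doc_a", "doc_content": "Content of doc A."},
    {"doc_id": "doc_b", "doc_content": "Content of doc B."},
]
example = {
    "nl": "a toy intent",
    "cmd": "toy_call()",
    "question_id": "1-0",
    "oracle_man": ["doc_b", "doc_missing"],
}

index = build_doc_index(docs)
print(oracle_docs(example, index))
```

Skipping unknown ids is a design choice of this sketch. A stricter variant could raise an error so that mismatches between the two configurations surface early.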

Dataset Creation

The dataset was crawled from Stack Overflow, automatically filtered, then curated by annotators. For more details, please refer to the original CoNaLa paper (arXiv:1805.08949).

Citation Information

@article{zhou2022doccoder,
  title={DocCoder: Generating Code by Retrieving and Reading Docs},
  author={Zhou, Shuyan and Alon, Uri and Xu, Frank F and Jiang, Zhengbao and Neubig, Graham},
  journal={arXiv preprint arXiv:2207.05987},
  year={2022}
}

Author:

neulab

Dataset size:

60.04 MB