Dataset:
neulab/docprompting-conala
Tasks:
Languages:
Multilinguality:
monolingual
Source datasets:
original
Other:
code-generation, doc retrieval, retrieval augmented generation
License:
This is a re-split of the CoNaLa dataset. For each code snippet in the dev and test sets, at least one function is held out from the training set. The split is designed to test a code generation model's ability to generate unseen functions. We further ensure that examples from the same StackOverflow post (the same question_id before the "-") are in the same split.
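The grouping constraint on question_id can be checked with a short sketch. It assumes that the post identifier is the part of question_id before the first "-", and it runs on made-up rows rather than the real data. The helper names post_id and leaked_posts are hypothetical and not part of the dataset or any library.

```python
def post_id(question_id: str) -> str:
    """Return the part of question_id before the first '-' (assumed to identify the post)."""
    return question_id.split("-", 1)[0]


def leaked_posts(splits: dict[str, list[dict]]) -> set[str]:
    """Return post ids that occur in more than one split."""
    seen: dict[str, str] = {}  # post id -> first split it was seen in
    leaked: set[str] = set()
    for split_name, rows in splits.items():
        for row in rows:
            pid = post_id(row["question_id"])
            if seen.setdefault(pid, split_name) != split_name:
                leaked.add(pid)
    return leaked


# Toy rows for illustration only; these are not taken from the dataset.
toy_splits = {
    "train": [{"question_id": "100-1"}, {"question_id": "100-2"}],
    "validation": [{"question_id": "200-1"}],
    "test": [{"question_id": "300-1"}, {"question_id": "300-7"}],
}
print(leaked_posts(toy_splits))  # an empty set means no post spans two splits
```

With the real data, the same check could presumably be run by passing a mapping from split name to the loaded split, since iterating a loaded split yields one dict per row.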
This dataset is used to evaluate code generation.
English - Python code.
from datasets import load_dataset
dataset = load_dataset("neulab/docprompting-conala")
DatasetDict({
    train: Dataset({
        features: ['nl', 'cmd', 'question_id', 'cmd_name', 'oracle_man', 'canonical_cmd'],
        num_rows: 2135
    })
    test: Dataset({
        features: ['nl', 'cmd', 'question_id', 'cmd_name', 'oracle_man', 'canonical_cmd'],
        num_rows: 543
    })
    validation: Dataset({
        features: ['nl', 'cmd', 'question_id', 'cmd_name', 'oracle_man', 'canonical_cmd'],
        num_rows: 201
    })
})
code_docs = load_dataset("neulab/docprompting-conala", "docs")
DatasetDict({
    train: Dataset({
        features: ['doc_id', 'doc_content'],
        num_rows: 34003
    })
})
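One way the docs configuration might be used alongside the main splits is sketched below: build a lookup from doc_id to doc_content, then resolve the documents for one example. The sketch assumes that oracle_man holds a list of doc_id values, which this card does not state. The rows are made up, and the helper names build_doc_index and oracle_docs are hypothetical.

```python
def build_doc_index(doc_rows):
    """Map each doc_id to its doc_content."""
    return {row["doc_id"]: row["doc_content"] for row in doc_rows}


def oracle_docs(example, doc_index):
    """Return the doc_content for each id in example['oracle_man'] found in the index.

    Assumption: oracle_man is a list of doc_id strings.
    """
    return [doc_index[d] for d in example["oracle_man"] if d in doc_index]


# Toy rows for illustration only; these are not taken from the dataset.
toy_docs = [
    {"doc_id": "doc_a", "doc_content": "Documentation text for A."},
    {"doc_id": "doc_b", "doc_content": "Documentation text for B."},
]
toy_example = {"nl": "do something", "cmd": "a()", "oracle_man": ["doc_a", "doc_missing"]}

index = build_doc_index(toy_docs)
print(oracle_docs(toy_example, index))
```

Ids that are missing from the index are skipped rather than raising an error, so a partial documentation pool still yields whatever matches exist.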
train/dev/test:
docs:
The dataset was crawled from Stack Overflow, automatically filtered, and then curated by annotators. For more details, please refer to the original paper.
@article{zhou2022doccoder,
title={DocCoder: Generating Code by Retrieving and Reading Docs},
author={Zhou, Shuyan and Alon, Uri and Xu, Frank F and Jiang, Zhengbao and Neubig, Graham},
journal={arXiv preprint arXiv:2207.05987},
year={2022}
}