Dataset:
neulab/mconala
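MCoNaLa is a multilingual code/natural-language challenge dataset containing 896 natural language-to-code (NL-Code) pairs in three languages: Spanish, Japanese, and Russian.
Spanish, Japanese, Russian; Python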
from datasets import load_dataset
# Spanish subset
load_dataset("neulab/mconala", "es")
DatasetDict({
    test: Dataset({
        features: ['question_id', 'intent', 'rewritten_intent', 'snippet'],
        num_rows: 341
    })
})
# Japanese subset
load_dataset("neulab/mconala", "ja")
DatasetDict({
    test: Dataset({
        features: ['question_id', 'intent', 'rewritten_intent', 'snippet'],
        num_rows: 210
    })
})
# Russian subset
load_dataset("neulab/mconala", "ru")
DatasetDict({
    test: Dataset({
        features: ['question_id', 'intent', 'rewritten_intent', 'snippet'],
        num_rows: 345
    })
})
| Field | Type | Description |
|---|---|---|
| question_id | int | Stack Overflow post ID of the sample |
| intent | string | Title of the Stack Overflow post, used as the initial NL intent |
| rewritten_intent | string | NL intent rewritten by human annotators |
| snippet | string | Python code solution to the NL intent |
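
The fields in the table can be consumed roughly as follows. This is a minimal sketch, not part of the dataset's own tooling: `MCoNaLaSample`, `preferred_intent`, and `to_pair` are hypothetical names, the sample values are placeholders rather than real rows, and the fallback to `intent` assumes that `rewritten_intent` may be empty for some rows, which the table does not state.

```python
from typing import Optional, Tuple, TypedDict


class MCoNaLaSample(TypedDict):
    """One row, mirroring the four fields listed in the table."""
    question_id: int
    intent: str
    rewritten_intent: Optional[str]  # assumption: may be empty or missing
    snippet: str


def preferred_intent(sample: MCoNaLaSample) -> str:
    """Prefer the annotator-rewritten intent; fall back to the post title."""
    rewritten = sample.get("rewritten_intent")
    return rewritten if rewritten else sample["intent"]


def to_pair(sample: MCoNaLaSample) -> Tuple[str, str]:
    """Turn a row into an (NL intent, Python snippet) pair."""
    return preferred_intent(sample), sample["snippet"]


# Hypothetical placeholder row, not taken from the dataset.
example: MCoNaLaSample = {
    "question_id": 0,
    "intent": "placeholder post title",
    "rewritten_intent": "placeholder rewritten intent",
    "snippet": "print('hello')",
}
nl, code = to_pair(example)
```

Passing a row from `load_dataset("neulab/mconala", "es")["test"]` to `to_pair` should work the same way, since each row is a plain dictionary with these keys.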
The dataset contains 341 Spanish samples, 210 Japanese samples, and 345 Russian samples.
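
As a sanity check, the per-language sizes can be compared against what is actually downloaded. The sketch below assumes the `datasets` library is installed and the Hub is reachable when no loader is passed; `EXPECTED_ROWS`, `count_test_rows`, and `check_counts` are hypothetical helper names introduced here for illustration.

```python
from typing import Callable, Dict, Optional

# Subset sizes as stated in the dataset description.
EXPECTED_ROWS: Dict[str, int] = {"es": 341, "ja": 210, "ru": 345}


def count_test_rows(loader: Optional[Callable] = None) -> Dict[str, int]:
    """Return the number of rows in the `test` split of each language subset."""
    if loader is None:
        # Imported lazily so this module loads even without `datasets` installed.
        from datasets import load_dataset
        loader = load_dataset
    return {
        lang: loader("neulab/mconala", lang)["test"].num_rows
        for lang in EXPECTED_ROWS
    }


def check_counts(counts: Dict[str, int]) -> int:
    """Raise if any subset size differs from the documented one; return the total."""
    for lang, expected in EXPECTED_ROWS.items():
        if counts[lang] != expected:
            raise ValueError(f"{lang}: expected {expected} rows, got {counts[lang]}")
    return sum(counts.values())
```

Calling `check_counts(count_test_rows())` downloads all three subsets and returns the overall number of NL-Code pairs if the sizes match.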
@article{wang2022mconala,
title={MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages},
author={Zhiruo Wang and Grace Cuenca and Shuyan Zhou and Frank F. Xu and Graham Neubig},
journal={arXiv preprint arXiv:2203.08388},
year={2022}
}