Dataset:

neulab/mconala

Dataset Card for MCoNaLa

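Dataset Summary

MCoNaLa is a multilingual code/natural-language challenge dataset containing 896 natural language-to-code (NL-Code) pairs in three natural languages: Spanish, Japanese, and Russian.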

Languages

Spanish, Japanese, Russian; Python

Dataset Structure

How to Use

from datasets import load_dataset

# Spanish subset
es = load_dataset("neulab/mconala", "es")
print(es)
# DatasetDict({
#     test: Dataset({
#         features: ['question_id', 'intent', 'rewritten_intent', 'snippet'],
#         num_rows: 341
#     })
# })

# Japanese subset
ja = load_dataset("neulab/mconala", "ja")
print(ja)
# DatasetDict({
#     test: Dataset({
#         features: ['question_id', 'intent', 'rewritten_intent', 'snippet'],
#         num_rows: 210
#     })
# })

# Russian subset
ru = load_dataset("neulab/mconala", "ru")
print(ru)
# DatasetDict({
#     test: Dataset({
#         features: ['question_id', 'intent', 'rewritten_intent', 'snippet'],
#         num_rows: 345
#     })
# })
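
The three subsets share one repository name and differ only in the configuration code, so they can be loaded in a loop. The following is a minimal sketch rather than part of the official card: the helper name `load_all_subsets` and the injected `loader` parameter are assumptions made for illustration, so that the helper can be exercised without network access.

```python
from typing import Any, Callable, Dict

# Configuration codes listed in this card: Spanish, Japanese, Russian.
SUBSETS = ("es", "ja", "ru")


def load_all_subsets(
    loader: Callable[[str, str], Any],
    repo: str = "neulab/mconala",
) -> Dict[str, Any]:
    """Call loader(repo, lang) once per subset and key the results by language code.

    `loader` is expected to behave like `datasets.load_dataset`; it is passed in
    (hypothetical design choice) instead of being imported here.
    """
    return {lang: loader(repo, lang) for lang in SUBSETS}
```

With the `datasets` library installed, `load_all_subsets(load_dataset)` should return a dictionary whose `"es"`, `"ja"`, and `"ru"` entries are the `DatasetDict` objects shown above.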

Data Fields

| Field | Type | Description |
|-------|------|-------------|
| question_id | int | Stack Overflow post ID of the sample |
| intent | string | Title of the Stack Overflow post, used as the initial NL intent |
| rewritten_intent | string | NL intent rewritten by human annotators |
| snippet | string | Python code solution to the NL intent |
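
To make the field layout concrete, the sketch below types one record and extracts an (NL intent, code) pair from it. The names `MCoNaLaSample` and `to_nl_code_pair` are hypothetical, the fallback to `intent` when `rewritten_intent` is empty is an assumption rather than documented dataset behavior, and the example record is a made-up placeholder, not a row from the dataset.

```python
from typing import Tuple, TypedDict


class MCoNaLaSample(TypedDict):
    """One record, mirroring the four fields listed in this card."""
    question_id: int
    intent: str
    rewritten_intent: str
    snippet: str


def to_nl_code_pair(sample: MCoNaLaSample) -> Tuple[str, str]:
    """Return (NL intent, Python snippet), preferring the rewritten intent."""
    # Assumption: fall back to the post title if the rewritten intent is empty.
    nl = sample["rewritten_intent"] or sample["intent"]
    return nl, sample["snippet"]


# Placeholder record invented for illustration; not taken from the dataset.
example: MCoNaLaSample = {
    "question_id": 0,
    "intent": "placeholder post title",
    "rewritten_intent": "placeholder rewritten intent",
    "snippet": "print('hello')",
}

nl_text, code = to_nl_code_pair(example)
```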

Data Splits

The dataset contains 341 Spanish samples, 210 Japanese samples, and 345 Russian samples, each provided as a single test split.
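
As a quick consistency check, the per-language counts can be summed and compared with the 896 pairs stated in the summary. This is only an arithmetic sketch; the variable names are illustrative.

```python
# Per-language sample counts as reported in this card.
SUBSET_SIZES = {"es": 341, "ja": 210, "ru": 345}

# Total number of NL-Code pairs across the three subsets.
total = sum(SUBSET_SIZES.values())

# Share of each language in the full benchmark, in percent.
shares = {lang: 100 * n / total for lang, n in SUBSET_SIZES.items()}

assert total == 896, "subset sizes should add up to the reported total"
```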

Citation Information

@article{wang2022mconala,
  title={MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages},
  author={Zhiruo Wang and Grace Cuenca and Shuyan Zhou and Frank F. Xu and Graham Neubig},
  journal={arXiv preprint arXiv:2203.08388},
  year={2022}
}