Dataset:
AhmedSSoliman/CoNaLa
This dataset has been processed for code generation. CMU CoNaLa, the Code/Natural Language Challenge, is a joint project of the Carnegie Mellon University NeuLab and STRUDEL Lab. It was designed to test systems that generate program snippets from natural language. The full corpus is available at https://conala-corpus.github.io/ ; this version contains about 13k records taken from that corpus of about 600k examples.
Language: English
A sample from this dataset looks as follows:
[
{
"intent": "convert a list to a dictionary in python",
"snippet": "b = dict(zip(a[0::2], a[1::2]))"
},
{
"intent": "python - sort a list of nested lists",
"snippet": "l.sort(key=sum_nested)"
}
]
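To make the two sample records concrete, here is a minimal sketch that runs each `snippet` on made-up inputs. The values of `a` and `l` and the helper `sum_nested` are assumptions introduced for illustration; the dataset records do not define them.

```python
# Hypothetical input: a flat list alternating keys and values.
a = ["name", "Ada", "year", 1815]

# Snippet from the first sample: pair even-indexed items (keys)
# with odd-indexed items (values).
b = dict(zip(a[0::2], a[1::2]))


def sum_nested(item):
    """Hypothetical helper: recursively sum the numbers in nested lists."""
    if isinstance(item, list):
        return sum(sum_nested(x) for x in item)
    return item


# Hypothetical input: a list of nested lists of numbers.
l = [[3, [4]], [1, 1], [[2], 2, 5]]

# Snippet from the second sample: sort in place by each element's nested sum.
l.sort(key=sum_nested)
```

As the second record shows, a snippet may rely on names that the record itself does not define, so snippets are not always runnable in isolation.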
The dataset has the following fields (also called "features"):
{
"intent": "Value(dtype='string', id=None)",
"snippet": "Value(dtype='string', id=None)"
}
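The field listing above suggests a simple sanity check for records. The sketch below is one possible way to verify that a record has exactly the two string fields; the function name `is_valid_record` is hypothetical and is not part of the dataset or of any library.

```python
def is_valid_record(record):
    """Return True if `record` is a dict with exactly the string
    fields 'intent' and 'snippet' (hypothetical helper)."""
    return (
        isinstance(record, dict)
        and set(record) == {"intent", "snippet"}
        and all(isinstance(record[key], str) for key in ("intent", "snippet"))
    )


# One of the sample records shown earlier in this card.
sample = {
    "intent": "convert a list to a dictionary in python",
    "snippet": "b = dict(zip(a[0::2], a[1::2]))",
}
sample_ok = is_valid_record(sample)
```

A check like this could be run over each split after loading to catch missing or non-string fields early.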
The dataset is divided into train, validation, and test splits. The split sizes are as follows:
| Split name | Num samples |
|---|---|
| train | 11125 |
| valid | 1237 |
| test | 500 |
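As a quick check on the table above, the following sketch adds up the split sizes and computes each split's share of the total. The variable names are illustrative; only the three counts come from the table.

```python
# Split sizes copied from the table above.
split_sizes = {"train": 11125, "valid": 1237, "test": 500}

# Total number of records across all splits.
total = sum(split_sizes.values())

# Share of each split as a percentage of the total, rounded to one decimal.
shares = {name: round(100 * n / total, 1) for name, n in split_sizes.items()}

for name, n in split_sizes.items():
    print(f"{name}: {n} samples ({shares[name]}%)")
print(f"total: {total}")
```

The total of 12862 records is consistent with the "about 13k" figure given in the description.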