Dataset:
AhmedSSoliman/CoNaLa
This dataset has been processed for code generation. CMU CoNaLa, the Code/Natural Language Challenge, is a joint project of the Carnegie Mellon University NeuLab and STRUDEL Lab. It was designed to test systems that generate program snippets from natural language. The full corpus is available at https://conala-corpus.github.io/ ; this version contains about 13k records drawn from the full corpus of about 600k examples.
English
A sample from this dataset looks as follows:
[
  {
    "intent": "convert a list to a dictionary in python",
    "snippet": "b = dict(zip(a[0::2], a[1::2]))"
  },
  {
    "intent": "python - sort a list of nested lists",
    "snippet": "l.sort(key=sum_nested)"
  }
]
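To make the first sample concrete, the sketch below runs its snippet on a made-up input. The list `a` and its contents are assumptions chosen for illustration; only the `dict(zip(...))` expression comes from the dataset.

```python
# Hypothetical input: a flat list that alternates keys and values.
a = ["x", 1, "y", 2, "z", 3]

# Snippet from the first sample: items at even indices become keys,
# items at odd indices become the matching values.
b = dict(zip(a[0::2], a[1::2]))

print(b)  # {'x': 1, 'y': 2, 'z': 3}
```

Note that the second sample's snippet calls `sum_nested`, which the record itself does not define, so a snippet may depend on names that must be supplied separately before it can run.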
 The dataset has the following fields (also called "features"):
{
  "intent": "Value(dtype='string', id=None)",
  "snippet": "Value(dtype='string', id=None)"
}
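As a minimal sketch of what this schema implies for a single record, the following uses only the Python standard library. The names `CoNaLaRecord` and `is_valid_record` are hypothetical helpers introduced here and are not part of the dataset or of any library.

```python
from typing import TypedDict


class CoNaLaRecord(TypedDict):
    """Hypothetical type for one record; both fields are plain strings."""

    intent: str   # natural-language description of the task
    snippet: str  # code that carries out the intent


def is_valid_record(record: dict) -> bool:
    """Return True if the record has exactly the two string fields."""
    return (
        set(record) == {"intent", "snippet"}
        and all(isinstance(value, str) for value in record.values())
    )


sample: CoNaLaRecord = {
    "intent": "convert a list to a dictionary in python",
    "snippet": "b = dict(zip(a[0::2], a[1::2]))",
}
print(is_valid_record(sample))  # True
```

A check like this can be useful before feeding records to a model, since it catches missing or non-string fields early.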
This dataset is split into train, validation, and test sets. The split sizes are as follows:
| Split name | Num samples | 
|---|---|
| train | 11125 | 
| valid | 1237 | 
| test | 500 |
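
The split sizes in the table can be checked against the "about 13k records" figure with a short calculation. This is only an arithmetic sketch over the numbers listed above; the dictionary name `split_sizes` is an assumption for illustration.

```python
# Split sizes copied from the table above.
split_sizes = {"train": 11125, "valid": 1237, "test": 500}

total = sum(split_sizes.values())
print(f"total: {total}")  # total: 12862

for name, size in split_sizes.items():
    # Share of each split as a percentage of all records.
    print(f"{name}: {size} ({size / total:.1%})")
```

The three splits add up to 12,862 records, with roughly 86.5% in train, 9.6% in validation, and 3.9% in test.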