Dataset:
code_x_glue_tt_text_to_text
CodeXGLUE text-to-text dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Text-Text/text-to-text
The dataset is crawled and filtered from Microsoft Documentation, whose documents are hosted at https://github.com/MicrosoftDocs/.
da_en, lv_en, no_en, zh_en
da_en
An example of 'test' looks as follows.
{
"id": 0,
"source": "4 . K\u00f8r modellen , og udgiv den som en webtjeneste .\n",
"target": "4 . Run the model , and publish it as a web service .\n"
}
lv_en
An example of 'train' looks as follows.
{
"id": 0,
"source": "title : Pakalpojumu objektu izveide\n",
"target": "title : Create service objects\n"
}
no_en
An example of 'validation' looks as follows.
{
"id": 0,
"source": "2 . \u00c5pne servicevaren du vil definere komponenter fra en stykkliste for .\n",
"target": "2 . Open the service item for which you want to set up components from a BOM .\n"
}
zh_en
An example of 'validation' looks as follows.
{
"id": 0,
"source": "& # 124 ; MCDUserNotificationReadStateFilterAny & # 124 ; 0 & # 124 ; \u5305\u62ec \u901a\u77e5 , \u800c \u4e0d \u8003\u8651 \u8bfb\u53d6 \u72b6\u6001 \u3002 & # 124 ;\n",
"target": "| MCDUserNotificationReadStateFilterAny | 0 | Include notifications regardless of read state . |\n"
}
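The samples above are stored as JSON with `\u` escapes, and the text is whitespace-tokenized. The zh_en source also contains `& # 124 ;`, which appears to be a tokenized form of the HTML entity `&#124;` (the target side shows `|` in the same positions). The following is a minimal sketch, under that assumption, of decoding a record and restoring such entities. The helper name `restore_entities` is hypothetical and is not part of the dataset or of any CodeXGLUE tooling.

```python
import html
import json
import re

# The zh_en validation sample from above, kept as raw JSON text.
RAW = r'''
{
  "id": 0,
  "source": "& # 124 ; MCDUserNotificationReadStateFilterAny & # 124 ; 0 & # 124 ; \u5305\u62ec \u901a\u77e5 , \u800c \u4e0d \u8003\u8651 \u8bfb\u53d6 \u72b6\u6001 \u3002 & # 124 ;\n",
  "target": "| MCDUserNotificationReadStateFilterAny | 0 | Include notifications regardless of read state . |\n"
}
'''

# Matches a numeric HTML entity whose characters were split by the tokenizer,
# e.g. "& # 124 ;" (assumption: this is what the spaced-out sequence represents).
_SPLIT_ENTITY = re.compile(r"&\s*#\s*(\d+)\s*;")


def restore_entities(text: str) -> str:
    """Re-join tokenized numeric entities and convert them to characters."""
    joined = _SPLIT_ENTITY.sub(r"&#\1;", text)
    return html.unescape(joined)


record = json.loads(RAW)  # json.loads turns \uXXXX escapes into real characters
clean_source = restore_entities(record["source"])
print(clean_source)
```

Whether to apply this kind of cleanup depends on the task: a model evaluated against the benchmark's reference files should probably see the text exactly as distributed.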
Each data field is explained below for each config. The data fields are the same across all splits.
da_en, lv_en, no_en, zh_en

| field name | type | description |
|---|---|---|
| id | int32 | The index of the sample |
| source | string | The source language version of the text |
| target | string | The target language version of the text |
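As an illustration of this schema, the sketch below models one record with a standard-library dataclass and checks the three fields against the types in the table. The class name `TranslationPair` and the validation rules are assumptions made for this example; the dataset itself does not ship such a class.

```python
from dataclasses import dataclass

# Range of a signed 32-bit integer, matching the int32 type of the id field.
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


@dataclass(frozen=True)
class TranslationPair:
    """One record: an index plus source-language and target-language text."""

    id: int
    source: str
    target: str

    def __post_init__(self) -> None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(self.id, bool) or not isinstance(self.id, int):
            raise TypeError("id must be an integer")
        if not INT32_MIN <= self.id <= INT32_MAX:
            raise ValueError("id does not fit in int32")
        if not isinstance(self.source, str) or not isinstance(self.target, str):
            raise TypeError("source and target must be strings")


# The lv_en training sample shown earlier.
pair = TranslationPair(
    id=0,
    source="title : Pakalpojumu objektu izveide\n",
    target="title : Create service objects\n",
)
print(pair.target.strip())
```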
| name | train | validation | test |
|---|---|---|---|
| da_en | 42701 | 1000 | 1000 |
| lv_en | 18749 | 1000 | 1000 |
| no_en | 44322 | 1000 | 1000 |
| zh_en | 50154 | 1000 | 1000 |
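The split sizes above can be kept in code and used as a sanity check after loading. The sketch below records the table as a dictionary and computes per-config totals. The function `check_against_hub` is a hypothetical helper that assumes the Hugging Face `datasets` library is installed and that the dataset is reachable under the name `code_x_glue_tt_text_to_text`; the name may differ depending on the hub namespace and library version.

```python
# Split sizes copied from the table above.
SPLIT_SIZES = {
    "da_en": {"train": 42701, "validation": 1000, "test": 1000},
    "lv_en": {"train": 18749, "validation": 1000, "test": 1000},
    "no_en": {"train": 44322, "validation": 1000, "test": 1000},
    "zh_en": {"train": 50154, "validation": 1000, "test": 1000},
}


def total_pairs(config: str) -> int:
    """Number of sentence pairs across all splits of one config."""
    return sum(SPLIT_SIZES[config].values())


def check_against_hub(config: str, dataset_name: str = "code_x_glue_tt_text_to_text") -> dict:
    """Compare loaded split sizes with the table (requires network access).

    The import is local so the rest of this module works without `datasets`.
    """
    from datasets import load_dataset

    ds = load_dataset(dataset_name, config)
    return {split: ds[split].num_rows == n for split, n in SPLIT_SIZES[config].items()}


total_train = sum(sizes["train"] for sizes in SPLIT_SIZES.values())
for name in SPLIT_SIZES:
    print(name, total_pairs(name))
```

Calling `check_against_hub("da_en")` would return a dictionary mapping each split name to `True` if its row count matches the table.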
https://github.com/microsoft, https://github.com/madlag
Computational Use of Data Agreement (C-UDA) License.
@article{CodeXGLUE,
  title={CodeXGLUE: A Benchmark Dataset and Open Challenge for Code Intelligence},
  year={2020}
}
Thanks to @madlag (and partly also @ncoop57) for adding this dataset.