Dataset:

deepmind/code_contests

Tasks:

translation

Languages:

Multilinguality:

monolingual

Size categories:

10K<n<100K

Language creators:

found

Annotation creators:

found

Source datasets:

original

ArXiv:

arxiv:2203.07814 arxiv:2105.12655

License:

cc-by-4.0

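Dataset Card for CodeContests

Dataset Summary

CodeContests is a competitive programming dataset for machine learning. It was used to train AlphaCode.

It consists of programming problems from a variety of sources: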

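Site	URL	Source
Aizu	https://judge.u-aizu.ac.jp	CodeNet
AtCoder	https://atcoder.jp	CodeNet
CodeChef	https://www.codechef.com	Description2Code
Codeforces	https://codeforces.com	Description2Code and Codeforces
HackerEarth	https://www.hackerearth.com	Description2Code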

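Each problem includes test cases in the form of paired inputs and outputs, as well as both correct and incorrect human solutions in a variety of programming languages.

Supported Tasks and Leaderboards

translation - The competitive programming code generation problem can be viewed as a sequence-to-sequence translation task: given a problem description 𝑋 in natural language, produce a corresponding solution 𝑌 in a programming language. The metric used for evaluation is "percentage of problems solved using 𝑛 submissions from 𝑘 samples per problem", denoted as 𝑛@𝑘. More information on the evaluation of AlphaCode can be found in Section 2.2 and Appendix A.3 of the paper. The leaderboard for this task is available here.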
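As a rough illustration of the 𝑛@𝑘 metric, the sketch below computes the percentage of problems solved when at most n of the k samples drawn per problem are submitted. The function name `n_at_k`, the input layout (one list of pass/fail booleans per problem, ordered by submission priority), and the toy outcomes are assumptions made for this example. The paper's procedure for choosing which samples to submit is not reproduced here.

```python
from typing import Sequence


def n_at_k(results: Sequence[Sequence[bool]], n: int) -> float:
    """Percentage of problems solved by at least one of the first n submissions.

    results[i][j] is True if the j-th candidate for problem i passed all
    hidden tests. Each inner sequence is assumed to be ordered by
    submission priority and to hold at most k entries.
    """
    if not results:
        raise ValueError("no problems to evaluate")
    solved = sum(1 for per_problem in results if any(per_problem[:n]))
    return 100.0 * solved / len(results)


# Hypothetical outcomes for three problems with k = 4 samples each.
example = [
    [False, True, False, False],
    [False, False, False, False],
    [False, False, False, True],
]
print(round(n_at_k(example, n=2), 1))  # one of three problems solved
print(round(n_at_k(example, n=4), 1))  # two of three problems solved
```

Raising n can only keep the score the same or increase it, since every sample counted at a smaller n is still counted at a larger one.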

Languages

English.

Dataset Structure

Data Instances

A data point corresponds to a single contest problem:

{
  'name': '76_B. Mice',
  'description': 'Modern researches has shown that a flock of hungry mice '
                 'searching for a piece of...',
  'public_tests': {'input': ['3 2 0 2\n0 1 3\n2 5\n'], 'output': ['1\n']},
  'private_tests': {'input': ['20 18 1 2\n'
                              '-9999944 -9999861 -9999850 -9999763 -9999656 '
                              '-9999517 -9999375 -999927...',
                              ...,
                              '7 11 10 20\n'
                              '6 18 32 63 66 68 87\n'
                              '6 8 15 23 25 41 53 59 60 75 90\n'],
                    'output': ['2\n', ..., '1\n']},
  'generated_tests': {'input': ['7 11 10 5\n'
                                '6 18 32 63 66 68 87\n'
                                '6 8 15 23 25 41 53 59 60 75 90\n',
                                ...,
                                '7 11 10 4\n'
                                '6 18 46 63 85 84 87\n'
                                '6 8 15 18 25 41 53 59 60 75 90\n'],
                      'output': ['1\n', ..., '2\n']},
  'source': 2,
  'difficulty': 8,
  'solutions': {'language': [2, ..., 2],
                'solution': ['#include <bits/stdc++.h>\n'
                             'using namespace std;\n'
                             'int n, m;\n'
                             'int data[2][100010], t[1...',
                             ...,
                             '#include <bits/stdc++.h>\n'
                             'using namespace std;\n'
                             'int n, m, pos[100100], food[100100...']},
  'incorrect_solutions': {'language': [2, ..., 2],
                          'solution': ['#include <bits/stdc++.h>\n'
                                       'using namespace std;\n'
                                       'vector<pair<int, int> > v[100010];...',
                                       ...,
                                       '#include <bits/stdc++.h>\n'
                                       'using namespace std;\n'
                                       'vector<pair<int, int> > v[100010];...']},
  'cf_contest_id': 76,
  'cf_index': 'B',
  'cf_points': 0.0,
  'cf_rating': 2100,
  'cf_tags': ['greedy', 'two pointers'],
  'is_description_translated': False,
  'untranslated_description': '',
  'time_limit': {'seconds': 0, 'nanos': 500000000},
  'memory_limit_bytes': 256000000,
  'input_file': '',
  'output_file': ''
}
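The test fields in the instance above store inputs and outputs as two parallel lists. A minimal sketch of how such a pair of lists might be used to check a candidate solution follows. The helper `passes_tests`, the toy problem (printing the sum of two integers), and the token-by-token comparison are assumptions made for this example. A real judge runs compiled programs under time and memory limits and may use problem-specific checkers.

```python
from typing import Callable, Dict, List


def passes_tests(solve: Callable[[str], str], tests: Dict[str, List[str]]) -> bool:
    """Return True if solve(input) matches the expected output on every test.

    Outputs are compared token by token, so trailing whitespace and
    newlines are ignored. This is a simplification of real judging.
    """
    for given, expected in zip(tests["input"], tests["output"]):
        if solve(given).split() != expected.split():
            return False
    return True


# Toy record in the same layout as 'public_tests' (hypothetical problem).
toy_public_tests = {"input": ["1 2\n", "10 -3\n"], "output": ["3\n", "7\n"]}


def add_two(stdin_text: str) -> str:
    """Candidate solution: read two integers and print their sum."""
    a, b = map(int, stdin_text.split())
    return f"{a + b}\n"


print(passes_tests(add_two, toy_public_tests))
```

Modelling a solution as a function from an input string to an output string keeps the example self-contained. Evaluating the stored C++, Java, or Python programs would instead require running them in a sandboxed subprocess.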

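Data Fields

name: The name of the contest. Note that names may coincide across different sources.
description: A natural language description of a programming problem.
public_tests: Tests that are available before submitting a solution, typically as part of the description itself. Represented as paired inputs and outputs that can be used to test potential solutions. They are therefore acceptable inputs to a model.
private_tests: Tests that are not visible before submitting a solution, so they should not be made available as inputs to a model.
generated_tests: Tests generated automatically by modifying inputs from public and private tests and validating the results with known correct solutions.
source: The original source of the problem, with possible values including UNKNOWN_SOURCE (0), CODECHEF (1), CODEFORCES (2), HACKEREARTH (3), CODEJAM (4), ATCODER (5) and AIZU (6).
difficulty: A representation of the difficulty of the problem, with possible values including UNKNOWN_DIFFICULTY (0), EASY (1), MEDIUM (2), HARD (3), HARDER (4), HARDEST (5), EXTERNAL (6), A (7), B (8), C (9), D (10), E (11), F (12), G (13), H (14), I (15), J (16), K (17), L (18), M (19), N (20), O (21), P (22), Q (23), R (24), S (25), T (26), U (27) and V (28). Note that different sources use different, non-comparable gradings. For Codeforces problems, cf_rating is a more reliable measure of difficulty when available.
solutions: Correct solutions to the problem. Contrast with incorrect_solutions below.
incorrect_solutions: Incorrect solutions.
cf_contest_id: The contest ID. Note that contest IDs are not monotonic with respect to time.
cf_index: Problem index, e.g. "A" or "B" or "C".
cf_points: Points for the problem, e.g. 1000.0
cf_rating: Problem rating (difficulty), e.g. 1100
cf_tags: Problem tags, e.g. ['greedy', 'math']
is_description_translated: Whether the problem description was translated to English.
untranslated_description: The untranslated description, available only for translated problems.
time_limit: The time limit constraint to use when executing solutions. Represented as a dictionary with two keys, seconds and nanos. This field is None if not defined.
memory_limit_bytes: The memory limit constraint to use when executing solutions.
input_file: Most problems use stdin for IO. Some problems expect a specific file to be used instead of stdin.
output_file: Most problems use stdout for IO. Some problems expect a specific file to be used instead of stdout.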
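The integer codes in source and difficulty and the two-key time_limit dictionary can be decoded with a few lines of Python. The sketch below uses only the values listed in this card. The names `Source`, `DIFFICULTY_NAMES`, and `time_limit_seconds` are chosen for this example and are not part of the dataset or of any library.

```python
from enum import IntEnum
from typing import Dict, Optional


class Source(IntEnum):
    """Values of the source field, as listed in this card."""
    UNKNOWN_SOURCE = 0
    CODECHEF = 1
    CODEFORCES = 2
    HACKEREARTH = 3
    CODEJAM = 4
    ATCODER = 5
    AIZU = 6


# Values of the difficulty field, indexed by integer code (0 through 28).
DIFFICULTY_NAMES = [
    "UNKNOWN_DIFFICULTY", "EASY", "MEDIUM", "HARD", "HARDER", "HARDEST",
    "EXTERNAL",
] + list("ABCDEFGHIJKLMNOPQRSTUV")


def time_limit_seconds(time_limit: Optional[Dict[str, int]]) -> Optional[float]:
    """Convert {'seconds': s, 'nanos': n} to float seconds; None stays None."""
    if time_limit is None:
        return None
    return time_limit["seconds"] + time_limit["nanos"] / 1e9


# Values taken from the '76_B. Mice' instance shown in this card.
record = {"source": 2, "difficulty": 8,
          "time_limit": {"seconds": 0, "nanos": 500000000}}
print(Source(record["source"]).name)             # CODEFORCES
print(DIFFICULTY_NAMES[record["difficulty"]])    # B
print(time_limit_seconds(record["time_limit"]))  # 0.5
```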

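All tests are represented as paired inputs and outputs that can be used to test potential solutions. Each solution comprises a language, with possible values including UNKNOWN_LANGUAGE (0), PYTHON (1) (solutions written in Python 2), CPP (2), PYTHON3 (3) and JAVA (4), and a solution string written in that language. The fields prefixed with cf_ denote extra metadata for Codeforces problems.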
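Because solutions and incorrect_solutions store languages and source strings as two parallel lists, picking out the solutions written in one language amounts to zipping the lists together. The following is a minimal sketch. The helper `solutions_in`, the dictionary `LANGUAGE_NAMES`, and the toy record are assumptions for this example, while the language codes are those listed in this card.

```python
from typing import Dict, List

# Language codes as listed in this card.
LANGUAGE_NAMES = {
    0: "UNKNOWN_LANGUAGE",
    1: "PYTHON",   # solutions written in Python 2
    2: "CPP",
    3: "PYTHON3",
    4: "JAVA",
}


def solutions_in(solutions: Dict[str, List], language_code: int) -> List[str]:
    """Return the solution strings whose language matches language_code."""
    return [
        code
        for lang, code in zip(solutions["language"], solutions["solution"])
        if lang == language_code
    ]


# Toy record in the same layout as the 'solutions' field (hypothetical content).
toy_solutions = {
    "language": [2, 3, 2],
    "solution": ["int main() {}", "print(1)", "int main() { return 0; }"],
}

for code in (2, 3):
    found = solutions_in(toy_solutions, code)
    print(LANGUAGE_NAMES[code], len(found))
```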

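Data Splits

The data is split into training, validation and test sets. The training set contains 13328 samples, the validation set 117 samples, and the test set 165 samples.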

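Dataset Creation

Curation Rationale

This dataset was created for fine-tuning AlphaCode models:

Models pre-trained on GitHub can generate good code and solve simple programming problems, but as shown in Appendix B.3 they can solve very few competitive programming problems. Fine-tuning the model on a dedicated competitive programming dataset is critical for performance.

Source Data

Initial Data Collection and Normalization

Information on the data collection and normalization procedures can be found in Section 3.2 and Appendix B.2 of the paper.

Who are the source language producers?

The problems were scraped from the following platforms: Aizu, AtCoder, CodeChef, Codeforces and HackerEarth. Additionally, some data from the existing public competitive programming datasets Description2Code (Caballero et al., 2016) and CodeNet (Puri et al., 2021) is mixed into the training set.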

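Annotations

Annotation process

The solutions were scraped alongside the problem descriptions.

Who are the annotators?

Same as the source data creators.

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]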

Additional Information

Dataset Curators

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d'Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu and Oriol Vinyals.

Licensing Information

This dataset is made available under the terms of the CC BY 4.0 license (Creative Commons Attribution 4.0 International license).

Additional acknowledged contributions:

Codeforces materials are sourced from http://codeforces.com.
Description2Code materials are sourced from: Description2Code Dataset, licensed under the MIT open source license, copyright not specified.
CodeNet materials are sourced from: Project_CodeNet, licensed under Apache 2.0, copyright not specified.

Citation Information

@article{li2022competition,
  title={Competition-Level Code Generation with AlphaCode},
  author={Li, Yujia and Choi, David and Chung, Junyoung and Kushman, Nate and
    Schrittwieser, Julian and Leblond, R{\'e}mi and Eccles, Tom and
    Keeling, James and Gimeno, Felix and Dal Lago, Agustin and
    Hubert, Thomas and Choy, Peter and de Masson d'Autume, Cyprien and
    Babuschkin, Igor and Chen, Xinyun and Huang, Po-Sen and Welbl, Johannes and
    Gowal, Sven and Cherepanov, Alexey and Molloy, James and
    Mankowitz, Daniel and Sutherland Robson, Esme and Kohli, Pushmeet and
    de Freitas, Nando and Kavukcuoglu, Koray and Vinyals, Oriol},
  journal={arXiv preprint arXiv:2203.07814},
  year={2022}
}

Contributions

Thanks to @mariosasko for adding this dataset.

Author:

deepmind

Dataset size:

7.1 GB