Dataset: deepmind/code_contests
Tasks:
Languages:
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: found
Annotations creators: found
Source datasets: original
License:
CodeContests is a competitive programming dataset for machine learning. This dataset was used when training AlphaCode.
It consists of programming problems, from a variety of sources:
Site | URL | Source |
---|---|---|
Aizu | https://judge.u-aizu.ac.jp | CodeNet |
AtCoder | https://atcoder.jp | CodeNet |
CodeChef | https://www.codechef.com | description2code |
Codeforces | https://codeforces.com | description2code and Codeforces |
HackerEarth | https://www.hackerearth.com | description2code |
Problems include test cases in the form of paired inputs and outputs, as well as both correct and incorrect human solutions in a variety of languages.
English.
A data point corresponds to a single contest problem:
    {
        'name': '76_B. Mice',
        'description': 'Modern researches has shown that a flock of hungry mice '
                       'searching for a piece of...',
        'public_tests': {'input': ['3 2 0 2\n0 1 3\n2 5\n'], 'output': ['1\n']},
        'private_tests': {'input': ['20 18 1 2\n'
                                    '-9999944 -9999861 -9999850 -9999763 -9999656 '
                                    '-9999517 -9999375 -999927...',
                                    ...,
                                    '7 11 10 20\n'
                                    '6 18 32 63 66 68 87\n'
                                    '6 8 15 23 25 41 53 59 60 75 90\n'],
                          'output': ['2\n', ..., '1\n']},
        'generated_tests': {'input': ['7 11 10 5\n'
                                      '6 18 32 63 66 68 87\n'
                                      '6 8 15 23 25 41 53 59 60 75 90\n',
                                      ...,
                                      '7 11 10 4\n'
                                      '6 18 46 63 85 84 87\n'
                                      '6 8 15 18 25 41 53 59 60 75 90\n'],
                            'output': ['1\n', ..., '2\n']},
        'source': 2,
        'difficulty': 8,
        'solutions': {'language': [2, ..., 2],
                      'solution': ['#include <bits/stdc++.h>\n'
                                   'using namespace std;\n'
                                   'int n, m;\n'
                                   'int data[2][100010], t[1...',
                                   ...,
                                   '#include <bits/stdc++.h>\n'
                                   'using namespace std;\n'
                                   'int n, m, pos[100100], food[100100...']},
        'incorrect_solutions': {'language': [2, ..., 2],
                                'solution': ['#include <bits/stdc++.h>\n'
                                             'using namespace std;\n'
                                             'vector<pair<int, int> > v[100010];...',
                                             ...,
                                             '#include <bits/stdc++.h>\n'
                                             'using namespace std;\n'
                                             'vector<pair<int, int> > v[100010];...']},
        'cf_contest_id': 76,
        'cf_index': 'B',
        'cf_points': 0.0,
        'cf_rating': 2100,
        'cf_tags': ['greedy', 'two pointers'],
        'is_description_translated': False,
        'untranslated_description': '',
        'time_limit': {'seconds': 0, 'nanos': 500000000},
        'memory_limit_bytes': 256000000,
        'input_file': '',
        'output_file': ''
    }
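As a rough sketch of how the paired tests in such a record might be used, the Python snippet below checks a candidate against `public_tests` and converts `time_limit` to seconds. The helper names `time_limit_seconds` and `passes_tests` and the candidate `echo_one` are hypothetical and not part of the dataset; `echo_one` is a placeholder that always answers `1` and does not actually solve the problem. Ignoring trailing whitespace when comparing outputs is also an assumption, since the card does not specify a comparison rule.

```python
from typing import Callable, Dict, List

# A trimmed record using only values shown in the example instance.
record = {
    "name": "76_B. Mice",
    "public_tests": {"input": ["3 2 0 2\n0 1 3\n2 5\n"], "output": ["1\n"]},
    "time_limit": {"seconds": 0, "nanos": 500000000},
}


def time_limit_seconds(time_limit: Dict[str, int]) -> float:
    """Combine the 'seconds' and 'nanos' parts into a single float."""
    return time_limit["seconds"] + time_limit["nanos"] / 1e9


def passes_tests(solve: Callable[[str], str], tests: Dict[str, List[str]]) -> bool:
    """Return True if solve(input) matches the expected output for every pair.

    Assumption: trailing whitespace is ignored when comparing outputs.
    """
    pairs = zip(tests["input"], tests["output"])
    return all(solve(inp).rstrip() == out.rstrip() for inp, out in pairs)


def echo_one(stdin_text: str) -> str:
    """Hypothetical placeholder candidate: always answers 1 (not a real solution)."""
    return "1\n"


print(time_limit_seconds(record["time_limit"]))        # 0.5
print(passes_tests(echo_one, record["public_tests"]))  # True
```

In practice a candidate would be a compiled or interpreted program run in a sandbox under the time and memory limits; the in-process callable here only illustrates the input/output pairing.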
All tests are represented as paired inputs and outputs that can be used to test potential solutions. Each solution comprises a language code and a solution string written in that language. Possible values for language include UNKNOWN_LANGUAGE (0), PYTHON (1) (solutions written in Python 2), CPP (2), PYTHON3 (3) and JAVA (4). The fields prefixed with cf_ denote extra metadata for Codeforces problems.
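The integer language codes can be made more readable with a small enum. The sketch below assumes the `solutions` field has the parallel-list layout shown in the example instance; the `Language` class and the `solutions_in` helper are hypothetical names introduced here for illustration, and `toy_solutions` is made-up data, not taken from the dataset.

```python
from enum import IntEnum
from typing import Dict, List


class Language(IntEnum):
    """Integer codes listed in the card for the 'language' field."""
    UNKNOWN_LANGUAGE = 0
    PYTHON = 1  # solutions written in Python 2
    CPP = 2
    PYTHON3 = 3
    JAVA = 4


def solutions_in(solutions: Dict[str, List], language: Language) -> List[str]:
    """Return the solution strings whose parallel language code matches."""
    return [
        code
        for lang, code in zip(solutions["language"], solutions["solution"])
        if lang == language
    ]


# Made-up toy data for illustration; not taken from the dataset.
toy_solutions = {
    "language": [2, 3, 2],
    "solution": ["// cpp A", "print('py3')", "// cpp B"],
}

print(solutions_in(toy_solutions, Language.CPP))  # ['// cpp A', '// cpp B']
print(Language(3).name)                           # PYTHON3
```

The same helper would apply unchanged to `incorrect_solutions`, which uses the same two parallel lists.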
The data is split into training, validation and test sets. The training set contains 13328 samples, the validation set 117 samples and the test set 165 samples.
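To put these sizes in proportion, the following short calculation derives the total number of problems and each split's share from the counts stated above. The dictionary keys are descriptive labels chosen here and are not necessarily the split names used when loading the dataset.

```python
# Split sizes as stated in the card.
split_sizes = {"training": 13328, "validation": 117, "test": 165}

total = sum(split_sizes.values())
shares = {name: size / total for name, size in split_sizes.items()}

print(total)  # 13610
for name, share in shares.items():
    print(f"{name}: {share:.2%}")
```

The validation and test sets are small relative to the training set, so evaluation numbers computed on them rest on a few hundred problems at most.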
This dataset was created for fine-tuning AlphaCode models:
Models pre-trained on GitHub can generate good code and solve simple programming problems, but as shown in Appendix B.3 they can solve very few competitive programming problems. Fine-tuning the model on a dedicated competitive programming dataset is critical for performance.
Information on the data collection and normalization procedures can be found in Section 3.2 and Appendix B.2 of the paper.
Who are the source language producers?

The problems are scraped from the following platforms: Aizu, AtCoder, CodeChef, Codeforces and HackerEarth. Additionally, some data from the existing public competitive programming datasets Description2Code (Caballero et al., 2016) and CodeNet (Puri et al., 2021) is mixed into the training set.

The solutions are scraped alongside the problem descriptions.

Who are the annotators?

Same as the source data creators.
Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d'Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu and Oriol Vinyals.
This dataset is made available under the terms of the CC BY 4.0 license (Creative Commons Attribution 4.0 International license).
Additional acknowledged contributions:
    @article{li2022competition,
      title={Competition-Level Code Generation with AlphaCode},
      author={Li, Yujia and Choi, David and Chung, Junyoung and Kushman, Nate and
              Schrittwieser, Julian and Leblond, R{\'e}mi and Eccles, Tom and
              Keeling, James and Gimeno, Felix and Dal Lago, Agustin and
              Hubert, Thomas and Choy, Peter and de Masson d'Autume, Cyprien and
              Babuschkin, Igor and Chen, Xinyun and Huang, Po-Sen and
              Welbl, Johannes and Gowal, Sven and Cherepanov, Alexey and
              Molloy, James and Mankowitz, Daniel and Sutherland Robson, Esme and
              Kohli, Pushmeet and de Freitas, Nando and Kavukcuoglu, Koray and
              Vinyals, Oriol},
      journal={arXiv preprint arXiv:2203.07814},
      year={2022}
    }
Thanks to @mariosasko for adding this dataset.