数据集:
commonsense_qa
任务:
子任务:
open-domain-qa语言:
计算机处理:
monolingual大小:
1K<n<10K语言创建人:
crowdsourced批注创建人:
crowdsourced源数据集:
original预印本库:
arxiv:1811.00937许可:
CommonsenseQA is a new multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers . It contains 12,102 questions with one correct answer and four distractor answers. The dataset is provided in two major training/validation/testing set splits: "Random split" which is the main evaluation split, and "Question token split", see paper for details.
The dataset is in English ( en ).
An example of 'train' looks as follows:
{'id': '075e483d21c29a511267ef62bedc0461',
'question': 'The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?',
'question_concept': 'punishing',
'choices': {'label': ['A', 'B', 'C', 'D', 'E'],
'text': ['ignore', 'enforce', 'authoritarian', 'yell at', 'avoid']},
'answerKey': 'A'}
The data fields are the same among all splits.
default| name | train | validation | test |
|---|---|---|---|
| default | 9741 | 1221 | 1140 |
The dataset is licensed under the MIT License.
See: https://github.com/jonathanherzig/commonsenseqa/issues/5
@inproceedings{talmor-etal-2019-commonsenseqa,
title = "{C}ommonsense{QA}: A Question Answering Challenge Targeting Commonsense Knowledge",
author = "Talmor, Alon and
Herzig, Jonathan and
Lourie, Nicholas and
Berant, Jonathan",
booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
month = jun,
year = "2019",
address = "Minneapolis, Minnesota",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/N19-1421",
doi = "10.18653/v1/N19-1421",
pages = "4149--4158",
archivePrefix = "arXiv",
eprint = "1811.00937",
primaryClass = "cs",
}
Thanks to @thomwolf , @lewtun , @albertvillanova , @patrickvonplaten for adding this dataset.