数据集:
hotpot_qa
任务:
语言:
计算机处理:
monolingual大小:
100K<n<1M语言创建人:
found批注创建人:
crowdsourced源数据集:
original预印本库:
arxiv:1809.09600其他:
multi-hop许可:
HotpotQA is a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowingQA systems to reason with strong supervision and explain the predictions; (4) we offer a new type of factoid comparison questions to test QA systems’ ability to extract relevant facts and perform necessary comparison.
An example of 'validation' looks as follows.
{
"answer": "This is the answer",
"context": {
"sentences": [["Sent 1"], ["Sent 21", "Sent 22"]],
"title": ["Title1", "Title 2"]
},
"id": "000001",
"level": "medium",
"question": "What is the answer?",
"supporting_facts": {
"sent_id": [0, 1, 3],
"title": ["Title of para 1", "Title of para 2", "Title of para 3"]
},
"type": "comparison"
}
fullwiki
An example of 'train' looks as follows.
{
"answer": "This is the answer",
"context": {
"sentences": [["Sent 1"], ["Sent 2"]],
"title": ["Title1", "Title 2"]
},
"id": "000001",
"level": "hard",
"question": "What is the answer?",
"supporting_facts": {
"sent_id": [0, 1, 3],
"title": ["Title of para 1", "Title of para 2", "Title of para 3"]
},
"type": "bridge"
}
The data fields are the same among all splits.
distractor| train | validation | |
|---|---|---|
| distractor | 90447 | 7405 |
| train | validation | test | |
|---|---|---|---|
| fullwiki | 90447 | 7405 | 7405 |
HotpotQA is distributed under a CC BY-SA 4.0 License .
@inproceedings{yang2018hotpotqa,
title={{HotpotQA}: A Dataset for Diverse, Explainable Multi-hop Question Answering},
author={Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William W. and Salakhutdinov, Ruslan and Manning, Christopher D.},
booktitle={Conference on Empirical Methods in Natural Language Processing ({EMNLP})},
year={2018}
}
Thanks to @albertvillanova , @ghomasHudson for adding this dataset.