Datasets:
ms_marco
Languages:
ArXiv:
arxiv:1611.09268
Starting with a paper released at NIPS 2016, MS MARCO is a collection of datasets focused on deep learning in search.
The first dataset was a question answering dataset featuring 100,000 real Bing questions, each with a human-generated answer. Since then we have released a 1,000,000 question dataset, a natural language generation dataset, a passage ranking dataset, a keyphrase extraction dataset, a crawling dataset, and a conversational search dataset.
There have been 277 submissions: 20 KeyPhrase Extraction submissions, 87 passage ranking submissions, 0 document ranking submissions, 73 QnA V2 submissions, 82 NLGEN submissions, and 15 QnA V1 submissions.
This data comes in three tasks/forms: Original QnA dataset (v1.1), Question Answering (v2.1), and Natural Language Generation (v2.1).
The original question answering dataset featured 100,000 examples and was released in 2016. The leaderboard is now closed, but the data is available below.
The current competitive tasks are Question Answering and Natural Language Generation. Question Answering features over 1,000,000 queries and is much like the original QnA dataset, but larger and of higher quality. The Natural Language Generation dataset features 180,000 examples and builds upon the QnA dataset to deliver answers that could be spoken by a smart speaker.
version v1.1
An example of 'train' looks as follows.
v2.1
An example of 'validation' looks as follows.
The data fields are the same among all splits.
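As a rough illustration of how the two configurations might be loaded, the sketch below assumes that the Hugging Face `datasets` library is installed and that the dataset is published under the identifier `ms_marco` with configurations named `v1.1` and `v2.1`, as the names on this card suggest. The helper `load_ms_marco` and the `KNOWN_VERSIONS` tuple are hypothetical names introduced only for this example.

```python
KNOWN_VERSIONS = ("v1.1", "v2.1")  # configuration names taken from this card


def load_ms_marco(version="v1.1", split="train"):
    """Load one split of one MS MARCO configuration (hypothetical helper)."""
    if version not in KNOWN_VERSIONS:
        raise ValueError(
            f"unknown version {version!r}; expected one of {KNOWN_VERSIONS}"
        )
    # Imported lazily so the version check works without the library installed.
    from datasets import load_dataset

    # Assumption: the dataset identifier is "ms_marco", as listed on this card.
    return load_dataset("ms_marco", version, split=split)


# Example usage (downloads the data on first call):
# ds = load_ms_marco("v1.1", split="validation")
# print(ds.column_names)  # the same fields are expected in every split
# print(ds[0])            # inspect a single record
```

Checking the version string before calling the library keeps a typo such as `v1` from turning into a less obvious error further down.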
name | train | validation | test |
---|---|---|---|
v1.1 | 82326 | 10047 | 9650 |
v2.1 | 808731 | 101093 | 101092 |
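The split sizes in the table can be cross-checked against the headline figures in the description with a few lines of arithmetic. This is a small sketch that only uses the numbers from the table; the names `SPLITS` and `summarize` are made up for the example.

```python
# Split sizes copied from the table above.
SPLITS = {
    "v1.1": {"train": 82326, "validation": 10047, "test": 9650},
    "v2.1": {"train": 808731, "validation": 101093, "test": 101092},
}


def summarize(splits):
    """Return (total, {split: fraction of total}) for one version."""
    total = sum(splits.values())
    return total, {name: count / total for name, count in splits.items()}


for version, splits in SPLITS.items():
    total, fractions = summarize(splits)
    shares = ", ".join(f"{name} {frac:.1%}" for name, frac in fractions.items())
    print(f"{version}: {total} examples ({shares})")
```

The totals come to 102,023 examples for v1.1 and 1,010,916 for v2.1, consistent with the roughly 100,000 and 1,000,000 questions quoted in the description, and both versions follow an approximately 80/10/10 split.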
@article{DBLP:journals/corr/NguyenRSGTMD16,
  author        = {Tri Nguyen and Mir Rosenberg and Xia Song and Jianfeng Gao and Saurabh Tiwary and Rangan Majumder and Li Deng},
  title         = {{MS} {MARCO:} {A} Human Generated MAchine Reading COmprehension Dataset},
  journal       = {CoRR},
  volume        = {abs/1611.09268},
  year          = {2016},
  url           = {http://arxiv.org/abs/1611.09268},
  archivePrefix = {arXiv},
  eprint        = {1611.09268},
  timestamp     = {Mon, 13 Aug 2018 16:49:03 +0200},
  biburl        = {https://dblp.org/rec/journals/corr/NguyenRSGTMD16.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}
Thanks to @mariamabarham, @thomwolf, and @lewtun for adding this dataset.