Dataset:

ms_marco

Language:

Preprint archive:

arxiv:1611.09268

Dataset Card for "ms_marco"

Dataset Summary

Starting with a paper released at NIPS 2016, MS MARCO is a collection of datasets focused on deep learning in search.

The first dataset was a question answering dataset featuring 100,000 real Bing questions, each with a human-generated answer. Since then we have released a 1,000,000-question dataset, a natural language generation dataset, a passage ranking dataset, a keyphrase extraction dataset, a crawling dataset, and a conversational search dataset.

There have been 277 submissions: 20 KeyPhrase Extraction submissions, 87 passage ranking submissions, 0 document ranking submissions, 73 QnA V2 submissions, 82 NLGEN submissions, and 15 QnA V1 submissions.

This data comes in three tasks/forms: Original QnA dataset (v1.1), Question Answering (v2.1), and Natural Language Generation (v2.1).

The original question answering dataset featured 100,000 examples and was released in 2016. The leaderboard is now closed, but the data is available below.

The current competitive tasks are Question Answering and Natural Language Generation. Question Answering features over 1,000,000 queries and is much like the original QnA dataset, but larger and of higher quality. The Natural Language Generation dataset features 180,000 examples and builds upon the QnA dataset to deliver answers that could be spoken by a smart speaker.
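As a rough illustration of how these versions might be loaded, the sketch below wraps the Hugging Face `datasets` library. It assumes the dataset is published under the identifier "ms_marco" with configurations named "v1.1" and "v2.1", as this card suggests; the helper `load_ms_marco` and the constant `SUPPORTED_CONFIGS` are hypothetical names introduced here, not part of any library.

```python
# Configuration names taken from this card; treat them as assumptions.
SUPPORTED_CONFIGS = ("v1.1", "v2.1")


def load_ms_marco(config: str = "v1.1", split: str = "train"):
    """Load one split of one MS MARCO configuration (hypothetical helper)."""
    if config not in SUPPORTED_CONFIGS:
        raise ValueError(
            f"unknown config {config!r}; expected one of {SUPPORTED_CONFIGS}"
        )
    # Imported lazily so the argument check above works even when the
    # `datasets` package is not installed.
    from datasets import load_dataset

    # Downloads the data on first use; v2.1 is several gigabytes on disk.
    return load_dataset("ms_marco", config, split=split)
```

Calling `load_ms_marco("v1.1", "validation")` would then return the validation split of the original QnA data, provided the identifier and configuration names still resolve on the hub.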

version v1.1

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure

Data Instances

v1.1

Size of downloaded dataset files: 168.69 MB
Size of the generated dataset: 434.61 MB
Total amount of disk used: 603.31 MB

An example of 'train' looks as follows.

v2.1

Size of downloaded dataset files: 1.38 GB
Size of the generated dataset: 4.29 GB
Total amount of disk used: 5.67 GB

An example of 'validation' looks as follows.

Data Fields

The data fields are the same among all splits.

v1.1

answers : a list of string features.
passages : a dictionary feature containing:
- is_selected : an int32 feature.
- passage_text : a string feature.
- url : a string feature.
query : a string feature.
query_id : an int32 feature.
query_type : a string feature.
wellFormedAnswers : a list of string features.

v2.1

answers : a list of string features.
passages : a dictionary feature containing:
- is_selected : an int32 feature.
- passage_text : a string feature.
- url : a string feature.
query : a string feature.
query_id : an int32 feature.
query_type : a string feature.
wellFormedAnswers : a list of string features.
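The field list above can be written down as Python types, which makes the nesting of passages easier to see. This is a sketch under two assumptions that the card does not state: that each key of passages holds a parallel list with one entry per candidate passage, and that is_selected equal to 1 marks a passage chosen as supporting the answer. The names `Passages`, `MarcoRecord`, `selected_passages`, and every value in `example` are made up for illustration and are not taken from the dataset.

```python
from typing import List, TypedDict


class Passages(TypedDict):
    # Assumed to be parallel lists, one entry per candidate passage.
    is_selected: List[int]
    passage_text: List[str]
    url: List[str]


class MarcoRecord(TypedDict):
    answers: List[str]
    passages: Passages
    query: str
    query_id: int
    query_type: str
    wellFormedAnswers: List[str]


def selected_passages(record: MarcoRecord) -> List[str]:
    """Return the text of every passage whose is_selected flag is 1."""
    p = record["passages"]
    return [
        text
        for flag, text in zip(p["is_selected"], p["passage_text"])
        if flag == 1
    ]


# A hand-written placeholder record, NOT real data from MS MARCO.
example: MarcoRecord = {
    "answers": ["placeholder answer"],
    "passages": {
        "is_selected": [0, 1],
        "passage_text": ["first placeholder passage", "second placeholder passage"],
        "url": ["https://example.com/a", "https://example.com/b"],
    },
    "query": "placeholder query",
    "query_id": 0,
    "query_type": "description",
    "wellFormedAnswers": [],
}

print(selected_passages(example))
```

Since v1.1 and v2.1 list the same fields, the same types would apply to both versions if the assumptions hold.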

Data Splits

name	train	validation	test
v1.1	82326	10047	9650
v2.1	808731	101093	101092
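To put the split sizes in proportion, the short sketch below totals each row of the table and computes the share of each split. The counts are copied from the table; the names `SPLITS` and `split_fractions` are introduced here for illustration only.

```python
# Example counts per split, copied from the Data Splits table.
SPLITS = {
    "v1.1": {"train": 82326, "validation": 10047, "test": 9650},
    "v2.1": {"train": 808731, "validation": 101093, "test": 101092},
}


def split_fractions(counts):
    """Return each split's share of the total number of examples."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


for version, counts in SPLITS.items():
    total = sum(counts.values())
    shares = ", ".join(
        f"{name}={share:.1%}" for name, share in split_fractions(counts).items()
    )
    print(f"{version}: {total} examples ({shares})")
```

Both versions come out at roughly an 80/10/10 division between train, validation, and test.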

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Additional Information

Dataset Curators

More Information Needed

Licensing Information

More Information Needed

Citation Information

@article{DBLP:journals/corr/NguyenRSGTMD16,
  author    = {Tri Nguyen and
               Mir Rosenberg and
               Xia Song and
               Jianfeng Gao and
               Saurabh Tiwary and
               Rangan Majumder and
               Li Deng},
  title     = {{MS} {MARCO:} {A} Human Generated MAchine Reading COmprehension Dataset},
  journal   = {CoRR},
  volume    = {abs/1611.09268},
  year      = {2016},
  url       = {http://arxiv.org/abs/1611.09268},
  archivePrefix = {arXiv},
  eprint    = {1611.09268},
  timestamp = {Mon, 13 Aug 2018 16:49:03 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/NguyenRSGTMD16.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Contributions

Thanks to @mariamabarham, @thomwolf, @lewtun for adding this dataset.

Author:

Anonymous

Dataset size:

26.36 KB