Datasets:

race

Languages:

en

Multilinguality:

monolingual

Size Categories:

10K<n<100K

Language Creators:

found

Annotations Creators:

expert-generated

Source Datasets:

original

ArXiv:

arxiv:1704.04683

License:

other

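Dataset Card for "race"

Dataset Summary

RACE is a large-scale reading comprehension dataset with more than 28,000 passages and nearly 100,000 questions. It was collected from English examinations in China that are designed for middle school and high school students. The dataset can be used as training and test sets for machine reading comprehension.

Supported Tasks and Leaderboards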

More Information Needed

Languages

More Information Needed

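Dataset Structure

Data Instances

all
  • Size of downloaded dataset files: 25.44 MB
  • Size of the generated dataset: 174.73 MB
  • Total amount of disk used: 200.17 MB

An example of "train" looks as follows.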

This example was too long and was cropped:

{
    "answer": "A",
    "article": "\"Schoolgirls have been wearing such short skirts at Paget High School in Branston that they've been ordered to wear trousers ins...",
    "example_id": "high132.txt",
    "options": ["short skirts give people the impression of sexualisation", "short skirts are too expensive for parents to afford", "the headmaster doesn't like girls wearing short skirts", "the girls wearing short skirts will be at the risk of being laughed at"],
    "question": "The girls at Paget High School are not allowed to wear skirts in that    _  ."
}
high
  • Size of downloaded dataset files: 25.44 MB
  • Size of the generated dataset: 140.12 MB
  • Total amount of disk used: 165.56 MB

An example of "train" looks as follows.

This example was too long and was cropped:

{
    "answer": "A",
    "article": "\"Schoolgirls have been wearing such short skirts at Paget High School in Branston that they've been ordered to wear trousers ins...",
    "example_id": "high132.txt",
    "options": ["short skirts give people the impression of sexualisation", "short skirts are too expensive for parents to afford", "the headmaster doesn't like girls wearing short skirts", "the girls wearing short skirts will be at the risk of being laughed at"],
    "question": "The girls at Paget High School are not allowed to wear skirts in that    _  ."
}
middle
  • Size of downloaded dataset files: 25.44 MB
  • Size of the generated dataset: 34.61 MB
  • Total amount of disk used: 60.05 MB

An example of "train" looks as follows.

This example was too long and was cropped:

{
    "answer": "B",
    "article": "\"There is not enough oil in the world now. As time goes by, it becomes less and less, so what are we going to do when it runs ou...",
    "example_id": "middle3.txt",
    "options": ["There is more petroleum than we can use now.", "Trees are needed for some other things besides making gas.", "We got electricity from ocean tides in the old days.", "Gas wasn't used to run cars in the Second World War."],
    "question": "According to the passage, which of the following statements is TRUE?"
}
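
The `answer` field holds a letter rather than the option text. The following is a minimal sketch of turning a record into a readable question and answer pair. It assumes that the letters A to D index `options` in order and that a `_` in a cloze-style question marks the blank to fill; both are inferred from the examples above, not stated by the card. `resolve_answer` and `fill_blank` are hypothetical helper names.

```python
# Record copied from the "middle" example above. The article is cropped
# in the card, so it is omitted here rather than guessed.
example = {
    "answer": "B",
    "example_id": "middle3.txt",
    "options": [
        "There is more petroleum than we can use now.",
        "Trees are needed for some other things besides making gas.",
        "We got electricity from ocean tides in the old days.",
        "Gas wasn't used to run cars in the Second World War.",
    ],
    "question": "According to the passage, which of the following statements is TRUE?",
}


def resolve_answer(record):
    """Return the option text picked by the answer letter (assumes A-D map to indices 0-3)."""
    index = ord(record["answer"]) - ord("A")
    if not 0 <= index < len(record["options"]):
        raise ValueError(f"answer {record['answer']!r} does not match any option")
    return record["options"][index]


def fill_blank(question, option):
    """Insert the option text at the first '_' blank (assumed cloze marker).

    Questions without a blank are returned unchanged.
    """
    if "_" not in question:
        return question
    before, after = question.split("_", 1)
    after = after.strip()
    # Keep trailing punctuation attached to the inserted option.
    separator = "" if after[:1] in {"", ".", "?", "!", ","} else " "
    return f"{before.rstrip()} {option}{separator}{after}".strip()


print(resolve_answer(example))
```

For cloze-style questions such as the "high" example, `fill_blank(record["question"], resolve_answer(record))` yields a single declarative sentence, which can be convenient when inspecting records by hand.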

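Data Fields

The data fields are the same among all splits.

all
  • example_id: a string feature.
  • article: a string feature.
  • answer: a string feature.
  • question: a string feature.
  • options: a list of string features.
high
  • example_id: a string feature.
  • article: a string feature.
  • answer: a string feature.
  • question: a string feature.
  • options: a list of string features.
middle
  • example_id: a string feature.
  • article: a string feature.
  • answer: a string feature.
  • question: a string feature.
  • options: a list of string features.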
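
The field list can also be written down as a typed record. This is a minimal sketch using a standard-library dataclass; the class name `RaceExample` and the `from_dict` helper are hypothetical and are not part of the dataset or of any library. It checks only the types named in the card.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RaceExample:
    example_id: str
    article: str
    answer: str
    question: str
    options: List[str]

    @classmethod
    def from_dict(cls, record: dict) -> "RaceExample":
        """Build a RaceExample, checking the field types listed in the card."""
        for name in ("example_id", "article", "answer", "question"):
            if not isinstance(record[name], str):
                raise TypeError(f"{name} must be a string")
        options = record["options"]
        if not isinstance(options, list) or not all(isinstance(o, str) for o in options):
            raise TypeError("options must be a list of strings")
        return cls(
            example_id=record["example_id"],
            article=record["article"],
            answer=record["answer"],
            question=record["question"],
            options=list(options),
        )
```

Because the fields are the same across configurations, one record type should cover "all", "high", and "middle".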

Data Splits

name    train  validation  test
all     87866        4887  4934
high    62445        3451  3498
middle  25421        1436  1436
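
The split counts are internally consistent: for every split, the "all" row equals "high" plus "middle". A short check, with the numbers copied from the table and `SPLIT_SIZES` as an illustrative name:

```python
# Example counts per configuration and split, copied from the table.
SPLIT_SIZES = {
    "all": {"train": 87866, "validation": 4887, "test": 4934},
    "high": {"train": 62445, "validation": 3451, "test": 3498},
    "middle": {"train": 25421, "validation": 1436, "test": 1436},
}

# "all" should be the union of the "high" and "middle" configurations.
for split in ("train", "validation", "test"):
    combined = SPLIT_SIZES["high"][split] + SPLIT_SIZES["middle"][split]
    assert combined == SPLIT_SIZES["all"][split], split

total_questions = sum(SPLIT_SIZES["all"].values())
print(total_questions)  # 97687
```

The total of 97,687 examples agrees with the "nearly 100,000 questions" stated in the summary.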

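Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

More Information Needed

Additional Information

Dataset Curators

More Information Needed

Licensing Information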

http://www.cs.cmu.edu/~glai1/data/race/

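  • The RACE dataset is available for non-commercial research purposes only.

  • All passages are obtained from the Internet and are not the property of Carnegie Mellon University. We are not responsible for the content or the meaning of these passages.

  • You agree not to reproduce, duplicate, copy, sell, trade, resell, or exploit for any commercial purpose any portion of the contexts or any portion of derived data.

  • We reserve the right to terminate your access to the RACE dataset at any time.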

Citation Information

    @inproceedings{lai-etal-2017-race,
        title = "{RACE}: Large-scale {R}e{A}ding Comprehension Dataset From Examinations",
        author = "Lai, Guokun  and
          Xie, Qizhe  and
          Liu, Hanxiao  and
          Yang, Yiming  and
          Hovy, Eduard",
        booktitle = "Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing",
        month = sep,
        year = "2017",
        address = "Copenhagen, Denmark",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/D17-1082",
        doi = "10.18653/v1/D17-1082",
        pages = "785--794",
    }
    

Contributions

Thanks to @abarbosa94, @patrickvonplaten, @lewtun, @thomwolf, and @mariamabarham for adding this dataset.