Datasets:

cbt

Languages:

en

Multilinguality:

monolingual

Language Creators:

found

Annotations Creators:

machine-generated

Source Datasets:

original

ArXiv:

arxiv:1511.02301

License:

gfdl

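Dataset Card for CBT

Dataset Summary

The Children's Book Test (CBT) is designed to measure directly how well language models can exploit wider linguistic context. The CBT is built from books that are freely available.

The dataset has a raw configuration containing the full text of the books, plus four question configurations that differ in the type of word removed from the question:

  • V: the answer to the question is a verb.
  • P: the answer to the question is a pronoun.
  • NE: the answer to the question is a named entity.
  • CN: the answer to the question is a common noun.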
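As a rough illustration of how the configuration names might be used, the sketch below maps each name to the answer type it targets and wraps a loading call. It assumes the Hugging Face `datasets` package is installed and that the dataset is published under the identifier `cbt` shown in the card metadata; the helper `load_cbt` and the `CBT_CONFIGS` table are hypothetical names introduced here, and the `raw` entry comes from the data splits listed later in this card.

```python
# Configuration names from this card, with the answer type each one targets.
CBT_CONFIGS = {
    "raw": "full book text (title and content)",
    "V": "verb",
    "P": "pronoun",
    "NE": "named entity",
    "CN": "common noun",
}


def load_cbt(config: str, split: str = "train"):
    """Load one CBT configuration (hypothetical helper, not part of any library)."""
    if config not in CBT_CONFIGS:
        raise ValueError(
            f"unknown configuration {config!r}; expected one of {sorted(CBT_CONFIGS)}"
        )
    # Imported lazily so the rest of this sketch runs without the dependency.
    from datasets import load_dataset

    # Assumption: the Hub identifier is "cbt", as in the card metadata.
    return load_dataset("cbt", config, split=split)
```

Calling `load_cbt("NE", split="validation")` would then be expected to return the validation split of the named-entity configuration, provided the identifier above is still valid.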

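Supported Tasks and Leaderboards

[More Information Needed]

Languages

The data is in English and consists of text from children's story books by authors such as Lucy Maud Montgomery, Charles Dickens, and Andrew Lang.

Dataset Structure

Data Instances

An instance from the V configuration: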

{'answer': 'said',
 'options': ['christening', 'existed', 'hear', 'knows', 'read', 'remarked', 'said', 'sitting', 'talking', 'wearing'],
 'question': "`` They are very kind old ladies in their way , '' XXXXX the king ; `` and were nice to me when I was a boy . ''",
 'sentences': ['This vexed the king even more than the queen , who was very clever and learned , and who had hated dolls when she was a child .',
               'However , she , too in spite of all the books she read and all the pictures she painted , would have been glad enough to be the mother of a little prince .',
               'The king was anxious to consult the fairies , but the queen would not hear of such a thing .',
               'She did not believe in fairies : she said that they had never existed ; and that she maintained , though The History of the Royal Family was full of chapters about nothing else .',
               'Well , at long and at last they had a little boy , who was generally regarded as the finest baby that had ever been seen .',
               'Even her majesty herself remarked that , though she could never believe all the courtiers told her , yet he certainly was a fine child -- a very fine child .',
               'Now , the time drew near for the christening party , and the king and queen were sitting at breakfast in their summer parlour talking over it .',
               'It was a splendid room , hung with portraits of the royal ancestors .',
               'There was Cinderella , the grandmother of the reigning monarch , with her little foot in her glass slipper thrust out before her .',
               'There was the Marquis de Carabas , who , as everyone knows , was raised to the throne as prince consort after his marriage with the daughter of the king of the period .',
               'On the arm of the throne was seated his celebrated cat , wearing boots .',
               'There , too , was a portrait of a beautiful lady , sound asleep : this was Madame La Belle au Bois-dormant , also an ancestress of the royal family .',
               'Many other pictures of celebrated persons were hanging on the walls .',
               "`` You have asked all the right people , my dear ? ''",
               'said the king .',
               "`` Everyone who should be asked , '' answered the queen .",
               "`` People are so touchy on these occasions , '' said his majesty .",
               "`` You have not forgotten any of our aunts ? ''",
               "`` No ; the old cats ! ''",
               "replied the queen ; for the king 's aunts were old-fashioned , and did not approve of her , and she knew it ."]}
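To make the cloze format of the instance above concrete, the following minimal sketch substitutes each option into the question and checks a prediction against the answer. The helper names `fill_blank` and `is_correct` are hypothetical, and the example dictionary is a shortened copy of the instance (the `sentences` field is omitted for brevity).

```python
PLACEHOLDER = "XXXXX"  # blank marker as it appears in the instance above


def fill_blank(question: str, candidate: str) -> str:
    """Return the question with the first placeholder replaced by a candidate."""
    return question.replace(PLACEHOLDER, candidate, 1)


def is_correct(example: dict, prediction: str) -> bool:
    """Check whether a predicted option matches the reference answer."""
    return prediction == example["answer"]


# Shortened copy of the V instance shown above (context sentences omitted).
example = {
    "answer": "said",
    "options": ["christening", "existed", "hear", "knows", "read",
                "remarked", "said", "sitting", "talking", "wearing"],
    "question": "`` They are very kind old ladies in their way , '' XXXXX "
                "the king ; `` and were nice to me when I was a boy . ''",
}

# One completed question per option; a model would score these and pick one.
candidates = [fill_blank(example["question"], option) for option in example["options"]]
```

A model evaluated on CBT would rank the ten completed questions given the 20 context sentences and be scored on whether its top choice matches `answer`.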

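Data Fields

For the raw configuration, the data fields are as follows:

  • title: a string feature containing the title of a book in the dataset.
  • content: a string feature containing the content of a book in the dataset.

For all other configurations, the data fields are as follows:

  • sentences: a list of string features containing 20 sentences from a book.
  • question: a string feature containing a question in which the blank is marked XXXXX and is to be filled with one of the options.
  • answer: a string feature containing the answer.
  • options: a list of string features containing the options for the question.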
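The field descriptions above can be turned into a simple sanity check. The sketch below is one possible way to do that, not part of the dataset tooling: `CBTExample` and `validate_example` are hypothetical names, and the placeholder is taken to be `XXXXX`, as it appears in the V instance shown earlier.

```python
from typing import List, TypedDict


class CBTExample(TypedDict):
    """Shape of one record in the V, P, NE, and CN configurations."""
    sentences: List[str]
    question: str
    answer: str
    options: List[str]


def validate_example(example: CBTExample) -> List[str]:
    """Return a list of problems found in a record (empty if none)."""
    problems = []
    if len(example["sentences"]) != 20:
        problems.append(f"expected 20 sentences, got {len(example['sentences'])}")
    if "XXXXX" not in example["question"]:
        problems.append("question has no XXXXX placeholder")
    if example["answer"] not in example["options"]:
        problems.append("answer is not among the options")
    return problems
```

Running `validate_example` over a split and collecting non-empty results is a cheap way to spot malformed records before training.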

Data Splits

The splits and their sizes are as follows:

config    train    test   validation
raw          98       5            5
V        105825    2500         2000
P        334030    2500         2000
CN       120769    2500         2000
NE       108719    2500         2000
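For quick bookkeeping, the split sizes above can be kept in a small table and summed. This is only an arithmetic sketch; `SPLIT_SIZES`, `total_examples`, and `question_train_total` are hypothetical names introduced for illustration.

```python
# Split sizes copied from the table above.
SPLIT_SIZES = {
    "raw": {"train": 98, "test": 5, "validation": 5},
    "V": {"train": 105825, "test": 2500, "validation": 2000},
    "P": {"train": 334030, "test": 2500, "validation": 2000},
    "CN": {"train": 120769, "test": 2500, "validation": 2000},
    "NE": {"train": 108719, "test": 2500, "validation": 2000},
}


def total_examples(config: str) -> int:
    """Number of examples in one configuration, summed over all splits."""
    return sum(SPLIT_SIZES[config].values())


def question_train_total() -> int:
    """Training questions summed over the four question configurations."""
    return sum(sizes["train"] for name, sizes in SPLIT_SIZES.items() if name != "raw")
```

The P configuration is roughly three times larger than the others in training, which may matter when mixing configurations in one training run.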

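Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

The authors of the children's books.

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information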

GNU Free Documentation License v1.3

Citation Information

@misc{hill2016goldilocks,
      title={The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations}, 
      author={Felix Hill and Antoine Bordes and Sumit Chopra and Jason Weston},
      year={2016},
      eprint={1511.02301},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributions

Thanks to @gchhablani for adding this dataset.