Datasets:

bookcorpus

Tasks:

Text Generation

Fill-Mask

Sub-tasks:

language-modeling, masked-language-modeling

Languages:

Multilinguality:

monolingual

Size:

10M<n<100M

Language Creators:

found

Annotations Creators:

no-annotation

Source Datasets:

original

ArXiv:

arxiv:2105.05241

License:

license:unknown

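Dataset Card for BookCorpus

Dataset Summary

Books are a rich source of both fine-grained information, such as how a character, an object, or a scene looks, and high-level semantics, such as what someone is thinking or feeling and how these states evolve through a story. This work aims to align books with their movie releases in order to provide rich descriptive explanations of visual content that go semantically far beyond the captions available in current datasets.

Supported Tasks and Leaderboards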

More Information Needed

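Languages

More Information Needed

Dataset Structure

Data Instances

plain_text

Size of downloaded dataset files: 1.18 GB
Size of the generated dataset: 4.85 GB
Total amount of disk used: 6.03 GB

An example of 'train' looks as follows.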

{
    "text": "But I traded all my life for some lovin' and some gold"
}
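The tags on this card list fill-mask and masked-language-modeling as tasks. As a loose illustration only, the sketch below shows how a line like the sample above could be turned into a masked training example. The whitespace tokenization, the `[MASK]` placeholder, the masking probability, and the helper name `mask_tokens` are all assumptions made for this example; the card does not describe any preprocessing, and real tokenizers define their own mask token.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed placeholder, not taken from the card


def mask_tokens(text, mask_prob=0.15, seed=0):
    """Split text on whitespace and replace some tokens with MASK_TOKEN.

    Returns (masked_tokens, labels). labels holds the original token at
    each masked position and None everywhere else.
    """
    rng = random.Random(seed)  # seeded so the result is reproducible
    masked, labels = [], []
    for token in text.split():
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


text = "But I traded all my life for some lovin' and some gold"
masked, labels = mask_tokens(text, mask_prob=0.3, seed=1)
print(" ".join(masked))
```

A model trained for fill-mask would be asked to recover the entries of `labels` that are not `None`.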

Data Fields

The data fields are the same among all splits.

plain_text

text: a string feature.
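As a rough illustration of this schema, the sketch below parses the sample 'train' record shown earlier and checks that it has exactly one field, `text`, holding a string. The helper name `is_valid_record` is hypothetical and not part of any library.

```python
import json

# The sample 'train' record from this card, written as a JSON string.
SAMPLE = '{"text": "But I traded all my life for some lovin\' and some gold"}'


def is_valid_record(record):
    """Return True if record matches the plain_text schema: one string field, 'text'."""
    return (
        isinstance(record, dict)
        and set(record) == {"text"}
        and isinstance(record["text"], str)
    )


record = json.loads(SAMPLE)
print(is_valid_record(record))       # the sample record
print(is_valid_record({"text": 1}))  # 'text' has the wrong type
```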

Data Splits

name	train
plain_text	74004228
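The figures on this card can be cross-checked with a little arithmetic. The sketch below confirms that the disk total equals the downloaded plus generated sizes, that the row count falls inside the 10M<n<100M size tag, and gives a rough average size per row. The average assumes 1 GB means 10**9 bytes, which the card does not state, and the generated size may include storage overhead, so treat it as an estimate rather than a text length.

```python
from decimal import Decimal

# Figures copied from this card.
DOWNLOADED_GB = Decimal("1.18")
GENERATED_GB = Decimal("4.85")
TOTAL_GB = Decimal("6.03")
TRAIN_ROWS = 74_004_228

# Disk total should equal downloaded size plus generated size.
sizes_consistent = DOWNLOADED_GB + GENERATED_GB == TOTAL_GB

# The size tag on the card is 10M<n<100M.
in_size_bucket = 10_000_000 < TRAIN_ROWS < 100_000_000

# Assumption: 1 GB = 10**9 bytes (the card does not specify the convention).
BYTES_PER_GB = 10**9
avg_bytes_per_row = float(GENERATED_GB) * BYTES_PER_GB / TRAIN_ROWS

print(sizes_consistent, in_size_bucket, round(avg_bytes_per_row, 1))
```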

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Additional Information

Dataset Curators

More Information Needed

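Licensing Information

The books have been crawled from https://www.smashwords.com; see their terms of service for more information.

A data sheet for this dataset has also been created and published in Addressing "Documentation Debt" in Machine Learning Research: A Retrospective Datasheet for BookCorpus.

Citation Information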

@InProceedings{Zhu_2015_ICCV,
    title = {Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books},
    author = {Zhu, Yukun and Kiros, Ryan and Zemel, Rich and Salakhutdinov, Ruslan and Urtasun, Raquel and Torralba, Antonio and Fidler, Sanja},
    booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
    month = {December},
    year = {2015}
}

Contributions

Thanks to @lewtun, @richarddwang, @lhoestq, @thomwolf for adding this dataset.
