
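Dataset Card for CC100

Dataset Summary

This corpus was created to reconstruct the dataset used to train XLM-R. It contains monolingual data for more than 100 languages, plus data for romanized languages (indicated by *_rom). It was built by processing the January-December 2018 Common Crawl snapshots, using the URLs and paragraph indices provided by the CC-Net repository.

Supported Tasks and Leaderboards

CC-100 is mainly intended for pretraining language models and word representations.

Languages

To load a language that is not among the predefined configurations, specify its language code when loading the dataset. Valid language codes are listed on the dataset homepage: https://data.statmt.org/cc-100/ . For example: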

from datasets import load_dataset
dataset = load_dataset("cc100", lang="en")

Dataset Structure

Data Instances

An example from the am configuration:

{'id': '0', 'text': 'ተለዋዋጭ የግድግዳ አንግል ሙቅ አንቀሳቅሷል ቲ-አሞሌ አጥቅሼ ...\n'}

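Each data point is a paragraph of text. Paragraphs are presented in their original (unshuffled) order. Documents are separated by a single newline character.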
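The document boundaries described above can be recovered while iterating over the records. The following is a minimal sketch, not part of the dataset tooling: it assumes the separator shows up as a record whose text is exactly a newline, and both the helper name `group_into_documents` and the sample records are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List


def group_into_documents(records: Iterable[Dict[str, str]]) -> Iterator[List[str]]:
    # Assumption: a record whose text is a lone newline marks a document boundary.
    paragraphs: List[str] = []
    for record in records:
        text = record["text"]
        if text == "\n":
            # Separator record: emit the document collected so far, if any.
            if paragraphs:
                yield paragraphs
                paragraphs = []
        else:
            # Regular paragraph: drop the trailing newline and keep the text.
            paragraphs.append(text.rstrip("\n"))
    # Emit the last document when the stream does not end with a separator.
    if paragraphs:
        yield paragraphs


# Hypothetical records, made up for illustration (not taken from CC-100).
sample = [
    {"id": "0", "text": "First paragraph of document A.\n"},
    {"id": "1", "text": "Second paragraph of document A.\n"},
    {"id": "2", "text": "\n"},
    {"id": "3", "text": "Only paragraph of document B.\n"},
]

documents = list(group_into_documents(sample))
print(len(documents))
```

Because the helper is a generator, it can be applied to a large split without holding every paragraph in memory. If the separator is encoded differently in a given copy of the data, only the comparison against the lone newline needs to change.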

Data Fields

The data fields are:

  • id: the ID of the example
  • text: the content, as a string

Data Splits

Sizes of some configurations:

name train
am 3124561
sr 35747957

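Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

[More Information Needed]

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

The data comes from multiple web pages in many languages.

Annotations

The dataset does not contain any additional annotations.

Annotation process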

[N/A]

Who are the annotators?

[N/A]

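Personal and Sensitive Information

Because the corpus is built from Common Crawl, it may contain personal and sensitive information. This must be considered before training deep learning models on CC-100, especially text generation models.

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

This dataset was prepared by Statistical Machine Translation at the University of Edinburgh using the CC-Net toolkit from Facebook Research.

Licensing Information

Statistical Machine Translation at the University of Edinburgh makes no intellectual property claims on the underlying corpus. By using this dataset, you are also bound by the Common Crawl terms of use with respect to the content contained in the dataset.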

Citation Information

@inproceedings{conneau-etal-2020-unsupervised,
    title = "Unsupervised Cross-lingual Representation Learning at Scale",
    author = "Conneau, Alexis  and
      Khandelwal, Kartikay  and
      Goyal, Naman  and
      Chaudhary, Vishrav  and
      Wenzek, Guillaume  and
      Guzm{\'a}n, Francisco  and
      Grave, Edouard  and
      Ott, Myle  and
      Zettlemoyer, Luke  and
      Stoyanov, Veselin",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.747",
    doi = "10.18653/v1/2020.acl-main.747",
    pages = "8440--8451",
    abstract = "This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6{\%} average accuracy on XNLI, +13{\%} average F1 score on MLQA, and +2.4{\%} F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 15.7{\%} in XNLI accuracy for Swahili and 11.4{\%} for Urdu over previous XLM models. We also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make our code and models publicly available.",
}
@inproceedings{wenzek-etal-2020-ccnet,
    title = "{CCN}et: Extracting High Quality Monolingual Datasets from Web Crawl Data",
    author = "Wenzek, Guillaume  and
      Lachaux, Marie-Anne  and
      Conneau, Alexis  and
      Chaudhary, Vishrav  and
      Guzm{\'a}n, Francisco  and
      Joulin, Armand  and
      Grave, Edouard",
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.494",
    pages = "4003--4012",
    abstract = "Pre-training text representations have led to significant improvements in many areas of natural language processing. The quality of these models benefits greatly from the size of the pretraining corpora as long as its quality is preserved. In this paper, we describe an automatic pipeline to extract massive high-quality monolingual datasets from Common Crawl for a variety of languages. Our pipeline follows the data processing introduced in fastText (Mikolov et al., 2017; Grave et al., 2018), that deduplicates documents and identifies their language. We augment this pipeline with a filtering step to select documents that are close to high quality corpora like Wikipedia.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}

Contributions

Thanks to @abhishekkrthakur for adding this dataset.