Dataset Card for Wikicorpus

Dataset Summary

Wikicorpus is a trilingual corpus (Catalan, Spanish, English) that contains large portions of Wikipedia (based on a 2006 dump) and has been automatically enriched with linguistic information. In its present version, it contains over 750 million words.

The corpora have been annotated with lemma and part-of-speech information using the open-source FreeLing library. They have also been sense-annotated with the state-of-the-art Word Sense Disambiguation algorithm UKB. Because UKB assigns WordNet senses, and WordNet has been aligned across languages via the InterLingual Index, this kind of annotation opens the way to large-scale explorations in lexical semantics that were not possible before.
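The card does not yet show how to access the data, so below is a minimal sketch of inspecting the annotation layers with the Hugging Face datasets library. The config name tagged_en and the column names (sentence, lemmas, pos_tags, wordnet_senses) are assumptions inferred from the description above, not confirmed by this card.

from datasets import load_dataset

# Stream to avoid downloading the full 750-million-word corpus up front.
# "tagged_en" is an assumed config name for the POS/sense-annotated
# English portion; adjust to the dataset's actual config list.
ds = load_dataset("wikicorpus", "tagged_en", split="train", streaming=True)

example = next(iter(ds))
# Assumed layout: parallel per-token lists, one entry per word.
for word, lemma, pos, sense in zip(
    example["sentence"],
    example["lemmas"],
    example["pos_tags"],
    example["wordnet_senses"],
):
    print(f"{word}\t{lemma}\t{pos}\t{sense}")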

Supported Tasks and Leaderboards

[More Information Needed]

Languages

Each sub-dataset is a monolingual corpus in one of the following languages:

  • ca: Catalan
  • en: English
  • es: Spanish
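Assuming each language code above also selects a dataset config (a hypothetical raw_<code> naming scheme, not confirmed by this card), a sketch of loading each monolingual sub-corpus:

from datasets import load_dataset

# Hypothetical config naming: "raw_" + language code. Verify against
# the dataset's actual configuration list before relying on it.
for code in ("ca", "en", "es"):
    ds = load_dataset("wikicorpus", f"raw_{code}", split="train", streaming=True)
    first = next(iter(ds))
    print(code, sorted(first.keys()))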

Dataset Structure

Data Instances

[More Information Needed]

Data Fields

[More Information Needed]

Data Splits

[More Information Needed]

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

WikiCorpus is licensed under the same license as Wikipedia: the GNU Free Documentation License.

Citation Information

@inproceedings{reese-etal-2010-wikicorpus,
    title = "{W}ikicorpus: A Word-Sense Disambiguated Multilingual {W}ikipedia Corpus",
    author = "Reese, Samuel  and
      Boleda, Gemma  and
      Cuadros, Montse  and
      Padr{\'o}, Llu{\'i}s  and
      Rigau, German",
    booktitle = "Proceedings of the Seventh International Conference on Language Resources and Evaluation ({LREC}'10)",
    month = may,
    year = "2010",
    address = "Valletta, Malta",
    publisher = "European Language Resources Association (ELRA)",
    url = "http://www.lrec-conf.org/proceedings/lrec2010/pdf/222_Paper.pdf",
    abstract = "This article presents a new freely available trilingual corpus (Catalan, Spanish, English) that contains large portions of the Wikipedia and has been automatically enriched with linguistic information. To our knowledge, this is the largest such corpus that is freely available to the community: In its present version, it contains over 750 million words. The corpora have been annotated with lemma and part of speech information using the open source library FreeLing. Also, they have been sense annotated with the state of the art Word Sense Disambiguation algorithm UKB. As UKB assigns WordNet senses, and WordNet has been aligned across languages via the InterLingual Index, this sort of annotation opens the way to massive explorations in lexical semantics that were not possible before. We present a first attempt at creating a trilingual lexical resource from the sense-tagged Wikipedia corpora, namely, WikiNet. Moreover, we present two by-products of the project that are of use for the NLP community: An open source Java-based parser for Wikipedia pages developed for the construction of the corpus, and the integration of the WSD algorithm UKB in FreeLing.",
}

Contributions

Thanks to @albertvillanova for adding this dataset.