Datasets:

wikitext

Tasks:

Text Generation

Fill-Mask

Sub-tasks:

language-modeling

masked-language-modeling

Languages:

Multilinguality:

monolingual

Size Categories:

1M<n<10M

Language Creators:

crowdsourced

Annotations Creators:

no-annotation

Source Datasets:

original

ArXiv:

arxiv:1609.07843

License:

cc-by-sa-3.0

gfdl

Dataset Card for "wikitext"

Dataset Summary

The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.

Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger vocabulary and retains the original case, punctuation, and numbers, all of which are removed in PTB. Because it is composed of full articles, the dataset is well suited to models that can take advantage of long-term dependencies.

Each subset comes in two different variants:

Raw (for character-level work) contains the raw tokens, before the addition of the <unk> (unknown) tokens.
Non-raw (for word-level work) contains only the tokens in their vocabulary (wiki.train.tokens, wiki.valid.tokens, and wiki.test.tokens). Out-of-vocabulary tokens have been replaced with the <unk> token.
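
The relationship between the two variants can be sketched in a few lines of Python. This is only an illustration, assuming whitespace-separated tokens; the function name `replace_oov` and the toy vocabulary are hypothetical and are not part of the dataset's tooling.

```python
UNK = "<unk>"

def replace_oov(line, vocab):
    """Replace every whitespace-separated token that is not in `vocab` with <unk>."""
    return " ".join(tok if tok in vocab else UNK for tok in line.split())

# Hypothetical toy vocabulary and input line, for illustration only.
vocab = {"the", "coin", "was", "struck", "in"}
raw_line = "the coin was struck in Philadelphia"
print(replace_oov(raw_line, vocab))
```

A raw line passed through such a step keeps its in-vocabulary tokens and loses the identity of the rest, which is why the raw variant is the one suggested for character-level work.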

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure

Data Instances

wikitext-103-raw-v1

Size of downloaded dataset files: 191.98 MB
Size of the generated dataset: 549.42 MB
Total amount of disk used: 741.41 MB

An example of 'validation' looks as follows.

This example was too long and was cropped:

{
    "text": "\" The gold dollar or gold one @-@ dollar piece was a coin struck as a regular issue by the United States Bureau of the Mint from..."
}
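
The `@-@` marker in the example above stands for a hyphen that was split off during tokenization. A minimal sketch for rejoining such markers follows. Only `@-@` appears in this card; handling `@,@` and `@.@` the same way is an assumption about the tokenization convention, and the helper name `join_at_tokens` is hypothetical.

```python
import re

def join_at_tokens(text):
    """Rejoin ' @-@ ', ' @,@ ' and ' @.@ ' markers into '-', ',' and '.'."""
    # The marker and its surrounding spaces are replaced by the bare character.
    return re.sub(r" @([-,.])@ ", r"\1", text)

sample = "The gold dollar or gold one @-@ dollar piece was a coin"
print(join_at_tokens(sample))
```

This only undoes the `@` markers. Other tokenization spacing, such as the space before a comma, is left untouched.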

wikitext-103-v1

Size of downloaded dataset files: 190.23 MB
Size of the generated dataset: 548.05 MB
Total amount of disk used: 738.27 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "text": "\" Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to..."
}

wikitext-2-raw-v1

Size of downloaded dataset files: 4.72 MB
Size of the generated dataset: 13.54 MB
Total amount of disk used: 18.26 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "text": "\" The Sinclair Scientific Programmable was introduced in 1975 , with the same case as the Sinclair Oxford . It was larger than t..."
}

wikitext-2-v1

Size of downloaded dataset files: 4.48 MB
Size of the generated dataset: 13.34 MB
Total amount of disk used: 17.82 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "text": "\" Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to..."
}

Data Fields

The data fields are the same among all splits.

wikitext-103-raw-v1

text : a string feature.

wikitext-103-v1

text : a string feature.

wikitext-2-raw-v1

text : a string feature.

wikitext-2-v1

text : a string feature.
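
As a rough sketch of how the single `text` field might be accessed, the snippet below wraps `load_dataset` from the Hugging Face `datasets` library. The hub identifier `"wikitext"` is an assumption taken from this card's title and may need a namespace prefix depending on the hub version; the wrapper name `load_wikitext` is hypothetical.

```python
# Configuration names as listed in this card.
CONFIGS = (
    "wikitext-103-raw-v1",
    "wikitext-103-v1",
    "wikitext-2-raw-v1",
    "wikitext-2-v1",
)

def load_wikitext(config, split="train"):
    """Load one split of one configuration; each row is a dict with a 'text' string."""
    if config not in CONFIGS:
        raise ValueError(f"unknown configuration: {config!r}")
    # Imported lazily so the module can be read without the library installed.
    from datasets import load_dataset
    # Assumption: the dataset is published on the hub under the name "wikitext".
    return load_dataset("wikitext", config, split=split)

# Usage (downloads the data on first call):
# ds = load_wikitext("wikitext-2-raw-v1", split="validation")
# print(ds[0]["text"])
```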

Data Splits

name	train	validation	test
wikitext-103-raw-v1	1801350	3760	4358
wikitext-103-v1	1801350	3760	4358
wikitext-2-raw-v1	36718	3760	4358
wikitext-2-v1	36718	3760	4358
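
The table above can be turned into a small lookup to compare split proportions. This is a sketch only, assuming the counts are numbers of examples (rows) per split; the names `SPLITS` and `split_fractions` are hypothetical.

```python
# Split sizes copied from the table above.
SPLITS = {
    "wikitext-103-raw-v1": {"train": 1801350, "validation": 3760, "test": 4358},
    "wikitext-103-v1": {"train": 1801350, "validation": 3760, "test": 4358},
    "wikitext-2-raw-v1": {"train": 36718, "validation": 3760, "test": 4358},
    "wikitext-2-v1": {"train": 36718, "validation": 3760, "test": 4358},
}

def split_fractions(name):
    """Return each split's share of the total count for one configuration."""
    sizes = SPLITS[name]
    total = sum(sizes.values())
    return {split: n / total for split, n in sizes.items()}

for name in SPLITS:
    fractions = split_fractions(name)
    print(name, {split: round(f, 4) for split, f in fractions.items()})
```

Note that the validation and test counts are identical across all four configurations; only the training split differs between WikiText-2 and WikiText-103.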

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Additional Information

Dataset Curators

More Information Needed

Licensing Information

The dataset is available under the Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0).

Citation Information

@misc{merity2016pointer,
      title={Pointer Sentinel Mixture Models},
      author={Stephen Merity and Caiming Xiong and James Bradbury and Richard Socher},
      year={2016},
      eprint={1609.07843},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributions

Thanks to @thomwolf, @lewtun, @patrickvonplaten, @mariamabarham for adding this dataset.

Author:

Anonymous

Dataset size:

25.14 KB
