Datasets:

EleutherAI/lambada_openai

Multilinguality:

translation

Size:

1K<n<10K

Language creators:

machine-generated

Source datasets:

lambada

License:

mit

Dataset Summary

This dataset comprises the LAMBADA test split as pre-processed by OpenAI (see relevant discussions here and here). It also contains machine-translated versions of the split in German, Spanish, French, and Italian.

LAMBADA is used to evaluate the capabilities of computational models for text understanding by means of a word prediction task. LAMBADA is a collection of narrative texts sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole text, but not if they only see the last sentence preceding the target word. To succeed on LAMBADA, computational models cannot simply rely on local context, but must be able to keep track of information in the broader discourse.
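The word prediction task described above can be sketched as follows. This is a simplified, word-level illustration only: the toy passages are made up rather than taken from the dataset, `echo_baseline` is a hypothetical stand-in for a real model, and splitting on the last space is an assumption about how the target word is separated from its context. Real evaluations typically score a language model at the token level.

```python
from typing import Callable, Iterable, Tuple


def split_context_target(text: str) -> Tuple[str, str]:
    """Split a passage into (context, final word) on the last space."""
    context, _, target = text.strip().rpartition(" ")
    return context, target


def last_word_accuracy(passages: Iterable[str], predict: Callable[[str], str]) -> float:
    """Fraction of passages whose final word is predicted exactly."""
    total = 0
    correct = 0
    for text in passages:
        context, target = split_context_target(text)
        total += 1
        correct += int(predict(context) == target)
    return correct / total if total else 0.0


# Toy passages, invented for illustration (not from the dataset).
toy_passages = [
    "She could not find her keys . She searched the whole house for her keys",
    "The dog barked all night . In the morning the neighbours complained about the dog",
]


def echo_baseline(context: str) -> str:
    """Hypothetical baseline: guess the last word of the first sentence."""
    first_sentence = context.split(" . ")[0]
    return first_sentence.split()[-1]


accuracy = last_word_accuracy(toy_passages, echo_baseline)
print(f"last-word accuracy: {accuracy:.2f}")
```

The baseline gets the first toy passage right and the second wrong, which illustrates why a model that leans on a fixed local cue is not enough: the correct word depends on tracking the broader discourse.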

Languages

English, German, Spanish, French, and Italian.

Source Data

For non-English languages, the data splits were produced by Google Translate. See translation_script.py for more details.

Additional Information

Hash Checksums

For data integrity checks, the SHA-256 checksums of the files in this dataset are listed below:

File Name                  Checksum (SHA-256)
lambada_test_de.jsonl      51c6c1795894c46e88e4c104b5667f488efe79081fb34d746b82b8caa663865e
openai/lambada_test.jsonl  4aa8d02cd17c719165fc8a7887fddd641f43fcafa4b1c806ca8abc31fabdb226
lambada_test_en.jsonl      4aa8d02cd17c719165fc8a7887fddd641f43fcafa4b1c806ca8abc31fabdb226
lambada_test_es.jsonl      ffd760026c647fb43c67ce1bc56fd527937304b348712dce33190ea6caba6f9c
lambada_test_fr.jsonl      941ec6a73dba7dc91c860bf493eb66a527cd430148827a4753a4535a046bf362
lambada_test_it.jsonl      86654237716702ab74f42855ae5a78455c1b0e50054a4593fb9c6fcf7fad0850
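One way to check downloaded files against these checksums is sketched below using Python's standard `hashlib`. The directory layout (a local folder containing the files under the relative names from the table) and the helper names `sha256_of_file` and `verify` are assumptions made for illustration, not part of the dataset's tooling.

```python
import hashlib
from pathlib import Path

# Expected digests copied from the checksum table above.
EXPECTED_SHA256 = {
    "lambada_test_de.jsonl": "51c6c1795894c46e88e4c104b5667f488efe79081fb34d746b82b8caa663865e",
    "openai/lambada_test.jsonl": "4aa8d02cd17c719165fc8a7887fddd641f43fcafa4b1c806ca8abc31fabdb226",
    "lambada_test_en.jsonl": "4aa8d02cd17c719165fc8a7887fddd641f43fcafa4b1c806ca8abc31fabdb226",
    "lambada_test_es.jsonl": "ffd760026c647fb43c67ce1bc56fd527937304b348712dce33190ea6caba6f9c",
    "lambada_test_fr.jsonl": "941ec6a73dba7dc91c860bf493eb66a527cd430148827a4753a4535a046bf362",
    "lambada_test_it.jsonl": "86654237716702ab74f42855ae5a78455c1b0e50054a4593fb9c6fcf7fad0850",
}


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(data_dir):
    """Map each file name to True (match), False (mismatch), or None (file absent)."""
    results = {}
    for name, expected in EXPECTED_SHA256.items():
        path = Path(data_dir) / name
        if path.is_file():
            results[name] = sha256_of_file(path) == expected
        else:
            results[name] = None
    return results


if __name__ == "__main__":
    # Assumes the files were downloaded into the current directory.
    for name, status in verify(".").items():
        label = {True: "OK", False: "MISMATCH", None: "missing"}[status]
        print(f"{label:9s} {name}")
```

Note that `openai/lambada_test.jsonl` and `lambada_test_en.jsonl` share the same digest in the table, so the two files have identical contents.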

Licensing

License: Modified MIT

Citation

@article{radford2019language,
  title={Language Models are Unsupervised Multitask Learners},
  author={Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
  year={2019}
}
@misc{paperno2016lambada,
    author={Paperno, Denis and Kruszewski, Germán and Lazaridou, Angeliki and Pham, Quan Ngoc and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fernández, Raquel},
    title={The LAMBADA dataset},
    DOI={10.5281/zenodo.2630551},
    publisher={Zenodo},
    year={2016},
    month={Aug}
}

Contributions

Thanks to Sid Black (@sdtblck) for translating the lambada_openai dataset into the non-English languages.

Thanks to Jonathan Tow (@jon-tow) for adding this dataset.