Dataset:
stas/wmt14-en-de-pre-processed
The original pre-processing script is here.
This pre-processed dataset was created by running:
    git clone https://github.com/pytorch/fairseq
    cd fairseq
    cd examples/translation/
    ./prepare-wmt14en2de.sh
It was originally used by the transformers finetune_trainer.py script.
The data itself resides at https://cdn-datasets.huggingface.co/translation/wmt_en_de.tgz
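
A minimal sketch of fetching and unpacking that archive, assuming `curl` and `tar` are installed. The helper name `fetch_wmt_en_de` and the target-directory argument are hypothetical, chosen for illustration. The card does not describe the archive's internal layout, so none is assumed here.

```shell
# URL taken from the dataset card; everything else below is illustrative.
DATA_URL="https://cdn-datasets.huggingface.co/translation/wmt_en_de.tgz"
ARCHIVE="${DATA_URL##*/}"   # strip everything up to the last slash: wmt_en_de.tgz

# Hypothetical helper: download the archive into a directory and unpack it there.
# Usage: fetch_wmt_en_de [target_dir]   (defaults to the current directory)
fetch_wmt_en_de() {
    dest="${1:-.}"
    mkdir -p "$dest" || return 1
    # -f: fail on HTTP errors, -L: follow redirects, -o: write to the given file
    curl -fL -o "$dest/$ARCHIVE" "$DATA_URL" || return 1
    # -x: extract, -z: gunzip, -f: archive file, -C: extract into $dest
    tar -xzf "$dest/$ARCHIVE" -C "$dest"
}
```

Calling `fetch_wmt_en_de data` would download the archive into `./data` and extract it there. To inspect the contents before extracting, `tar -tzf data/wmt_en_de.tgz` lists the files.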