数据集:
opus_dgt
A collection of translation memories provided by the Joint Research Centre (JRC) Directorate-General for Translation (DGT): https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory
Tha dataset contains 25 languages and 299 bitexts.
To load a language pair which isn't part of the config, all you need to do is specify the language code as pairs, e.g.
dataset = load_dataset("opus_dgt", lang1="it", lang2="pl")
You can find the valid pairs in Homepage section of Dataset Description: http://opus.nlpl.eu/DGT.php
[More Information Needed]
The languages in the dataset are:
{
'id': '0',
'translation': {
"bg": "Протокол за поправка на Конвенцията относно компетентността, признаването и изпълнението на съдебни решения по граждански и търговски дела, подписана в Лугано на 30 октомври 2007 г.",
"ga": "Miontuairisc cheartaitheach maidir le Coinbhinsiún ar dhlínse agus ar aithint agus ar fhorghníomhú breithiúnas in ábhair shibhialta agus tráchtála, a siníodh in Lugano an 30 Deireadh Fómhair 2007"
}
}
The dataset contains a single train split.
[More Information Needed]
[More Information Needed]
Initial Data Collection and Normalization[More Information Needed]
Who are the source language producers?[More Information Needed]
[More Information Needed]
Annotation process[More Information Needed]
Who are the annotators?[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
@InProceedings{TIEDEMANN12.463,
author = {J{\"o}rg Tiedemann},
title = {Parallel Data, Tools and Interfaces in OPUS},
booktitle = {Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12)},
year = {2012},
month = {may},
date = {23-25},
address = {Istanbul, Turkey},
editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
publisher = {European Language Resources Association (ELRA)},
isbn = {978-2-9517408-7-7},
language = {english}
}
Thanks to @rkc007 for adding this dataset.