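Parallel corpora from web crawls collected in the ParaCrawl project.
The dataset contains:
To load a language pair that is not part of the config, you only need to specify the two language codes as a pair, e.g.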
from datasets import load_dataset

# lang1/lang2 select the language pair (here English and Somali)
dataset = load_dataset("opus_paracrawl", lang1="en", lang2="so")
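You can find the valid language pairs in the Homepage section of the Dataset Description: http://opus.nlpl.eu/ParaCrawl.php
[More Information Needed]
The languages in the dataset are: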
{
'id': '0',
'translation': {
"el": "Συνεχίστε ευθεία 300 μέτρα μέχρι να καταλήξουμε σε μια σωστή οδός (ul. Gagarina)? Περπατήστε περίπου 300 μέτρα μέχρι να φτάσετε το πρώτο ορθή οδός (ul Khotsa Namsaraeva)?",
"en": "Go straight 300 meters until you come to a proper street (ul. Gagarina); Walk approximately 300 meters until you reach the first proper street (ul Khotsa Namsaraeva);"
}
}
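The dataset contains a single train split.
[More Information Needed]
[More Information Needed]
Initial Data Collection and Normalization: [More Information Needed]
Who are the source language producers? [More Information Needed]
[More Information Needed]
Annotation process: [More Information Needed]
Who are the annotators? [More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]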
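As a rough illustration of how records in this shape might be consumed, the sketch below pulls (source, target) sentence pairs out of the `translation` field. The helper name `iter_pairs` and the `sample_records` list are hypothetical and not part of the dataset or the `datasets` library. The sample texts are shortened copies of the instance above.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical in-memory record mirroring the example instance above
# (both sentences are shortened here for readability).
sample_records = [
    {
        "id": "0",
        "translation": {
            "el": "Συνεχίστε ευθεία 300 μέτρα",
            "en": "Go straight 300 meters",
        },
    },
]


def iter_pairs(
    records: Iterable[Dict], src: str, tgt: str
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs for the given language codes.

    Records that lack either language code are skipped.
    """
    for record in records:
        translation = record["translation"]
        if src in translation and tgt in translation:
            yield translation[src], translation[tgt]


pairs = list(iter_pairs(sample_records, "el", "en"))
print(pairs[0][1])
```

Assuming a loaded dataset yields records with the same `id` and `translation` keys, passing `dataset["train"]` in place of `sample_records` should work the same way.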
@inproceedings{banon-etal-2020-paracrawl,
title = "{P}ara{C}rawl: Web-Scale Acquisition of Parallel Corpora",
author = "Ba{\~n}{\'o}n, Marta and
Chen, Pinzhen and
Haddow, Barry and
Heafield, Kenneth and
Hoang, Hieu and
Espl{\`a}-Gomis, Miquel and
Forcada, Mikel L. and
Kamran, Amir and
Kirefu, Faheem and
Koehn, Philipp and
Ortiz Rojas, Sergio and
Pla Sempere, Leopoldo and
Ram{\'\i}rez-S{\'a}nchez, Gema and
Sarr{\'\i}as, Elsa and
Strelec, Marek and
Thompson, Brian and
Waites, William and
Wiggins, Dion and
Zaragoza, Jaume",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.acl-main.417",
doi = "10.18653/v1/2020.acl-main.417",
pages = "4555--4567",
}
@InProceedings{TIEDEMANN12.463,
author = {Jörg Tiedemann},
title = {Parallel Data, Tools and Interfaces in OPUS},
booktitle = {Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12)},
year = {2012},
month = {may},
date = {23-25},
address = {Istanbul, Turkey},
editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Uğur Doğan and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
publisher = {European Language Resources Association (ELRA)},
isbn = {978-2-9517408-7-7},
language = {english}
}
Thanks to @rkc007 for adding this dataset.