数据集:
opus_wikipedia
这是由Krzysztof Wołk和Krzysztof Marasek从维基百科中提取的平行句子语料库。
该数据集包含20种语言和36个双语数据集。
要加载不在配置中的语言对,您只需指定语言代码作为pairs,例如
dataset = load_dataset("opus_wikipedia", lang1="it", lang2="pl")
您可以在数据集描述的首页部分找到有效的语言对: http://opus.nlpl.eu/Wikipedia.php
[需要更多信息]
该数据集中的语言有:
{
'id': '0',
'translation': {
"ar": "* Encyclopaedia of Mathematics online encyclopaedia from Springer, Graduate-level reference work with over 8,000 entries, illuminating nearly 50,000 notions in mathematics.",
"en": "*Encyclopaedia of Mathematics online encyclopaedia from Springer, Graduate-level reference work with over 8,000 entries, illuminating nearly 50,000 notions in mathematics."
}
}
该数据集包含一个train集。
[需要更多信息]
[需要更多信息]
初始数据收集和规范化[需要更多信息]
源语言制作者是谁?[需要更多信息]
[需要更多信息]
注释过程[需要更多信息]
注释者是谁?[需要更多信息]
[需要更多信息]
[需要更多信息]
[需要更多信息]
[需要更多信息]
[需要更多信息]
[需要更多信息]
@article{WOLK2014126,
title = {Building Subject-aligned Comparable Corpora and Mining it for Truly Parallel Sentence Pairs},
journal = {Procedia Technology},
volume = {18},
pages = {126-132},
year = {2014},
note = {International workshop on Innovations in Information and Communication Science and Technology, IICST 2014, 3-5 September 2014, Warsaw, Poland},
issn = {2212-0173},
doi = {https://doi.org/10.1016/j.protcy.2014.11.024},
url = {https://www.sciencedirect.com/science/article/pii/S2212017314005453},
author = {Krzysztof Wołk and Krzysztof Marasek},
keywords = {Comparable corpora, machine translation, NLP},
}
@InProceedings{TIEDEMANN12.463,
author = {J{\"o}rg Tiedemann},
title = {Parallel Data, Tools and Interfaces in OPUS},
booktitle = {Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12)},
year = {2012},
month = {may},
date = {23-25},
address = {Istanbul, Turkey},
editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
publisher = {European Language Resources Association (ELRA)},
isbn = {978-2-9517408-7-7},
language = {english}
}
感谢 @rkc007 添加了这个数据集。