Dataset: emea
Tasks:
Multilinguality: multilingual
Size: 1M<n<10M
Language creators: found
Annotation creators: found
Source datasets: original
License:
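To load a language pair that is not among the predefined configurations, simply specify the two language codes as a pair. The valid language pairs are listed on the page linked in the Homepage section of the dataset description: http://opus.nlpl.eu/EMEA.php. For example: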
from datasets import load_dataset
dataset = load_dataset("emea", lang1="en", lang2="nl")
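The call above can be wrapped in a small helper so that the codes are cleaned up before loading. This is only a sketch: `normalize_pair` and `load_emea_pair` are hypothetical names, and the alphabetical ordering of the two codes is an assumption drawn from the pairs shown in this card (for example `en-nl` and `bg-el`), not a documented rule. It also assumes the installed `datasets` version still supports this loader. Valid pairs should still be checked against the OPUS page.

```python
def normalize_pair(lang1, lang2):
    """Return the two language codes lowercased and in alphabetical order.

    Alphabetical ordering is an assumption based on the pairs listed in
    this card; it is not confirmed by the dataset documentation.
    """
    a, b = lang1.strip().lower(), lang2.strip().lower()
    if a == b:
        raise ValueError("a language pair needs two different codes")
    first, second = sorted((a, b))
    return first, second


def load_emea_pair(lang1, lang2):
    """Load one EMEA language pair (hypothetical convenience wrapper)."""
    # Imported lazily so normalize_pair works without `datasets` installed.
    from datasets import load_dataset

    first, second = normalize_pair(lang1, lang2)
    # Same call as in the card, with the cleaned-up codes.
    return load_dataset("emea", lang1=first, lang2=second)
```

Calling `load_emea_pair("NL", "en")` would then issue the same request as the example in the card.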
[More Information Needed]
[More Information Needed]
Here is an example from the en-nl configuration:
{'id': '4',
'translation': {'en': 'EPAR summary for the public',
'nl': 'EPAR-samenvatting voor het publiek'}}
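The data fields, as seen in the example above, are `id` and `translation`; `translation` maps each language code to the sentence in that language.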
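Going by the example record, each entry carries an `id` and a `translation` dictionary keyed by language code, so the two sides of a sentence pair can be read with ordinary dictionary access. The following is a minimal sketch; `extract_pair` is a hypothetical helper, and the record is copied from the en-nl example rather than loaded from the dataset.

```python
# Record copied from the en-nl example in this card.
record = {
    "id": "4",
    "translation": {
        "en": "EPAR summary for the public",
        "nl": "EPAR-samenvatting voor het publiek",
    },
}


def extract_pair(record, src, tgt):
    """Return (source_text, target_text) for the given language codes."""
    translation = record["translation"]
    # Fail early with a readable message if a requested code is absent.
    missing = [code for code in (src, tgt) if code not in translation]
    if missing:
        raise KeyError(f"record {record['id']!r} has no text for: {missing}")
    return translation[src], translation[tgt]


source_text, target_text = extract_pair(record, "en", "nl")
```

Swapping the `src` and `tgt` arguments gives the reverse translation direction from the same record.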
Sizes of some language pairs:
| name | train |
|---|---|
| bg-el | 1044065 |
| cs-et | 1053164 |
| de-mt | 1000532 |
| fr-sk | 1062753 |
| es-lt | 1051370 |
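[More Information Needed]
[More Information Needed]
Initial Data Collection and Normalization: [More Information Needed]
Who are the source language producers? [More Information Needed]
[More Information Needed]
Annotation process: [More Information Needed]
Who are the annotators? [More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]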
@InProceedings{TIEDEMANN12.463,
author = {J{\"o}rg Tiedemann},
title = {Parallel Data, Tools and Interfaces in OPUS},
booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},
year = {2012},
month = {may},
date = {23-25},
address = {Istanbul, Turkey},
editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
publisher = {European Language Resources Association (ELRA)},
isbn = {978-2-9517408-7-7},
language = {english}
}
Thanks to @abhishekkrthakur for adding this dataset.