Datasets:

emea

Tasks:

translation

Languages:

Multilinguality:

multilingual

Size:

1M<n<10M

Language Creators:

found

Annotations Creators:

found

Source Datasets:

original

License:

license:unknown

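Dataset Card for EMEA

Dataset Summary

To load a language pair that is not among the predefined configurations, simply pass the two language codes as a pair. Valid language pairs are listed on the homepage given in the dataset description: http://opus.nlpl.eu/EMEA.php . For example: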

from datasets import load_dataset
dataset = load_dataset("emea", lang1="en", lang2="nl")
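
The same call can be wrapped so that a pair written as a single string such as "en-nl" is split into the lang1 and lang2 arguments. This is only a sketch: parse_pair and load_emea_pair are hypothetical helper names, and the load itself needs network access and an installed datasets version that still serves this dataset.

```python
from typing import Tuple


def parse_pair(pair: str) -> Tuple[str, str]:
    """Split a pair string such as "en-nl" into its two language codes."""
    parts = pair.split("-")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected 'lang1-lang2', got {pair!r}")
    return parts[0], parts[1]


def load_emea_pair(pair: str):
    """Load one EMEA language pair (hypothetical wrapper; downloads data)."""
    # Imported lazily so parse_pair works without the library installed.
    from datasets import load_dataset

    lang1, lang2 = parse_pair(pair)
    return load_dataset("emea", lang1=lang1, lang2=lang2)


print(parse_pair("en-nl"))  # ('en', 'nl')
```

Calling load_emea_pair("en-nl") would then be equivalent to the one-line call above.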

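Supported Tasks and Leaderboards

[More Information Needed]

Languages

[More Information Needed]

Dataset Structure

Data Instances

Here is an example from the en-nl configuration: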

{'id': '4',
 'translation': {'en': 'EPAR summary for the public',
  'nl': 'EPAR-samenvatting voor het publiek'}}

Data Fields

The data fields are:

id: the id of the sentence pair
translation: a dictionary of the form {lang1: text_in_lang1, lang2: text_in_lang2}
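
As a minimal sketch of how the two fields above might be consumed, the helper below pulls the id and both texts out of one record shaped like the en-nl example. The name split_pair is hypothetical and not part of any library; the record is copied from the example instance.

```python
from typing import Any, Dict, Tuple


def split_pair(example: Dict[str, Any], src: str, tgt: str) -> Tuple[str, str, str]:
    """Return (id, source_text, target_text) from one record."""
    translation = example["translation"]
    # A missing language code raises KeyError, which surfaces a wrong pair early.
    return example["id"], translation[src], translation[tgt]


example = {
    "id": "4",
    "translation": {
        "en": "EPAR summary for the public",
        "nl": "EPAR-samenvatting voor het publiek",
    },
}

sentence_id, source, target = split_pair(example, "en", "nl")
print(sentence_id, source, target, sep=" | ")
```

Swapping the src and tgt arguments gives the reverse translation direction from the same record.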

Data Splits

Sizes of some language pairs:

name	train
bg-el	1044065
cs-et	1053164
de-mt	1000532
fr-sk	1062753
es-lt	1051370

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

[More Information Needed]

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

[More Information Needed]

Annotation process

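[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]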

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

@InProceedings{TIEDEMANN12.463,
  author = {J{\"o}rg Tiedemann},
  title = {Parallel Data, Tools and Interfaces in OPUS},
  booktitle = {Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12)},
  year = {2012},
  month = {may},
  date = {23-25},
  address = {Istanbul, Turkey},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-7-7},
  language = {english}
}

Contributions

Thanks to @abhishekkrthakur for adding this dataset.

Author:

Anonymous

Dataset size:

18.35 KB