Dataset:
qanastek/ELRC-Medical-V2
ELRC-Medical-V2 is a parallel corpus for neural machine translation, funded by the European Commission and coordinated by the German Research Center for Artificial Intelligence.
translation: The dataset can be used to train a machine translation model.
The corpus consists of pairs of source and target sentences for 23 different European Union (EU) languages, with English (EN) as the source language in every pair.
List of languages: Bulgarian (bg), Czech (cs), Danish (da), German (de), Greek (el), Spanish (es), Estonian (et), Finnish (fi), French (fr), Irish (ga), Croatian (hr), Hungarian (hu), Italian (it), Lithuanian (lt), Latvian (lv), Maltese (mt), Dutch (nl), Polish (pl), Portuguese (pt), Romanian (ro), Slovak (sk), Slovenian (sl), Swedish (sv).
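Each language pair appears to be identified as `en-<code>`: the loading example uses the `en-es` configuration and the sample rows carry `en-bg` in the `lang` field. The following is a minimal sketch that builds these identifiers, assuming the same naming scheme holds for all 23 target languages. `TARGET_LANGS`, `config_name` and `CONFIGS` are illustrative names, not part of the dataset or of any library.

```python
# Target language codes, copied from the list of languages above.
TARGET_LANGS = [
    "bg", "cs", "da", "de", "el", "es", "et", "fi", "fr", "ga", "hr", "hu",
    "it", "lt", "lv", "mt", "nl", "pl", "pt", "ro", "sk", "sl", "sv",
]


def config_name(target: str, source: str = "en") -> str:
    """Build a pair identifier such as 'en-es' (assumed naming scheme)."""
    if target not in TARGET_LANGS:
        raise ValueError(f"unsupported target language: {target!r}")
    return f"{source}-{target}"


# One identifier per target language, English always being the source.
CONFIGS = [config_name(code) for code in TARGET_LANGS]
print(len(CONFIGS), CONFIGS[:3])
```

Validating the code before building the identifier catches typos early, before any download is attempted.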

    from datasets import load_dataset

    NAME = "qanastek/ELRC-Medical-V2"

    dataset = load_dataset(NAME, use_auth_token=True)
    print(dataset)

    # First 90% of the en-es pairs for training, last 10% for testing (no overlap).
    dataset_train = load_dataset(NAME, "en-es", split='train[:90%]')
    dataset_test = load_dataset(NAME, "en-es", split='train[90%:]')

    print(dataset_train)
    print(dataset_train[0])
    print(dataset_test)


    id,lang,source_text,target_text
    1,en-bg,"TOC \o ""1-3"" \h \z \u Introduction 3","TOC \o ""1-3"" \h \z \u Въведение 3"
    2,en-bg,The international humanitarian law and its principles are often not respected.,Международното хуманитарно право и неговите принципи често не се зачитат.
    3,en-bg,"At policy level, progress was made on several important initiatives.",На равнище политики напредък е постигнат по няколко важни инициативи.

id: The document identifier, of type Integer.
lang: The source and target language pair, of type String.
source_text: The source text, of type String.
target_text: The target text, of type String.
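The four fields can be read from the CSV form with Python's standard `csv` module, which handles quoted fields that contain commas. The following is a minimal sketch under the assumption that the data is available as CSV text with the header shown in the sample. `Record` and `parse_records` are hypothetical helper names introduced for illustration.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    id: int           # document identifier
    lang: str         # language pair, e.g. "en-bg"
    source_text: str  # English source sentence
    target_text: str  # translated target sentence


def parse_records(csv_text: str) -> List[Record]:
    """Parse CSV text with an id,lang,source_text,target_text header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        Record(
            id=int(row["id"]),
            lang=row["lang"],
            source_text=row["source_text"],
            target_text=row["target_text"],
        )
        for row in reader
    ]


# Two rows taken from the sample data.
SAMPLE = (
    "id,lang,source_text,target_text\n"
    "2,en-bg,The international humanitarian law and its principles are often "
    "not respected.,Международното хуманитарно право и неговите принципи "
    "често не се зачитат.\n"
    '3,en-bg,"At policy level, progress was made on several important '
    'initiatives.",На равнище политики напредък е постигнат по няколко '
    "важни инициативи.\n"
)

for record in parse_records(SAMPLE):
    print(record.id, record.lang, record.source_text)
```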
| Lang | # Docs | Avg. # Source Tokens | Avg. # Target Tokens |
|---|---|---|---|
| bg | 13 149 | 23 | 24 |
| cs | 13 160 | 23 | 21 |
| da | 13 242 | 23 | 22 |
| de | 13 291 | 23 | 22 |
| el | 13 091 | 23 | 26 |
| es | 13 195 | 23 | 28 |
| et | 13 016 | 23 | 17 |
| fi | 12 942 | 23 | 16 |
| fr | 13 149 | 23 | 28 |
| ga | 412 | 12 | 12 |
| hr | 12 836 | 23 | 21 |
| hu | 13 025 | 23 | 21 |
| it | 13 059 | 23 | 25 |
| lt | 12 580 | 23 | 18 |
| lv | 13 044 | 23 | 19 |
| mt | 3 093 | 16 | 14 |
| nl | 13 191 | 23 | 25 |
| pl | 12 761 | 23 | 22 |
| pt | 13 148 | 23 | 26 |
| ro | 13 163 | 23 | 25 |
| sk | 12 926 | 23 | 20 |
| sl | 13 208 | 23 | 21 |
| sv | 13 099 | 23 | 21 |
| Total | 277 780 | 22.21 | 21.47 |
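Per-language statistics of this kind can be recomputed from the rows themselves. The sketch below counts documents and averages token counts per language pair, assuming whitespace tokenization. The source does not say which tokenizer produced the figures in the table, so results may differ from it. `corpus_stats` is a hypothetical helper name.

```python
from collections import defaultdict


def corpus_stats(rows):
    """Return per-language document counts and average token counts.

    `rows` is an iterable of dicts with 'lang', 'source_text' and
    'target_text' keys. Tokens are split on whitespace (an assumption).
    """
    # lang -> [number of docs, total source tokens, total target tokens]
    totals = defaultdict(lambda: [0, 0, 0])
    for row in rows:
        entry = totals[row["lang"]]
        entry[0] += 1
        entry[1] += len(row["source_text"].split())
        entry[2] += len(row["target_text"].split())
    return {
        lang: {
            "docs": docs,
            "avg_source_tokens": src / docs,
            "avg_target_tokens": tgt / docs,
        }
        for lang, (docs, src, tgt) in totals.items()
    }


rows = [
    {"lang": "en-bg",
     "source_text": "At policy level, progress was made on several important initiatives.",
     "target_text": "На равнище политики напредък е постигнат по няколко важни инициативи."},
]
print(corpus_stats(rows))
```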
For details, check the corresponding pages.
The acquisition of bilingual data (from multilingual websites), normalization, cleaning, deduplication and identification of parallel documents were performed with the ILSP-FC tool. The Maligna aligner was used to align segments. Segment pairs were then merged and filtered.
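To give a feel for what normalization, deduplication and filtering of segment pairs involve, here is a minimal sketch. It is not the ILSP-FC or Maligna implementation: the Unicode normalization form, the case-insensitive duplicate key and the length-ratio threshold of 3.0 are all assumptions chosen for illustration, and `normalize` and `clean_pairs` are hypothetical names.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def clean_pairs(pairs, max_length_ratio=3.0):
    """Normalize, filter and deduplicate (source, target) segment pairs.

    Drops pairs with an empty side, pairs whose character lengths differ
    by more than `max_length_ratio` (assumed threshold), and duplicates
    (compared case-insensitively). Keeps the first occurrence.
    """
    seen = set()
    kept = []
    for source, target in pairs:
        source, target = normalize(source), normalize(target)
        if not source or not target:
            continue
        lengths = (len(source), len(target))
        if max(lengths) / min(lengths) > max_length_ratio:
            continue
        key = (source.casefold(), target.casefold())
        if key in seen:
            continue
        seen.add(key)
        kept.append((source, target))
    return kept


pairs = [
    ("Hello   world", "Hola mundo"),
    ("hello world", "hola mundo"),  # duplicate after normalization
    ("", "sin fuente"),             # empty source side
    ("Hi", "Una frase muy larga que no corresponde"),  # implausible lengths
]
print(clean_pairs(pairs))
```

A length-ratio check is a cheap way to catch misaligned segments, but a fixed threshold can be too strict for very short sentences.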
Who are the source language producers?
All data in this corpus was uploaded by Vassilis Papavassiliou to ELRC-Share.
The corpus is free of personal or sensitive information.
The nature of the task introduces variability in the quality of the target translations.
ELRC-Medical-V2: Labrak Yanis, Dufour Richard
Bilingual corpus from the Publications Office of the EU on the medical domain v.2 (EN-XX) Corpus: Vassilis Papavassiliou and others.
This work is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.
Please cite the following paper when using this dataset.
@inproceedings{losch-etal-2018-european,
    title = "European Language Resource Coordination: Collecting Language Resources for Public Sector Multilingual Information Management",
    author = {
        L{\"o}sch, Andrea and
        Mapelli, Valérie and
        Piperidis, Stelios and
        Vasiljevs, Andrejs and
        Smal, Lilli and
        Declerck, Thierry and
        Schnur, Eileen and
        Choukri, Khalid and
        van Genabith, Josef
    },
    booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)",
    month = may,
    year = "2018",
    address = "Miyazaki, Japan",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://aclanthology.org/L18-1213",
}