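Neural machine translation model for translating from Italic languages (itc) to Italic languages (itc).
This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many of the world's languages. All models are originally trained with the Marian NMT framework, an efficient NMT implementation written in pure C++. The models have been converted to PyTorch using the transformers library by Hugging Face. Training data is taken from OPUS, and the training pipelines follow the procedures of OPUS-MT-train.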
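Model description:
This is a multilingual translation model with multiple target languages. A sentence-initial language token of the form >>id<< is required (id = a valid target language ID), e.g. >>ast<<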
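As a small illustration of the language-token convention described above, the sketch below prepends a `>>id<<` token to each source sentence before it is passed to the tokenizer. The helper name `add_target_token` is hypothetical and not part of transformers or OPUS-MT; it only rejects obviously malformed IDs and does not check that the ID is one the model actually supports.

```python
def add_target_token(sentences, target_id):
    """Prefix each sentence with the >>id<< token for the target language.

    Hypothetical helper for illustration only: it does not verify that
    target_id is a language the model was trained to produce.
    """
    target_id = target_id.strip()
    # Reject empty IDs and IDs that would break the >>id<< syntax.
    if not target_id or any(ch in target_id for ch in "<> \t\n"):
        raise ValueError(f"invalid target language ID: {target_id!r}")
    return [f">>{target_id}<< {s.strip()}" for s in sentences]


# Build French-targeted inputs from raw Occitan and Catalan sentences.
for line in add_target_token(["Charras anglés?", "Vull veure't."], "fra"):
    print(line)
```

The resulting strings have the same shape as the `src_text` entries used in the usage example of this card.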
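This model can be used for translation and text-to-text generation.
CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing, offensive, and can propagate historical and current stereotypes.
Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).
A short code example: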
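from transformers import MarianMTModel, MarianTokenizer

# Each source sentence starts with the target-language token (here: French).
src_text = [
    ">>fra<< Charras anglés?",
    ">>fra<< Vull veure't."
]

# Hub ID as used in the pipeline example of this card; the original snippet
# pointed at the path "pytorch-models/opus-mt-tc-big-itc-itc".
model_name = "Helsinki-NLP/opus-mt-tc-big-itc-itc"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch with padding, then generate the translations.
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# expected output:
#     Conversations anglaises ?
#     Je veux te voir.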
You can also use OPUS-MT models with the transformers pipelines, for example:
from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-itc-itc")
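# The translation pipeline returns a list of dicts; print only the translated text.
print(pipe(">>fra<< Charras anglés?")[0]["translation_text"])
# expected output: Conversations anglaises ?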
| langpair | testset | chr-F | BLEU | #sent | #words |
|---|---|---|---|---|---|
| cat-fra | tatoeba-test-v2021-08-07 | 0.71201 | 54.6 | 700 | 5664 |
| cat-ita | tatoeba-test-v2021-08-07 | 0.74198 | 58.4 | 298 | 2028 |
| cat-por | tatoeba-test-v2021-08-07 | 0.74930 | 57.4 | 747 | 6119 |
| cat-spa | tatoeba-test-v2021-08-07 | 0.87844 | 78.1 | 1534 | 12094 |
| fra-cat | tatoeba-test-v2021-08-07 | 0.66525 | 46.2 | 700 | 5342 |
| fra-ita | tatoeba-test-v2021-08-07 | 0.72742 | 53.8 | 10091 | 62060 |
| fra-por | tatoeba-test-v2021-08-07 | 0.68413 | 48.6 | 10518 | 77650 |
| fra-ron | tatoeba-test-v2021-08-07 | 0.65009 | 44.0 | 1925 | 12252 |
| fra-spa | tatoeba-test-v2021-08-07 | 0.72080 | 54.8 | 10294 | 78406 |
| glg-por | tatoeba-test-v2021-08-07 | 0.76720 | 61.1 | 433 | 3105 |
| glg-spa | tatoeba-test-v2021-08-07 | 0.82362 | 71.7 | 2121 | 17443 |
| ita-cat | tatoeba-test-v2021-08-07 | 0.72529 | 56.4 | 298 | 2109 |
| ita-fra | tatoeba-test-v2021-08-07 | 0.77932 | 65.2 | 10091 | 66377 |
| ita-por | tatoeba-test-v2021-08-07 | 0.72798 | 54.0 | 3066 | 25668 |
| ita-ron | tatoeba-test-v2021-08-07 | 0.70814 | 51.1 | 1005 | 6209 |
| ita-spa | tatoeba-test-v2021-08-07 | 0.77455 | 62.9 | 5000 | 34937 |
| lad_Latn-spa | tatoeba-test-v2021-08-07 | 0.59363 | 42.6 | 239 | 1239 |
| lad-spa | tatoeba-test-v2021-08-07 | 0.52243 | 34.7 | 276 | 1448 |
| oci-fra | tatoeba-test-v2021-08-07 | 0.49660 | 29.6 | 806 | 6302 |
| pms-ita | tatoeba-test-v2021-08-07 | 0.40221 | 20.0 | 232 | 1721 |
| por-cat | tatoeba-test-v2021-08-07 | 0.71146 | 52.2 | 747 | 6149 |
| por-fra | tatoeba-test-v2021-08-07 | 0.75565 | 60.9 | 10518 | 80459 |
| por-glg | tatoeba-test-v2021-08-07 | 0.75348 | 59.0 | 433 | 3016 |
| por-ita | tatoeba-test-v2021-08-07 | 0.76883 | 58.8 | 3066 | 24897 |
| por-ron | tatoeba-test-v2021-08-07 | 0.67838 | 46.6 | 681 | 4521 |
| por-spa | tatoeba-test-v2021-08-07 | 0.79336 | 64.8 | 10947 | 87335 |
| ron-fra | tatoeba-test-v2021-08-07 | 0.70307 | 55.0 | 1925 | 13347 |
| ron-ita | tatoeba-test-v2021-08-07 | 0.73862 | 53.7 | 1005 | 6352 |
| ron-por | tatoeba-test-v2021-08-07 | 0.70889 | 50.7 | 681 | 4593 |
| ron-spa | tatoeba-test-v2021-08-07 | 0.73529 | 57.2 | 1959 | 12679 |
| spa-cat | tatoeba-test-v2021-08-07 | 0.82758 | 67.9 | 1534 | 12343 |
| spa-fra | tatoeba-test-v2021-08-07 | 0.73113 | 57.3 | 10294 | 83501 |
| spa-glg | tatoeba-test-v2021-08-07 | 0.77332 | 63.0 | 2121 | 16581 |
| spa-ita | tatoeba-test-v2021-08-07 | 0.77046 | 60.3 | 5000 | 34515 |
| spa-lad_Latn | tatoeba-test-v2021-08-07 | 0.40084 | 14.7 | 239 | 1254 |
| spa-por | tatoeba-test-v2021-08-07 | 0.75854 | 59.1 | 10947 | 87610 |
| spa-ron | tatoeba-test-v2021-08-07 | 0.66679 | 45.5 | 1959 | 12503 |
| ast-cat | flores101-devtest | 0.57870 | 31.8 | 1012 | 27304 |
| ast-fra | flores101-devtest | 0.56761 | 31.1 | 1012 | 28343 |
| ast-glg | flores101-devtest | 0.55161 | 27.9 | 1012 | 26582 |
| ast-ita | flores101-devtest | 0.51764 | 22.1 | 1012 | 27306 |
| ast-oci | flores101-devtest | 0.49545 | 20.6 | 1012 | 27305 |
| ast-por | flores101-devtest | 0.57347 | 31.5 | 1012 | 26519 |
| ast-ron | flores101-devtest | 0.52317 | 24.8 | 1012 | 26799 |
| ast-spa | flores101-devtest | 0.49741 | 21.2 | 1012 | 29199 |
| cat-ast | flores101-devtest | 0.56754 | 24.7 | 1012 | 24572 |
| cat-fra | flores101-devtest | 0.63368 | 38.4 | 1012 | 28343 |
| cat-glg | flores101-devtest | 0.59596 | 32.2 | 1012 | 26582 |
| cat-ita | flores101-devtest | 0.55886 | 26.3 | 1012 | 27306 |
| cat-oci | flores101-devtest | 0.54285 | 24.6 | 1012 | 27305 |
| cat-por | flores101-devtest | 0.62913 | 37.7 | 1012 | 26519 |
| cat-ron | flores101-devtest | 0.56885 | 29.5 | 1012 | 26799 |
| cat-spa | flores101-devtest | 0.53372 | 24.6 | 1012 | 29199 |
| fra-ast | flores101-devtest | 0.52696 | 20.7 | 1012 | 24572 |
| fra-cat | flores101-devtest | 0.60492 | 34.6 | 1012 | 27304 |
| fra-glg | flores101-devtest | 0.57485 | 30.3 | 1012 | 26582 |
| fra-ita | flores101-devtest | 0.56493 | 27.3 | 1012 | 27306 |
| fra-oci | flores101-devtest | 0.57449 | 28.2 | 1012 | 27305 |
| fra-por | flores101-devtest | 0.62211 | 36.9 | 1012 | 26519 |
| fra-ron | flores101-devtest | 0.56998 | 29.4 | 1012 | 26799 |
| fra-spa | flores101-devtest | 0.52880 | 24.2 | 1012 | 29199 |
| glg-ast | flores101-devtest | 0.55090 | 22.4 | 1012 | 24572 |
| glg-cat | flores101-devtest | 0.60550 | 32.6 | 1012 | 27304 |
| glg-fra | flores101-devtest | 0.62026 | 36.0 | 1012 | 28343 |
| glg-ita | flores101-devtest | 0.55834 | 25.9 | 1012 | 27306 |
| glg-oci | flores101-devtest | 0.52520 | 21.9 | 1012 | 27305 |
| glg-por | flores101-devtest | 0.60027 | 32.7 | 1012 | 26519 |
| glg-ron | flores101-devtest | 0.55621 | 27.8 | 1012 | 26799 |
| glg-spa | flores101-devtest | 0.53219 | 24.4 | 1012 | 29199 |
| ita-ast | flores101-devtest | 0.50741 | 17.1 | 1012 | 24572 |
| ita-cat | flores101-devtest | 0.57061 | 27.9 | 1012 | 27304 |
| ita-fra | flores101-devtest | 0.60199 | 32.0 | 1012 | 28343 |
| ita-glg | flores101-devtest | 0.55312 | 25.9 | 1012 | 26582 |
| ita-oci | flores101-devtest | 0.48447 | 18.1 | 1012 | 27305 |
| ita-por | flores101-devtest | 0.58162 | 29.0 | 1012 | 26519 |
| ita-ron | flores101-devtest | 0.53703 | 24.2 | 1012 | 26799 |
| ita-spa | flores101-devtest | 0.52238 | 23.1 | 1012 | 29199 |
| oci-ast | flores101-devtest | 0.53010 | 20.2 | 1012 | 24572 |
| oci-cat | flores101-devtest | 0.59946 | 32.2 | 1012 | 27304 |
| oci-fra | flores101-devtest | 0.64290 | 39.0 | 1012 | 28343 |
| oci-glg | flores101-devtest | 0.56737 | 28.0 | 1012 | 26582 |
| oci-ita | flores101-devtest | 0.54220 | 24.2 | 1012 | 27306 |
| oci-por | flores101-devtest | 0.62127 | 35.7 | 1012 | 26519 |
| oci-ron | flores101-devtest | 0.55906 | 28.0 | 1012 | 26799 |
| oci-spa | flores101-devtest | 0.52110 | 22.8 | 1012 | 29199 |
| por-ast | flores101-devtest | 0.54539 | 22.5 | 1012 | 24572 |
| por-cat | flores101-devtest | 0.61809 | 36.4 | 1012 | 27304 |
| por-fra | flores101-devtest | 0.64343 | 39.7 | 1012 | 28343 |
| por-glg | flores101-devtest | 0.57965 | 30.4 | 1012 | 26582 |
| por-ita | flores101-devtest | 0.55841 | 26.3 | 1012 | 27306 |
| por-oci | flores101-devtest | 0.54829 | 25.3 | 1012 | 27305 |
| por-ron | flores101-devtest | 0.57283 | 29.8 | 1012 | 26799 |
| por-spa | flores101-devtest | 0.53513 | 25.2 | 1012 | 29199 |
| ron-ast | flores101-devtest | 0.52265 | 20.1 | 1012 | 24572 |
| ron-cat | flores101-devtest | 0.59689 | 32.6 | 1012 | 27304 |
| ron-fra | flores101-devtest | 0.63060 | 37.4 | 1012 | 28343 |
| ron-glg | flores101-devtest | 0.56677 | 29.3 | 1012 | 26582 |
| ron-ita | flores101-devtest | 0.55485 | 25.6 | 1012 | 27306 |
| ron-oci | flores101-devtest | 0.52433 | 21.8 | 1012 | 27305 |
| ron-por | flores101-devtest | 0.61831 | 36.1 | 1012 | 26519 |
| ron-spa | flores101-devtest | 0.52712 | 24.1 | 1012 | 29199 |
| spa-ast | flores101-devtest | 0.49008 | 15.7 | 1012 | 24572 |
| spa-cat | flores101-devtest | 0.53905 | 23.2 | 1012 | 27304 |
| spa-fra | flores101-devtest | 0.57078 | 27.4 | 1012 | 28343 |
| spa-glg | flores101-devtest | 0.52563 | 22.0 | 1012 | 26582 |
| spa-ita | flores101-devtest | 0.52783 | 22.3 | 1012 | 27306 |
| spa-oci | flores101-devtest | 0.48064 | 16.3 | 1012 | 27305 |
| spa-por | flores101-devtest | 0.55736 | 25.8 | 1012 | 26519 |
| spa-ron | flores101-devtest | 0.51623 | 21.4 | 1012 | 26799 |
| fra-ita | newssyscomb2009 | 0.60995 | 32.1 | 502 | 11551 |
| fra-spa | newssyscomb2009 | 0.60224 | 34.2 | 502 | 12503 |
| ita-fra | newssyscomb2009 | 0.61237 | 33.7 | 502 | 12331 |
| ita-spa | newssyscomb2009 | 0.60706 | 35.4 | 502 | 12503 |
| spa-fra | newssyscomb2009 | 0.61290 | 34.6 | 502 | 12331 |
| spa-ita | newssyscomb2009 | 0.61632 | 33.3 | 502 | 11551 |
| fra-spa | news-test2008 | 0.58939 | 33.9 | 2051 | 52586 |
| spa-fra | news-test2008 | 0.58695 | 32.4 | 2051 | 52685 |
| fra-ita | newstest2009 | 0.59764 | 31.2 | 2525 | 63466 |
| fra-spa | newstest2009 | 0.58829 | 32.5 | 2525 | 68111 |
| ita-fra | newstest2009 | 0.59084 | 31.6 | 2525 | 69263 |
| ita-spa | newstest2009 | 0.59669 | 33.5 | 2525 | 68111 |
| spa-fra | newstest2009 | 0.59096 | 32.3 | 2525 | 69263 |
| spa-ita | newstest2009 | 0.60783 | 33.2 | 2525 | 63466 |
| fra-spa | newstest2010 | 0.62250 | 37.8 | 2489 | 65480 |
| spa-fra | newstest2010 | 0.61953 | 36.2 | 2489 | 66022 |
| fra-spa | newstest2011 | 0.62953 | 39.8 | 3003 | 79476 |
| spa-fra | newstest2011 | 0.61130 | 34.9 | 3003 | 80626 |
| fra-spa | newstest2012 | 0.62397 | 39.0 | 3003 | 79006 |
| spa-fra | newstest2012 | 0.60927 | 34.3 | 3003 | 78011 |
| fra-spa | newstest2013 | 0.59312 | 34.9 | 3000 | 70528 |
| spa-fra | newstest2013 | 0.59468 | 33.6 | 3000 | 70037 |
| cat-ita | wmt21-ml-wp | 0.69968 | 47.8 | 1743 | 42735 |
| cat-oci | wmt21-ml-wp | 0.73808 | 51.6 | 1743 | 43736 |
| cat-ron | wmt21-ml-wp | 0.51178 | 29.0 | 1743 | 42895 |
| ita-cat | wmt21-ml-wp | 0.70538 | 48.9 | 1743 | 43833 |
| ita-oci | wmt21-ml-wp | 0.59025 | 32.0 | 1743 | 43736 |
| ita-ron | wmt21-ml-wp | 0.51261 | 28.9 | 1743 | 42895 |
| oci-cat | wmt21-ml-wp | 0.80908 | 66.1 | 1743 | 43833 |
| oci-ita | wmt21-ml-wp | 0.63584 | 39.6 | 1743 | 42735 |
| oci-ron | wmt21-ml-wp | 0.47384 | 24.6 | 1743 | 42895 |
| ron-cat | wmt21-ml-wp | 0.52994 | 31.1 | 1743 | 43833 |
| ron-ita | wmt21-ml-wp | 0.52714 | 29.6 | 1743 | 42735 |
| ron-oci | wmt21-ml-wp | 0.45932 | 21.3 | 1743 | 43736 |
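To get a rough overview of the benchmark table, the scores can be aggregated per test set. The sketch below parses a Markdown table with the same six columns and averages the BLEU column for each test set. The function name `mean_bleu_by_testset` is hypothetical, and an unweighted mean over language pairs is only a crude summary chosen for this example, not a metric reported by the authors. The sample rows are copied from the table above.

```python
from collections import defaultdict


def mean_bleu_by_testset(table_text):
    """Return {testset: mean BLEU} for a Markdown table with the columns
    langpair | testset | chr-F | BLEU | #sent | #words.

    Hypothetical helper for illustration; the mean is unweighted.
    """
    scores = defaultdict(list)
    for line in table_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 6:
            continue  # blank line or unrelated text
        try:
            bleu = float(cells[3])
        except ValueError:
            continue  # header or separator row
        scores[cells[1]].append(bleu)
    return {name: sum(vals) / len(vals) for name, vals in scores.items()}


sample = """
| langpair | testset | chr-F | BLEU | #sent | #words |
|---|---|---|---|---|---|
| fra-spa | newstest2010 | 0.62250 | 37.8 | 2489 | 65480 |
| spa-fra | newstest2010 | 0.61953 | 36.2 | 2489 | 66022 |
| fra-spa | news-test2008 | 0.58939 | 33.9 | 2051 | 52586 |
"""

for testset, bleu in mean_bleu_by_testset(sample).items():
    print(f"{testset}: {bleu:.1f}")
```

Passing the full table text instead of `sample` gives one average per test set, which makes the gap between the Tatoeba scores and the news and Flores scores easier to see.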
@inproceedings{tiedemann-thottingal-2020-opus,
title = "{OPUS}-{MT} {--} Building open translation services for the World",
author = {Tiedemann, J{\"o}rg and Thottingal, Santhosh},
booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
month = nov,
year = "2020",
address = "Lisboa, Portugal",
publisher = "European Association for Machine Translation",
url = "https://aclanthology.org/2020.eamt-1.61",
pages = "479--480",
}
@inproceedings{tiedemann-2020-tatoeba,
title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
author = {Tiedemann, J{\"o}rg},
booktitle = "Proceedings of the Fifth Conference on Machine Translation",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.wmt-1.139",
pages = "1174--1182",
}
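This work is supported by the European Language Grid as pilot project 2866, by the FoTran project, funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 771113), and by funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 780069. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland.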