
opus-mt-tc-big-itc-itc

Table of Contents

  • Model Details
  • Uses
  • Risks, Limitations and Biases
  • How to Get Started With the Model
  • Training
  • Evaluation
  • Citation Information
  • Acknowledgements

Model Details

This is a neural machine translation model for translating between Italic languages (itc) — note that "itc" denotes the Italic language family (Romance languages plus Latin), not Italian alone.

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages of the world. All models were originally trained with the amazing Marian NMT framework, an efficient NMT implementation written in pure C++. The models have been converted to pyTorch using the transformers library by huggingface. Training data is taken from OPUS, and the training pipelines follow the procedures of OPUS-MT-train.

Model Description:

  • Developed by: Language Technology Research Group at the University of Helsinki
  • Model Type: Translation (transformer-big)
  • Release: 2022-08-10
  • License: CC-BY-4.0
  • Language(s):
    • Source Language(s): ast cat cbk fra fro glg hat ita lad lad_Latn lat lat_Latn lij lld oci pms por ron spa
    • Target Language(s): ast cat fra gcf glg hat ita lad lad_Latn lat lat_Latn oci por ron spa
    • Language Pair(s): ast-cat ast-fra ast-glg ast-ita ast-oci ast-por ast-ron ast-spa cat-ast cat-fra cat-glg cat-ita cat-oci cat-por cat-ron cat-spa fra-ast fra-cat fra-glg fra-ita fra-oci fra-por fra-ron fra-spa glg-ast glg-cat glg-fra glg-ita glg-oci glg-por glg-ron glg-spa ita-ast ita-cat ita-fra ita-glg ita-oci ita-por ita-ron ita-spa lad-spa lad_Latn-spa oci-ast oci-cat oci-fra oci-glg oci-ita oci-por oci-ron oci-spa pms-ita por-ast por-cat por-fra por-glg por-ita por-oci por-ron por-spa ron-ast ron-cat ron-fra ron-glg ron-ita ron-oci ron-por ron-spa spa-cat spa-fra spa-glg spa-ita spa-por spa-ron
    • Valid Target Language Labels: >>acf<< >>aoa<< >>arg<< >>ast<< >>cat<< >>cbk<< >>cbk_Latn<< >>ccd<< >>cks<< >>cos<< >>cri<< >>crs<< >>dlm<< >>drc<< >>egl<< >>ext<< >>fab<< >>fax<< >>fra<< >>frc<< >>frm<< >>frm_Latn<< >>fro<< >>fro_Latn<< >>frp<< >>fur<< >>fur_Latn<< >>gcf<< >>gcf_Latn<< >>gcr<< >>glg<< >>hat<< >>idb<< >>ist<< >>ita<< >>itk<< >>kea<< >>kmv<< >>lad<< >>lad_Latn<< >>lat<< >>lat_Grek<< >>lat_Latn<< >>lij<< >>lld<< >>lld_Latn<< >>lmo<< >>lou<< >>mcm<< >>mfe<< >>mol<< >>mwl<< >>mxi<< >>mzs<< >>nap<< >>nrf<< >>oci<< >>osc<< >>osp<< >>osp_Latn<< >>pap<< >>pcd<< >>pln<< >>pms<< >>pob<< >>por<< >>pov<< >>pre<< >>pro<< >>qbb<< >>qhr<< >>rcf<< >>rgn<< >>roh<< >>ron<< >>ruo<< >>rup<< >>ruq<< >>scf<< >>scn<< >>sdc<< >>sdn<< >>spa<< >>spq<< >>spx<< >>src<< >>srd<< >>sro<< >>tmg<< >>tvy<< >>vec<< >>vkp<< >>wln<< >>xfa<< >>xum<<
  • Original Model: opusTCv20210807_transformer-big_2022-08-10.zip
  • Resources for more information:

This is a multilingual translation model with multiple target languages. A sentence-initial language token in the form >>id<< (id = valid target language ID) is required, e.g. >>ast<<
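The target-token convention above can be sketched as a small helper. This is an illustrative snippet, not part of the model card's official API; the `add_target_token` function and the `VALID_TARGETS` subset are assumptions drawn from the target-language list above.

```python
# Subset of the valid target-language IDs listed in the Model Details section.
VALID_TARGETS = {"ast", "cat", "fra", "gcf", "glg", "hat", "ita",
                 "lad", "lat", "oci", "por", "ron", "spa"}

def add_target_token(sentences, target_lang):
    """Prefix each source sentence with the >>id<< target-language token."""
    if target_lang not in VALID_TARGETS:
        raise ValueError(f"unsupported target language: {target_lang}")
    return [f">>{target_lang}<< {s}" for s in sentences]

print(add_target_token(["Vull veure't."], "fra"))
# → [">>fra<< Vull veure't."]
```

The prefixed strings can then be passed directly to the tokenizer, as in the example code further below.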

Uses

This model can be used for translation and text-to-text generation.

Risks, Limitations and Biases

CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing, offensive, and can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).

How to Get Started With the Model

A short example code:

from transformers import MarianMTModel, MarianTokenizer

# Each source sentence starts with a >>id<< token selecting the target language.
src_text = [
    ">>fra<< Charras anglés?",
    ">>fra<< Vull veure't."
]

model_name = "pytorch-models/opus-mt-tc-big-itc-itc"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch with padding and generate translations.
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# expected output:
#     Conversations anglaises ?
#     Je veux te voir.

You can also use OPUS-MT models with the transformers pipelines, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-itc-itc")
print(pipe(">>fra<< Charras anglés?"))

# expected output: Conversations anglaises ?

Training

Evaluation

langpair testset chr-F BLEU #sent #words
cat-fra tatoeba-test-v2021-08-07 0.71201 54.6 700 5664
cat-ita tatoeba-test-v2021-08-07 0.74198 58.4 298 2028
cat-por tatoeba-test-v2021-08-07 0.74930 57.4 747 6119
cat-spa tatoeba-test-v2021-08-07 0.87844 78.1 1534 12094
fra-cat tatoeba-test-v2021-08-07 0.66525 46.2 700 5342
fra-ita tatoeba-test-v2021-08-07 0.72742 53.8 10091 62060
fra-por tatoeba-test-v2021-08-07 0.68413 48.6 10518 77650
fra-ron tatoeba-test-v2021-08-07 0.65009 44.0 1925 12252
fra-spa tatoeba-test-v2021-08-07 0.72080 54.8 10294 78406
glg-por tatoeba-test-v2021-08-07 0.76720 61.1 433 3105
glg-spa tatoeba-test-v2021-08-07 0.82362 71.7 2121 17443
ita-cat tatoeba-test-v2021-08-07 0.72529 56.4 298 2109
ita-fra tatoeba-test-v2021-08-07 0.77932 65.2 10091 66377
ita-por tatoeba-test-v2021-08-07 0.72798 54.0 3066 25668
ita-ron tatoeba-test-v2021-08-07 0.70814 51.1 1005 6209
ita-spa tatoeba-test-v2021-08-07 0.77455 62.9 5000 34937
lad_Latn-spa tatoeba-test-v2021-08-07 0.59363 42.6 239 1239
lad-spa tatoeba-test-v2021-08-07 0.52243 34.7 276 1448
oci-fra tatoeba-test-v2021-08-07 0.49660 29.6 806 6302
pms-ita tatoeba-test-v2021-08-07 0.40221 20.0 232 1721
por-cat tatoeba-test-v2021-08-07 0.71146 52.2 747 6149
por-fra tatoeba-test-v2021-08-07 0.75565 60.9 10518 80459
por-glg tatoeba-test-v2021-08-07 0.75348 59.0 433 3016
por-ita tatoeba-test-v2021-08-07 0.76883 58.8 3066 24897
por-ron tatoeba-test-v2021-08-07 0.67838 46.6 681 4521
por-spa tatoeba-test-v2021-08-07 0.79336 64.8 10947 87335
ron-fra tatoeba-test-v2021-08-07 0.70307 55.0 1925 13347
ron-ita tatoeba-test-v2021-08-07 0.73862 53.7 1005 6352
ron-por tatoeba-test-v2021-08-07 0.70889 50.7 681 4593
ron-spa tatoeba-test-v2021-08-07 0.73529 57.2 1959 12679
spa-cat tatoeba-test-v2021-08-07 0.82758 67.9 1534 12343
spa-fra tatoeba-test-v2021-08-07 0.73113 57.3 10294 83501
spa-glg tatoeba-test-v2021-08-07 0.77332 63.0 2121 16581
spa-ita tatoeba-test-v2021-08-07 0.77046 60.3 5000 34515
spa-lad_Latn tatoeba-test-v2021-08-07 0.40084 14.7 239 1254
spa-por tatoeba-test-v2021-08-07 0.75854 59.1 10947 87610
spa-ron tatoeba-test-v2021-08-07 0.66679 45.5 1959 12503
ast-cat flores101-devtest 0.57870 31.8 1012 27304
ast-fra flores101-devtest 0.56761 31.1 1012 28343
ast-glg flores101-devtest 0.55161 27.9 1012 26582
ast-ita flores101-devtest 0.51764 22.1 1012 27306
ast-oci flores101-devtest 0.49545 20.6 1012 27305
ast-por flores101-devtest 0.57347 31.5 1012 26519
ast-ron flores101-devtest 0.52317 24.8 1012 26799
ast-spa flores101-devtest 0.49741 21.2 1012 29199
cat-ast flores101-devtest 0.56754 24.7 1012 24572
cat-fra flores101-devtest 0.63368 38.4 1012 28343
cat-glg flores101-devtest 0.59596 32.2 1012 26582
cat-ita flores101-devtest 0.55886 26.3 1012 27306
cat-oci flores101-devtest 0.54285 24.6 1012 27305
cat-por flores101-devtest 0.62913 37.7 1012 26519
cat-ron flores101-devtest 0.56885 29.5 1012 26799
cat-spa flores101-devtest 0.53372 24.6 1012 29199
fra-ast flores101-devtest 0.52696 20.7 1012 24572
fra-cat flores101-devtest 0.60492 34.6 1012 27304
fra-glg flores101-devtest 0.57485 30.3 1012 26582
fra-ita flores101-devtest 0.56493 27.3 1012 27306
fra-oci flores101-devtest 0.57449 28.2 1012 27305
fra-por flores101-devtest 0.62211 36.9 1012 26519
fra-ron flores101-devtest 0.56998 29.4 1012 26799
fra-spa flores101-devtest 0.52880 24.2 1012 29199
glg-ast flores101-devtest 0.55090 22.4 1012 24572
glg-cat flores101-devtest 0.60550 32.6 1012 27304
glg-fra flores101-devtest 0.62026 36.0 1012 28343
glg-ita flores101-devtest 0.55834 25.9 1012 27306
glg-oci flores101-devtest 0.52520 21.9 1012 27305
glg-por flores101-devtest 0.60027 32.7 1012 26519
glg-ron flores101-devtest 0.55621 27.8 1012 26799
glg-spa flores101-devtest 0.53219 24.4 1012 29199
ita-ast flores101-devtest 0.50741 17.1 1012 24572
ita-cat flores101-devtest 0.57061 27.9 1012 27304
ita-fra flores101-devtest 0.60199 32.0 1012 28343
ita-glg flores101-devtest 0.55312 25.9 1012 26582
ita-oci flores101-devtest 0.48447 18.1 1012 27305
ita-por flores101-devtest 0.58162 29.0 1012 26519
ita-ron flores101-devtest 0.53703 24.2 1012 26799
ita-spa flores101-devtest 0.52238 23.1 1012 29199
oci-ast flores101-devtest 0.53010 20.2 1012 24572
oci-cat flores101-devtest 0.59946 32.2 1012 27304
oci-fra flores101-devtest 0.64290 39.0 1012 28343
oci-glg flores101-devtest 0.56737 28.0 1012 26582
oci-ita flores101-devtest 0.54220 24.2 1012 27306
oci-por flores101-devtest 0.62127 35.7 1012 26519
oci-ron flores101-devtest 0.55906 28.0 1012 26799
oci-spa flores101-devtest 0.52110 22.8 1012 29199
por-ast flores101-devtest 0.54539 22.5 1012 24572
por-cat flores101-devtest 0.61809 36.4 1012 27304
por-fra flores101-devtest 0.64343 39.7 1012 28343
por-glg flores101-devtest 0.57965 30.4 1012 26582
por-ita flores101-devtest 0.55841 26.3 1012 27306
por-oci flores101-devtest 0.54829 25.3 1012 27305
por-ron flores101-devtest 0.57283 29.8 1012 26799
por-spa flores101-devtest 0.53513 25.2 1012 29199
ron-ast flores101-devtest 0.52265 20.1 1012 24572
ron-cat flores101-devtest 0.59689 32.6 1012 27304
ron-fra flores101-devtest 0.63060 37.4 1012 28343
ron-glg flores101-devtest 0.56677 29.3 1012 26582
ron-ita flores101-devtest 0.55485 25.6 1012 27306
ron-oci flores101-devtest 0.52433 21.8 1012 27305
ron-por flores101-devtest 0.61831 36.1 1012 26519
ron-spa flores101-devtest 0.52712 24.1 1012 29199
spa-ast flores101-devtest 0.49008 15.7 1012 24572
spa-cat flores101-devtest 0.53905 23.2 1012 27304
spa-fra flores101-devtest 0.57078 27.4 1012 28343
spa-glg flores101-devtest 0.52563 22.0 1012 26582
spa-ita flores101-devtest 0.52783 22.3 1012 27306
spa-oci flores101-devtest 0.48064 16.3 1012 27305
spa-por flores101-devtest 0.55736 25.8 1012 26519
spa-ron flores101-devtest 0.51623 21.4 1012 26799
fra-ita newssyscomb2009 0.60995 32.1 502 11551
fra-spa newssyscomb2009 0.60224 34.2 502 12503
ita-fra newssyscomb2009 0.61237 33.7 502 12331
ita-spa newssyscomb2009 0.60706 35.4 502 12503
spa-fra newssyscomb2009 0.61290 34.6 502 12331
spa-ita newssyscomb2009 0.61632 33.3 502 11551
fra-spa news-test2008 0.58939 33.9 2051 52586
spa-fra news-test2008 0.58695 32.4 2051 52685
fra-ita newstest2009 0.59764 31.2 2525 63466
fra-spa newstest2009 0.58829 32.5 2525 68111
ita-fra newstest2009 0.59084 31.6 2525 69263
ita-spa newstest2009 0.59669 33.5 2525 68111
spa-fra newstest2009 0.59096 32.3 2525 69263
spa-ita newstest2009 0.60783 33.2 2525 63466
fra-spa newstest2010 0.62250 37.8 2489 65480
spa-fra newstest2010 0.61953 36.2 2489 66022
fra-spa newstest2011 0.62953 39.8 3003 79476
spa-fra newstest2011 0.61130 34.9 3003 80626
fra-spa newstest2012 0.62397 39.0 3003 79006
spa-fra newstest2012 0.60927 34.3 3003 78011
fra-spa newstest2013 0.59312 34.9 3000 70528
spa-fra newstest2013 0.59468 33.6 3000 70037
cat-ita wmt21-ml-wp 0.69968 47.8 1743 42735
cat-oci wmt21-ml-wp 0.73808 51.6 1743 43736
cat-ron wmt21-ml-wp 0.51178 29.0 1743 42895
ita-cat wmt21-ml-wp 0.70538 48.9 1743 43833
ita-oci wmt21-ml-wp 0.59025 32.0 1743 43736
ita-ron wmt21-ml-wp 0.51261 28.9 1743 42895
oci-cat wmt21-ml-wp 0.80908 66.1 1743 43833
oci-ita wmt21-ml-wp 0.63584 39.6 1743 42735
oci-ron wmt21-ml-wp 0.47384 24.6 1743 42895
ron-cat wmt21-ml-wp 0.52994 31.1 1743 43833
ron-ita wmt21-ml-wp 0.52714 29.6 1743 42735
ron-oci wmt21-ml-wp 0.45932 21.3 1743 43736
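The table above is whitespace-separated with columns langpair, testset, chr-F, BLEU, #sent, and #words. As an illustrative sketch (the `parse_scores` helper is an assumption, not part of the model card; the inlined rows are copied from the table), the results can be loaded and queried like this:

```python
# A few rows copied verbatim from the evaluation table above.
rows = """\
cat-spa tatoeba-test-v2021-08-07 0.87844 78.1 1534 12094
oci-cat wmt21-ml-wp 0.80908 66.1 1743 43833
pms-ita tatoeba-test-v2021-08-07 0.40221 20.0 232 1721
"""

def parse_scores(text):
    """Parse table lines into dicts keyed by (langpair, testset)."""
    out = {}
    for line in text.strip().splitlines():
        pair, testset, chrf, bleu, nsent, nwords = line.split()
        out[(pair, testset)] = {
            "chr-F": float(chrf), "BLEU": float(bleu),
            "#sent": int(nsent), "#words": int(nwords),
        }
    return out

scores = parse_scores(rows)
best = max(scores, key=lambda k: scores[k]["BLEU"])
print(best, scores[best]["BLEU"])
# → ('cat-spa', 'tatoeba-test-v2021-08-07') 78.1
```

Note the spread: high-resource Romance pairs such as cat-spa reach BLEU in the 70s, while low-resource pairs such as pms-ita stay around 20.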

Citation Information

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

Acknowledgements

The work is supported by the European Language Grid as pilot project 2866, by the FoTran project, funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 771113), and by the European Union's Horizon 2020 research and innovation programme under grant agreement No 780069. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland.

Model conversion info

  • transformers version: 4.16.2
  • OPUS-MT git hash: 8b9f0b0
  • port time: Aug 12, 2022, 23:57:49 EEST
  • port machine: LM0-400-22516.local