opus-mt-tc-big-ar-en

Neural machine translation model for translating from Arabic (ar) to English (en).

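This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many of the world's languages. All models were originally trained with Marian NMT, an efficient NMT framework written in pure C++. The models have been converted to pyTorch using the transformers library by Hugging Face. Training data is taken from OPUS, and the training pipelines follow the procedures of OPUS-MT-train.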

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

Model info

Usage

A short code example:

from transformers import MarianMTModel, MarianTokenizer

# Arabic source sentences to translate
src_text = [
    "اتبع قلبك فحسب.",
    "وين راهي دّوش؟"
]

# Hub identifier, matching the one used in the pipeline example
model_name = "Helsinki-NLP/opus-mt-tc-big-ar-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch (padded to equal length) and generate translations
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# expected output:
#     Just follow your heart.
#     Wayne Rahi Dosh?

You can also use OPUS-MT models through the transformers pipelines, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-ar-en")
print(pipe("اتبع قلبك فحسب."))

# expected output: Just follow your heart.
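
For longer inputs, it may help to translate in fixed-size batches rather than in a single call. The following is a minimal sketch and not part of the original model card: the helper names `chunked` and `translate_in_batches` and the default batch size of 8 are assumptions chosen for illustration. The transformers import is deferred so that the model is only loaded when `translate_in_batches` is actually called.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def translate_in_batches(
    sentences: Sequence[str],
    model_name: str = "Helsinki-NLP/opus-mt-tc-big-ar-en",
    batch_size: int = 8,  # hypothetical default; tune for your hardware
) -> List[str]:
    """Translate Arabic sentences to English, one batch at a time."""
    # Imported lazily so the chunking helper works without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    outputs: List[str] = []
    for batch in chunked(sentences, batch_size):
        # Pad each batch to its own longest sentence.
        encoded = tokenizer(batch, return_tensors="pt", padding=True)
        generated = model.generate(**encoded)
        outputs.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return outputs
```

Calling `translate_in_batches(src_text)` with the list from the first example should return one English string per input sentence, in the same order.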

Benchmarks

langpair  testset                   chr-F    BLEU  #sent  #words
ara-eng   tatoeba-test-v2021-08-07  0.63477  47.3  10305  76975
ara-eng   flores101-devtest         0.66987  42.6   1012  24721
ara-eng   tico19-test               0.68521  44.4   2100  56323
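Acknowledgements

This work is supported by the European Language Grid as pilot project 2866; by the FoTran project, funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 771113); and by the MeMAD project, funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No 780069. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland.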

Model conversion info

  • transformers version: 4.16.2
  • OPUS-MT git hash: 3405783
  • port time: Wed Apr 13 18:17:57 EEST 2022
  • port machine: LM0-400-22516.local