
mGENRE

The mGENRE (multilingual Generative ENtity REtrieval) system, as presented in Multilingual Autoregressive Entity Linking, is implemented in PyTorch.

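In a nutshell, mGENRE uses a sequence-to-sequence approach to entity retrieval (e.g., linking), based on a fine-tuned mBART architecture. GENRE performs retrieval by generating the unique entity name conditioned on the input text, using constrained beam search so that only valid identifiers are generated. The model was first released in the facebookresearch/GENRE repository using fairseq (the transformers models are obtained with a conversion script similar to this one).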

This model was trained on 105 languages from Wikipedia.
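
To make the constrained beam search described above more concrete, here is a minimal sketch of a prefix tree (trie) over token-id sequences. It is not the `Trie` class shipped with the model: the class name `TinyTrie`, the helper `allowed_next`, and the token ids are all hypothetical and chosen only for illustration. The idea is that, at each decoding step, only tokens that continue some valid entity name are allowed.

```python
from typing import Dict, List, Sequence


class TinyTrie:
    """Hypothetical nested-dict prefix tree over token-id sequences."""

    def __init__(self, sequences: Sequence[Sequence[int]] = ()):
        self.root: Dict[int, dict] = {}
        for seq in sequences:
            self.add(seq)

    def add(self, seq: Sequence[int]) -> None:
        # Walk down the tree, creating child nodes as needed.
        node = self.root
        for tok in seq:
            node = node.setdefault(tok, {})

    def get(self, prefix: Sequence[int]) -> List[int]:
        """Return the token ids that may follow `prefix`, or [] if none."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []
            node = node[tok]
        return list(node.keys())


# Made-up ids (assumption): 2 = start, 3 = end,
# 10 = "Albert", 11 = "Alfred", 12 = "Einstein".
trie = TinyTrie([[2, 10, 12, 3], [2, 11, 12, 3]])


def allowed_next(generated_ids: Sequence[int]) -> List[int]:
    # Plays the role of a per-step constraint: given the tokens generated
    # so far, return the only tokens the decoder may emit next.
    return trie.get(list(generated_ids))


print(allowed_next([2]))      # [10, 11]
print(allowed_next([2, 10]))  # [12]
print(allowed_next([2, 99]))  # []
```

A nested-`dict` trie like this one is quick to query but grows large with millions of titles, which is the trade-off behind offering a more compact alternative.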

BibTeX entry and citation info

Please consider citing our work if you use code from this repository.

@article{decao2020multilingual,
    author = {De Cao, Nicola and Wu, Ledell and Popat, Kashyap and Artetxe, Mikel 
    and Goyal, Naman and Plekhanov, Mikhail and Zettlemoyer, Luke 
    and Cancedda, Nicola and Riedel, Sebastian and Petroni, Fabio},
    title = "{Multilingual Autoregressive Entity Linking}",
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {10},
    pages = {274-290},
    year = {2022},
    month = {03},
    issn = {2307-387X},
    doi = {10.1162/tacl_a_00460},
    url = {https://doi.org/10.1162/tacl\_a\_00460},
    eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00460/2004070/tacl\_a\_00460.pdf},
}

Usage

Here is an example of generation for Wikipedia page disambiguation:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# OPTIONAL: load the prefix tree (trie). You need to additionally download
# https://huggingface.co/facebook/mgenre-wiki/blob/main/trie.py and
# https://huggingface.co/facebook/mgenre-wiki/blob/main/titles_lang_all105_trie_with_redirect.pkl
# This is a fast but memory-inefficient prefix tree (trie), implemented with nested Python `dict`s.
# NOTE: loading this map may take up to 10 minutes and occupy a lot of RAM!
# import pickle
# from trie import Trie
# with open("titles_lang_all105_trie_with_redirect.pkl", "rb") as f:
#     trie = Trie.load_from_dict(pickle.load(f))

# Alternatively, use a memory-efficient but somewhat slower prefix tree (trie),
# implemented with `marisa_trie`; download it from
# https://huggingface.co/facebook/mgenre-wiki/blob/main/titles_lang_all105_marisa_trie_with_redirect.pkl
# import pickle
# from genre.trie import MarisaTrie
# with open("titles_lang_all105_marisa_trie_with_redirect.pkl", "rb") as f:
#     trie = pickle.load(f)

tokenizer = AutoTokenizer.from_pretrained("facebook/mgenre-wiki")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mgenre-wiki").eval()

sentences = ["[START] Einstein [END] era un fisico tedesco."]
# Italian for "[START] Einstein [END] was a German physicist."

outputs = model.generate(
    **tokenizer(sentences, return_tensors="pt"),
    num_beams=5,
    num_return_sequences=5,
    # OPTIONAL: use constrained beam search
    # prefix_allowed_tokens_fn=lambda batch_id, sent: trie.get(sent.tolist()),
)

tokenizer.batch_decode(outputs, skip_special_tokens=True)

This outputs the following top-5 predictions (using constrained beam search):

['Albert Einstein >> it',
 'Albert Einstein (disambiguation) >> en',
 'Alfred Einstein >> it',
 'Alberto Einstein >> it',
 'Einstein >> it']
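
Each prediction above has the form `title >> language code`. As a small convenience, the sketch below splits such strings into `(title, lang)` pairs. The helper name `split_prediction` is hypothetical and not part of the model's API, and the `" >> "` separator is assumed from the output shown above.

```python
from typing import List, Tuple


def split_prediction(prediction: str) -> Tuple[str, str]:
    """Split 'Title >> lang' into (title, lang); lang is '' if no separator."""
    # rpartition splits on the LAST separator, so a title that itself
    # contains " >> " would still keep its full text.
    title, sep, lang = prediction.rpartition(" >> ")
    if not sep:
        return prediction.strip(), ""
    return title.strip(), lang.strip()


predictions: List[str] = [
    "Albert Einstein >> it",
    "Albert Einstein (disambiguation) >> en",
    "Einstein >> it",
]
pairs = [split_prediction(p) for p in predictions]
print(pairs)
```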