Model:
Musixmatch/umberto-wikipedia-uncased-v1
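UmBERTo is a RoBERTa-based language model trained on large Italian corpora. It uses two notable techniques: SentencePiece tokenization and Whole Word Masking. It is now available at github.com/huggingface/transformers.
Marco Lodola, Monument to Umberto Eco, Alessandria 2019
UmBERTo-Wikipedia-Uncased is trained on a relatively small corpus (about 7 GB) extracted from Wikipedia-ITA.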
| Model | WWM | Cased | Tokenizer | Vocab Size | Train Steps | Download |
|---|---|---|---|---|---|---|
| umberto-wikipedia-uncased-v1 | YES | NO | SPM | 32K | 100k | 1236321 |
The model is trained with SentencePiece and Whole Word Masking.
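To make the idea concrete, here is a minimal sketch of Whole Word Masking over SentencePiece-style pieces. It is not the training code used for UmBERTo: the helper `whole_word_mask`, its parameters, and the example segmentation are hypothetical, and the sketch skips the random/keep replacement variants that RoBERTa-style masking normally applies. It only assumes the SentencePiece convention that a piece starting with `▁` opens a new word.

```python
import random

WORD_START = "\u2581"  # SentencePiece marker "▁" that opens a new word


def whole_word_mask(pieces, mask_prob=0.15, mask_token="<mask>", seed=0):
    """Hypothetical helper: mask all pieces of each selected word together."""
    rng = random.Random(seed)

    # Group piece indices into words: a piece starting with "▁" opens a new word.
    words = []
    for i, piece in enumerate(pieces):
        if piece.startswith(WORD_START) or not words:
            words.append([i])
        else:
            words[-1].append(i)

    # Select whole words, never individual pieces.
    masked = list(pieces)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked, words


# Illustrative segmentation only; the real tokenizer may split differently.
pieces = ["▁umberto", "▁eco", "▁è", "▁stato", "▁un", "▁grande", "▁scrit", "tore"]
masked, words = whole_word_mask(pieces, mask_prob=0.5, seed=1)
print(masked)
```

Because selection happens per word, a word split into several pieces (here the assumed `▁scrit` + `tore`) is either fully masked or left fully intact.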
These results refer to the umberto-wikipedia-uncased model. For full details, see the official UmBERTo page.
Named Entity Recognition (NER)
| Dataset | F1 | Precision | Recall | Accuracy |
|---|---|---|---|---|
| ICAB-EvalITA07 | 86.240 | 85.939 | 86.544 | 98.534 |
| WikiNER-ITA | 90.483 | 90.328 | 90.638 | 98.661 |
Part of Speech (POS)
| Dataset | F1 | Precision | Recall | Accuracy |
|---|---|---|---|---|
| UD_Italian-ISDT | 98.563 | 98.508 | 98.618 | 98.717 |
| UD_Italian-ParTUT | 97.810 | 97.835 | 97.784 | 98.060 |
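As a quick sanity check on how the columns relate, the sketch below recomputes F1 as the harmonic mean of the reported Precision and Recall. The function name `f1_score` is just an illustrative helper; the numbers are copied from the tables above, and small differences in the last digit are expected because the published inputs are already rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), copied from the result tables.
reported = {
    "ICAB-EvalITA07": (85.939, 86.544, 86.240),
    "WikiNER-ITA": (90.328, 90.638, 90.483),
    "UD_Italian-ISDT": (98.508, 98.618, 98.563),
    "UD_Italian-ParTUT": (97.835, 97.784, 97.810),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: recomputed F1 = {f1_score(p, r):.3f}, reported = {f1:.3f}")
```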
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")
umberto = AutoModel.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")
encoded_input = tokenizer.encode("Umberto Eco è stato un grande scrittore")
input_ids = torch.tensor(encoded_input).unsqueeze(0) # Batch size 1
outputs = umberto(input_ids)
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output
Predict masked token:
from transformers import pipeline
fill_mask = pipeline(
"fill-mask",
model="Musixmatch/umberto-wikipedia-uncased-v1",
tokenizer="Musixmatch/umberto-wikipedia-uncased-v1"
)
result = fill_mask("Umberto Eco è <mask> un grande scrittore")
# {'sequence': '<s> umberto eco è stato un grande scrittore</s>', 'score': 0.5784581303596497, 'token': 361}
# {'sequence': '<s> umberto eco è anche un grande scrittore</s>', 'score': 0.33813193440437317, 'token': 269}
# {'sequence': '<s> umberto eco è considerato un grande scrittore</s>', 'score': 0.027196012437343597, 'token': 3236}
# {'sequence': '<s> umberto eco è diventato un grande scrittore</s>', 'score': 0.013716378249228, 'token': 5742}
# {'sequence': '<s> umberto eco è inoltre un grande scrittore</s>', 'score': 0.010662357322871685, 'token': 1030}
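The commented results above are a list of dictionaries with `sequence`, `score`, and `token` keys. The sketch below shows one way to post-process them, assuming exactly that output format (other `transformers` versions may return different keys or omit the `<s>`/`</s>` markers). The helpers `clean_sequence` and `top_predictions` are hypothetical, and the data is copied from the comments so the example runs without downloading the model.

```python
# Results copied from the commented pipeline output above.
results = [
    {"sequence": "<s> umberto eco è stato un grande scrittore</s>", "score": 0.5784581303596497, "token": 361},
    {"sequence": "<s> umberto eco è anche un grande scrittore</s>", "score": 0.33813193440437317, "token": 269},
    {"sequence": "<s> umberto eco è considerato un grande scrittore</s>", "score": 0.027196012437343597, "token": 3236},
    {"sequence": "<s> umberto eco è diventato un grande scrittore</s>", "score": 0.013716378249228, "token": 5742},
    {"sequence": "<s> umberto eco è inoltre un grande scrittore</s>", "score": 0.010662357322871685, "token": 1030},
]


def clean_sequence(sequence):
    """Strip the <s> and </s> sentence markers from a predicted sequence."""
    return sequence.replace("<s>", "").replace("</s>", "").strip()


def top_predictions(results, k=3):
    """Return the k highest-scoring sequences as (text, rounded score) pairs."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(clean_sequence(r["sequence"]), round(r["score"], 4)) for r in ranked[:k]]


for text, score in top_predictions(results):
    print(f"{score:.4f}  {text}")
```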
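All of the original datasets are publicly available or were released with the owners' permission. The datasets are all released under a CC0 or CC BY license.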
@inproceedings{magnini2006annotazione,
title = {Annotazione di contenuti concettuali in un corpus italiano: I-CAB},
author = {Magnini, Bernardo and Cappelli, Amedeo and Pianta, Emanuele and Speranza, Manuela and Bartalesi Lenzi, V and Sprugnoli, Rachele and Romano, Lorenza and Girardi, Christian and Negri, Matteo},
booktitle = {Proc. of SILFI 2006},
year = {2006}
}
@inproceedings{magnini2006cab,
title = {I-CAB: the Italian Content Annotation Bank.},
author = {Magnini, Bernardo and Pianta, Emanuele and Girardi, Christian and Negri, Matteo and Romano, Lorenza and Speranza, Manuela and Lenzi, Valentina Bartalesi and Sprugnoli, Rachele},
booktitle = {LREC},
pages = {963--968},
year = {2006},
organization = {Citeseer}
}
Loreto Parisi: loreto at musixmatch dot com, loretoparisi
Simone Francia: simone.francia at musixmatch dot com, simonefrancia
Paolo Magnani: paul.magnani95 at gmail dot com, paulthemagno
We do Machine Learning and Artificial Intelligence @musixmatch
Follow us on Twitter and GitHub