RoBERTweet is a BERT-based, cased model pre-trained on all Romanian tweets posted between 2008 and 2022.
import torch
from transformers import AutoTokenizer, AutoModel
# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("Iulian277/ro-bert-tweet")
model = AutoModel.from_pretrained("Iulian277/ro-bert-tweet")
# Sanitize the input with the `normalize` function from the `normalize.py`
# script shipped in the model repository (it depends on the `emoji`
# package: pip install emoji)
from normalize import normalize
normalized_text = normalize("Salut, ce faci?")
# Tokenize the sentence and run it through the model
input_ids = torch.tensor(tokenizer.encode(normalized_text, add_special_tokens=True)).unsqueeze(0)  # Batch size 1
with torch.no_grad():  # Inference only; no gradients needed
    outputs = model(input_ids)
# Get the encoding
last_hidden_states = outputs[0]  # The last hidden state is the first element of the output
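To go from token-level states to a single tweet embedding, one common recipe is to mean-pool the last hidden states over the non-padding tokens, using the attention mask returned by the tokenizer. The following is a minimal sketch continuing the example above, not an official part of the model card:

# Minimal sketch (assumption, not from the model card): mean-pool the last
# hidden states over non-padding tokens to obtain one sentence embedding.
encoded = tokenizer(normalized_text, return_tensors="pt")  # also returns the attention mask
with torch.no_grad():
    hidden = model(**encoded).last_hidden_state            # (1, seq_len, hidden_size)
mask = encoded["attention_mask"].unsqueeze(-1).float()     # (1, seq_len, 1)
sentence_embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (1, hidden_size)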
Always normalize the input text with the normalize.py script included in the repository before running it through the tokenizer; otherwise, performance will drop because unseen characters are mapped to [UNK] tokens.
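For intuition only, the hypothetical sketch below shows the kind of clean-up such a normalization step typically performs (converting emojis to textual aliases, which is why the emoji package is a dependency); it is not the actual normalize.py shipped with the repository:

import re
import emoji

def normalize_sketch(text: str) -> str:
    # Hypothetical stand-in for the repository's normalize.py, for illustration only
    text = emoji.demojize(text)               # e.g. "😀" -> ":grinning_face:"
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text

print(normalize_sketch("Salut 😀   ce faci?"))  # -> "Salut :grinning_face: ce faci?"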
We would like to thank the TPU Research Cloud for providing the TPU compute needed to pre-train the RoBERTweet model.