Model: pedramyazdipoor/persian_xlm_roberta_large
Task: Question Answering
XLM-RoBERTa is a multilingual language model pre-trained on 2.5TB of filtered CommonCrawl data covering 100 languages. It was introduced in the paper Unsupervised Cross-lingual Representation Learning at Scale by Conneau et al.
The multilingual model XLM-RoBERTa large for QA on various languages has been fine-tuned on several question-answering datasets, but not yet on PQuAD, the largest Persian question-answering dataset to date. That second model serves as our base model for fine-tuning.
Paper presenting the PQuAD dataset: arXiv:2202.06219
This model was fine-tuned on the PQuAD training set and is ready to use out of the box. Because training takes a very long time, I decided to publish the model to make life easier for those who need it.
I set the batch size to 4 because of the GPU memory limits in Google Colab.
```python
batch_size = 4
n_epochs = 1
base_LM_model = "deepset/xlm-roberta-large-squad2"
max_seq_len = 256
learning_rate = 3e-5
evaluation_strategy = "epoch"
save_strategy = "epoch"
warmup_ratio = 0.1
gradient_accumulation_steps = 8
weight_decay = 0.01
```
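For reference, here is a minimal sketch of how these values could map onto the Hugging Face `TrainingArguments` API. The actual training script is not part of this card, so `output_dir` is a placeholder, and `max_seq_len` belongs to the tokenization step rather than to this object:

```python
from transformers import TrainingArguments

# Hedged sketch: the hyperparameters above expressed as TrainingArguments.
training_args = TrainingArguments(
    output_dir="persian_xlm_roberta_large_pquad",  # hypothetical output path
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=3e-5,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    warmup_ratio=0.1,
    gradient_accumulation_steps=8,
    weight_decay=0.01,
)
```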
Evaluated on the Persian test set of PQuAD (see the official PQuAD link), with the results below. I also trained for more than one epoch, but the results got worse. Our XLM-RoBERTa outperforms our ParsBert on PQuAD, but the former is more than three times larger than the latter, so comparing the two is not fair.
| Metric | Our XLM-Roberta Large | Our ParsBert |
|---|---|---|
| Exact Match | 66.56* | 47.44 |
| F1 | 87.31* | 81.96 |
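Exact Match and F1 are the standard SQuAD-style span metrics. As an illustration only (this is not the official PQuAD evaluation script), they can be computed with the `evaluate` library:

```python
import evaluate

# Hedged sketch: SQuAD-style EM/F1 on a toy prediction; not the official PQuAD evaluation code.
squad_metric = evaluate.load("squad")
predictions = [{"id": "1", "prediction_text": "26"}]
references = [{"id": "1", "answers": {"text": ["26"], "answer_start": [15]}}]
print(squad_metric.compute(predictions=predictions, references=references))
# {'exact_match': 100.0, 'f1': 100.0}
```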
```python
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

path = 'pedramyazdipoor/persian_xlm_roberta_large'
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForQuestionAnswering.from_pretrained(path)
```
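If you just want end-to-end span extraction, the generic question-answering `pipeline` from transformers should also work with this checkpoint. This is our own convenience sketch, not part of the original card, and it bypasses the manual decoding shown below:

```python
from transformers import pipeline

# Hedged sketch: the pipeline handles tokenization and span decoding internally.
qa = pipeline("question-answering", model=path, tokenizer=path)
print(qa(question='چند سالمه؟', context='سلام من پدرامم 26 سالمه'))
```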
There are a few points to keep in mind at inference (they are exactly what generate_indexes below enforces):
1) The start index of the answer must not come after the end index.
2) The answer span must lie within the context, not inside the question.
3) The selected span must be the most probable choice among the top-N candidate start/end pairs.
```python
import numpy as np
import torch  # used in the inference snippet below

def generate_indexes(start_logits, end_logits, N, min_index):
    # Pair every position with its logit score.
    start_indexes = np.arange(len(start_logits))
    start_probs = start_logits
    list_start = dict(zip(start_indexes, start_probs.tolist()))
    end_indexes = np.arange(len(end_logits))
    end_probs = end_logits
    list_end = dict(zip(end_indexes, end_probs.tolist()))
    sorted_start_list = sorted(list_start.items(), key=lambda x: x[1], reverse=True)  # descending sort by score
    sorted_end_list = sorted(list_end.items(), key=lambda x: x[1], reverse=True)
    final_start_idx, final_end_idx = [[] for l in range(2)]
    # Start from the trivial (0, 0) candidate and search the N best start
    # positions against the N best end positions for the highest-scoring pair.
    start_idx, end_idx, prob = 0, 0, (start_probs.tolist()[0] + end_probs.tolist()[0])
    for a in range(0, N):
        for b in range(0, N):
            if (sorted_start_list[a][1] + sorted_end_list[b][1]) > prob:
                # The start must not come after the end and must lie beyond min_index.
                if (sorted_start_list[a][0] <= sorted_end_list[b][0]) and (sorted_start_list[a][0] > min_index):
                    prob = sorted_start_list[a][1] + sorted_end_list[b][1]
                    start_idx = sorted_start_list[a][0]
                    end_idx = sorted_end_list[b][0]
    final_start_idx.append(start_idx)
    final_end_idx.append(end_idx)
    return final_start_idx[0], final_end_idx[0]
```
```python
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.eval().to(device)

text = 'سلام من پدرامم 26 سالمه'
question = 'چند سالمه؟'

encoding = tokenizer(question, text, add_special_tokens=True,
                     return_token_type_ids=True,
                     return_tensors='pt',
                     padding=True,
                     return_offsets_mapping=True,
                     truncation='only_first',
                     max_length=32)
out = model(encoding['input_ids'].to(device), encoding['attention_mask'].to(device), encoding['token_type_ids'].to(device))

# We had to change some pieces of the code to make it compatible with generating one answer at a time.
# If you have unanswerable questions, use out['start_logits'][0][0:] and out['end_logits'][0][0:],
# because <s> (the first token) stands for that case and must be compared against the other tokens.
# You can set min_index in generate_indexes() to force the chosen span to lie within the context
# (the start index must be greater than the index of the separator token).
answer_start_index, answer_end_index = generate_indexes(out['start_logits'][0][1:], out['end_logits'][0][1:], 5, 0)

print(tokenizer.tokenize(text + question))
print(tokenizer.tokenize(text + question)[answer_start_index : (answer_end_index + 1)])
>>> ['▁سلام', '▁من', '▁پدر', 'ام', 'م', '▁26', '▁سالم', 'ه', 'نام', 'م', '▁چیست', '؟']
>>> ['▁26']
```
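The prints above return subword tokens. If you prefer the answer as a plain string, one option (our own addition, not from the original card) is to decode the corresponding input_ids, shifting the indices by one because the logits passed to generate_indexes() had the leading `<s>` position sliced off:

```python
# Hedged sketch: decode the span from input_ids instead of re-tokenizing the text.
# The +1 shift compensates for slicing off the <s> logit before calling generate_indexes().
span_ids = encoding['input_ids'][0][answer_start_index + 1 : answer_end_index + 2]
print(tokenizer.decode(span_ids, skip_special_tokens=True))
```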
We hereby thank Newsha Shahbodaghkhan for facilitating the collection of the dataset.
This is the second version of our Persian XLM-RoBERTa-Large; there were some issues when using the previous version.