T5-small fine-tuned on SQuAD v2

Google's T5 (small) fine-tuned on SQuAD v2 for the Q&A downstream task.

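Details of T5

The T5 model was presented in Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. Here is the abstract:

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.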

Details of the downstream task (Q&A) - Dataset 📚 🧐 ❓

Dataset ID: squad_v2 from Huggingface/NLP

Dataset	Split	# samples
squad_v2	train	130319
squad_v2	valid	11873

How to load it from nlp

import nlp  # note: this library has since been renamed to `datasets`

train_dataset = nlp.load_dataset('squad_v2', split=nlp.Split.TRAIN)
valid_dataset = nlp.load_dataset('squad_v2', split=nlp.Split.VALIDATION)
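
To make the data format concrete, here is a minimal sketch of what a squad_v2 record looks like and how it could be turned into the text-to-text input string used in the inference example on this card. The two sample records are invented for illustration, and the helper names is_answerable and to_model_input are hypothetical. Whether the training script built its inputs exactly this way is an assumption.

```python
# Hand-written records that mimic the squad_v2 schema.
# They are illustrative only and are not real rows from the dataset.
CONTEXT = ("Manuel has created RuPERTa-base with the support of "
           "HF-Transformers and Google")

answerable = {
    "id": "example-0",
    "title": "RuPERTa",
    "context": CONTEXT,
    "question": "Who has supported Manuel?",
    "answers": {
        "text": ["HF-Transformers and Google"],
        # Character offset of the answer span inside the context.
        "answer_start": [CONTEXT.index("HF-Transformers")],
    },
}

# SQuAD v2 adds questions with no answer in the context:
# both answer lists are empty for such records.
unanswerable = {
    "id": "example-1",
    "title": "RuPERTa",
    "context": CONTEXT,
    "question": "When was RuPERTa-base released?",
    "answers": {"text": [], "answer_start": []},
}


def is_answerable(example):
    """Return True when the record carries at least one gold answer."""
    return len(example["answers"]["text"]) > 0


def to_model_input(example):
    """Pack question and context into one string (assumed prompt format)."""
    return "question: %s  context: %s" % (example["question"], example["context"])


for record in (answerable, unanswerable):
    print(is_answerable(record), "|", to_model_input(record))
```

Keeping the answerable check separate from the formatting step makes it easy to count or filter unanswerable questions before fine-tuning.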

Check out more about this dataset and others in NLP Viewer

Model fine-tuning 🏋️

The training script is a slightly modified version of this awesome one by Suraj Patil

Results 📝

Metric	# Value
EM	69.46
F1	73.01
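
For readers unfamiliar with the two metrics, the sketch below shows how exact match (EM) and token-level F1 are conventionally computed for SQuAD-style answers. It mirrors the normalization steps of the official SQuAD evaluation script (lowercasing, dropping punctuation and English articles), but it is an illustrative re-implementation with hypothetical function names, not the code that produced the numbers above.

```python
import collections
import re
import string

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    # For unanswerable questions both sides are empty: full credit only then.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    common = collections.Counter(pred_tokens) & collections.Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def best_scores(prediction, gold_answers):
    """Score against every gold answer and keep the best EM and F1."""
    golds = gold_answers or [""]  # an empty list marks an unanswerable question
    em = max(exact_match(prediction, g) for g in golds)
    f1 = max(f1_score(prediction, g) for g in golds)
    return em, f1


print(best_scores("Google", ["HF-Transformers and Google"]))
```

The reported EM and F1 are these per-question scores averaged over the validation split and multiplied by 100.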

Model in action 🚀

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-small-finetuned-squadv2")
# AutoModelForSeq2SeqLM replaces the deprecated AutoModelWithLMHead for T5 checkpoints
model = AutoModelForSeq2SeqLM.from_pretrained("mrm8488/t5-small-finetuned-squadv2")

def get_answer(question, context):
    # T5 is text-to-text: question and context are packed into a single input string
    input_text = "question: %s  context: %s </s>" % (question, context)
    features = tokenizer([input_text], return_tensors='pt')

    output = model.generate(input_ids=features['input_ids'],
                            attention_mask=features['attention_mask'])

    # Drop special tokens such as <pad> and </s> from the decoded answer
    return tokenizer.decode(output[0], skip_special_tokens=True)

context = "Manuel has created RuPERTa-base (a Spanish RoBERTa) with the support of HF-Transformers and Google"
question = "Who has supported Manuel?"

get_answer(question, context)

# output: 'HF-Transformers and Google'

Created by: Manuel Romero/@mrm8488 | LinkedIn

Made with ♥ in Spain

Author:

Manuel Romero

Dataset size:

419.87 MB