Model:
distilbert-base-cased
This model is a distilled version of the BERT base model. It was introduced in this paper. The code for the distillation process can be found here. This model is cased: it does make a difference between english and English.
All the training details on the pre-training, the uses, limitations and potential biases (included below) are the same as for DistilBERT-base-uncased. We highly encourage checking it out if you want to know more.
DistilBERT is a transformers model, smaller and faster than BERT, which was pretrained on the same corpus in a self-supervised fashion, using the BERT base model as a teacher. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data), with an automatic process to generate inputs and labels from those texts using the BERT base model. More precisely, it was pretrained with three objectives:
- Distillation loss: the model was trained to return the same probabilities as the BERT base model.
- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words.
- Cosine embedding loss: the model was also trained to generate hidden states as close as possible to those of the BERT base model.
This way, the model learns the same inner representation of the English language as its teacher model, while being faster for inference or downstream tasks.
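For intuition, here is a minimal PyTorch sketch of how those three objectives could be combined into a single training loss. The function name, the temperature value and the equal weighting of the terms are assumptions made for illustration; the actual implementation lives in the distillation code referenced above.

import torch
import torch.nn.functional as F

def distillation_objectives(student_logits, teacher_logits, mlm_labels,
                            student_hidden, teacher_hidden, temperature=2.0):
    # 1. Distillation loss: match the teacher's output distribution
    #    (soft targets with temperature scaling).
    distill_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # 2. Masked language modeling loss: predict the original masked tokens.
    mlm_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        mlm_labels.view(-1),
        ignore_index=-100,
    )
    # 3. Cosine embedding loss: pull the student's hidden states
    #    towards the teacher's.
    target = torch.ones(student_hidden.size(0) * student_hidden.size(1),
                        device=student_hidden.device)
    cosine_loss = F.cosine_embedding_loss(
        student_hidden.view(-1, student_hidden.size(-1)),
        teacher_hidden.view(-1, teacher_hidden.size(-1)),
        target,
    )
    return distill_loss + mlm_loss + cosine_loss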
You can use the raw model for either masked language modeling or next sentence prediction, but it is mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you.
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation, you should look at a model like GPT2.
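As a rough illustration of that fine-tuning use case, the sketch below attaches a sequence classification head to DistilBERT. The example text, the label and the choice of two classes are placeholders; a real setup would add a dataset, an optimizer and a training loop (or the Trainer API).

import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

# Hypothetical two-class sequence classification setup
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

inputs = tokenizer("Replace me by any text you'd like.", return_tensors='pt')
labels = torch.tensor([1])  # made-up label, for illustration only

outputs = model(**inputs, labels=labels)
outputs.loss.backward()  # one backward pass; an optimizer step would follow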
You can use this model directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='distilbert-base-uncased')
>>> unmasker("Hello I'm a [MASK] model.")
[{'sequence': "[CLS] hello i'm a role model. [SEP]",
'score': 0.05292855575680733,
'token': 2535,
'token_str': 'role'},
{'sequence': "[CLS] hello i'm a fashion model. [SEP]",
'score': 0.03968575969338417,
'token': 4827,
'token_str': 'fashion'},
{'sequence': "[CLS] hello i'm a business model. [SEP]",
'score': 0.034743521362543106,
'token': 2449,
'token_str': 'business'},
{'sequence': "[CLS] hello i'm a model model. [SEP]",
'score': 0.03462274372577667,
'token': 2944,
'token_str': 'model'},
{'sequence': "[CLS] hello i'm a modeling model. [SEP]",
'score': 0.018145186826586723,
'token': 11643,
'token_str': 'modeling'}]
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import DistilBertTokenizer, DistilBertModel

# Load the tokenizer and the pretrained model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

text = "Replace me by any text you'd like."
# Tokenize the text and return PyTorch tensors
encoded_input = tokenizer(text, return_tensors='pt')
# output.last_hidden_state contains the features of the text
output = model(**encoded_input)
and in TensorFlow:
from transformers import DistilBertTokenizer, TFDistilBertModel

# Load the tokenizer and the pretrained model (TensorFlow weights)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = TFDistilBertModel.from_pretrained("distilbert-base-uncased")

text = "Replace me by any text you'd like."
# Tokenize the text and return TensorFlow tensors
encoded_input = tokenizer(text, return_tensors='tf')
# output.last_hidden_state contains the features of the text
output = model(encoded_input)
Even if the training data used for this model could be characterized as fairly neutral, this model can have biased predictions. It also inherits some of the bias of its teacher model.
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='distilbert-base-uncased')
>>> unmasker("The White man worked as a [MASK].")
[{'sequence': '[CLS] the white man worked as a blacksmith. [SEP]',
'score': 0.1235365942120552,
'token': 20987,
'token_str': 'blacksmith'},
{'sequence': '[CLS] the white man worked as a carpenter. [SEP]',
'score': 0.10142576694488525,
'token': 10533,
'token_str': 'carpenter'},
{'sequence': '[CLS] the white man worked as a farmer. [SEP]',
'score': 0.04985016956925392,
'token': 7500,
'token_str': 'farmer'},
{'sequence': '[CLS] the white man worked as a miner. [SEP]',
'score': 0.03932540491223335,
'token': 18594,
'token_str': 'miner'},
{'sequence': '[CLS] the white man worked as a butcher. [SEP]',
'score': 0.03351764753460884,
'token': 14998,
'token_str': 'butcher'}]
>>> unmasker("The Black woman worked as a [MASK].")
[{'sequence': '[CLS] the black woman worked as a waitress. [SEP]',
'score': 0.13283951580524445,
'token': 13877,
'token_str': 'waitress'},
{'sequence': '[CLS] the black woman worked as a nurse. [SEP]',
'score': 0.12586183845996857,
'token': 6821,
'token_str': 'nurse'},
{'sequence': '[CLS] the black woman worked as a maid. [SEP]',
'score': 0.11708822101354599,
'token': 10850,
'token_str': 'maid'},
{'sequence': '[CLS] the black woman worked as a prostitute. [SEP]',
'score': 0.11499975621700287,
'token': 19215,
'token_str': 'prostitute'},
{'sequence': '[CLS] the black woman worked as a housekeeper. [SEP]',
'score': 0.04722772538661957,
'token': 22583,
'token_str': 'housekeeper'}]
This bias will also affect all fine-tuned versions of this model.
DistilBERT was pretrained on the same data as BERT, which is BookCorpus, a dataset consisting of 11,038 unpublished books, and English Wikipedia (excluding lists, tables and headers).
The texts are lowercased and tokenized using WordPiece with a vocabulary size of 30,000. The inputs of the model are then of the form:
[CLS] Sentence A [SEP] Sentence B [SEP]
With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus; in the other cases, B is another random sentence from the corpus. Note that what is considered a "sentence" here is a consecutive span of text usually longer than a single sentence. The only constraint is that the combined length of the two "sentences" is less than 512 tokens.
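As a small illustration of that input format (the two example sentences below are made up), the tokenizer inserts the special tokens around the pair and can truncate it to the 512-token limit:

from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

sentence_a = "The cat sat on the mat."
sentence_b = "It then fell asleep in the sun."

# Encoding a pair produces: [CLS] sentence A [SEP] sentence B [SEP]
encoded = tokenizer(sentence_a, sentence_b, truncation=True, max_length=512)
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
# ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '.', '[SEP]', 'it', ...]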
The details of the masking procedure for each sentence are the following:
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by [MASK].
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the 10% remaining cases, the masked tokens are left as is.
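The sketch below reproduces that recipe in PyTorch for a single encoded sequence. It follows the generic BERT-style masking logic rather than the exact DistilBERT training script, and the function name is made up for illustration.

import torch

def mask_tokens(input_ids, tokenizer, mlm_probability=0.15):
    # input_ids: 1-D LongTensor of token ids for one encoded sequence
    labels = input_ids.clone()

    # Select 15% of the (non-special) tokens as prediction targets
    probability_matrix = torch.full(labels.shape, mlm_probability)
    special_tokens_mask = torch.tensor(
        tokenizer.get_special_tokens_mask(labels.tolist(), already_has_special_tokens=True),
        dtype=torch.bool,
    )
    probability_matrix.masked_fill_(special_tokens_mask, value=0.0)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100  # loss is only computed on masked tokens

    # 80% of the masked tokens are replaced by [MASK]
    replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
    input_ids[replaced] = tokenizer.mask_token_id

    # 10% are replaced by a random token (half of the remaining 20%)
    randomized = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~replaced
    random_tokens = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
    input_ids[randomized] = random_tokens[randomized]

    # The remaining 10% are left unchanged
    return input_ids, labels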
The model was trained on 8 16 GB V100 GPUs for 90 hours. See the training code for all hyperparameter details.
When fine-tuned on downstream tasks, this model achieves the following results:
Glue test results:
| Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE |
|---|---|---|---|---|---|---|---|---|
| | 81.5 | 87.8 | 88.2 | 90.4 | 47.2 | 85.5 | 85.6 | 60.6 |
@article{Sanh2019DistilBERTAD,
title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
author={Victor Sanh and Lysandre Debut and Julien Chaumond and Thomas Wolf},
journal={ArXiv},
year={2019},
volume={abs/1910.01108}
}