Small-E-Czech

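Small-E-Czech is an Electra-small model pretrained on a Czech web corpus created at Seznam and introduced in an IAAI 2022 paper. Like other pretrained models, it should be finetuned on a downstream task of interest before use. At Seznam.cz, it has helped improve web search ranking, query typo correction, and clickbait title detection. We release it under the CC BY 4.0 license (i.e. allowing commercial use). To raise an issue, please visit our github.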

How to use the discriminator in transformers

from transformers import ElectraForPreTraining, ElectraTokenizerFast
import torch

discriminator = ElectraForPreTraining.from_pretrained("Seznam/small-e-czech")
tokenizer = ElectraTokenizerFast.from_pretrained("Seznam/small-e-czech")

sentence = "Za hory, za doly, mé zlaté parohy"
fake_sentence = "Za hory, za doly, kočka zlaté parohy"

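# Tokens for display, including the special tokens that encode() adds automatically
fake_sentence_tokens = ["[CLS]"] + tokenizer.tokenize(fake_sentence) + ["[SEP]"]
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")
outputs = discriminator(fake_inputs)
# outputs[0] holds per-token logits; the sigmoid maps them to probabilities of being fake
predictions = torch.sigmoid(outputs[0]).cpu().detach().numpy()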

for token in fake_sentence_tokens:
    print("{:>7s}".format(token), end="")
print()

for prediction in predictions.squeeze():
    print("{:7.1f}".format(prediction), end="")
print()

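In the output we can see the probabilities, according to the discriminator, that particular tokens do not belong in the sentence (i.e. were faked by the generator):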

  [CLS]     za   hory      ,     za    dol    ##y      ,  kočka  zlaté   paro   ##hy  [SEP]
    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.8    0.3    0.2    0.1    0.0
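
One way to act on these per-token probabilities is to flag every token whose probability exceeds a cutoff. The following is a minimal sketch rather than part of the released model: the helper name `flag_suspicious_tokens` and the default threshold of 0.5 are assumptions chosen for illustration, and the token and probability values are copied by hand from the output above so the example runs without downloading the model.

```python
def flag_suspicious_tokens(tokens, probabilities, threshold=0.5):
    """Return (token, probability) pairs whose probability exceeds the threshold.

    Hypothetical helper for illustration; the threshold is an assumption,
    not a value recommended by the model authors.
    """
    if len(tokens) != len(probabilities):
        raise ValueError("tokens and probabilities must have the same length")
    return [(t, p) for t, p in zip(tokens, probabilities) if p > threshold]


# Values copied from the example output above
tokens = ["[CLS]", "za", "hory", ",", "za", "dol", "##y", ",",
          "kočka", "zlaté", "paro", "##hy", "[SEP]"]
probabilities = [0.0] * 8 + [0.8, 0.3, 0.2, 0.1, 0.0]

flagged = flag_suspicious_tokens(tokens, probabilities)
print(flagged)
```

With the live model, the same helper could be given `fake_sentence_tokens` and `predictions.squeeze().tolist()` from the snippet above. A suitable threshold would likely need tuning on held-out data for the task at hand.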

Finetuning

For instructions on how to finetune the model on a new task, see the official HuggingFace transformers tutorial.

Author:

Seznam.cz

Dataset size:

104.6 MB