This is a BERT model for Basque, introduced in BasqueGLUE: A Natural Language Understanding Benchmark for Basque.
To train ElhBERTeu, we collected text from several corpus sources covering multiple domains: national and local news sources updated through 2021, the Basque Wikipedia, and news sources and texts from other domains such as academic and popular science, literature, or subtitles. More details about the corpora used and their sizes are given in the table below. Texts from news sources were oversampled (duplicated), as was done during the training of BERTeus. In total, 575M tokens were used for pre-training ElhBERTeu.
Domain | Size |
---|---|
News | 2 x 224M |
Wikipedia | 40M |
Science | 58M |
Literature | 24M |
Others | 7M |
Total | 575M |
ElhBERTeu is a base-size, cased, monolingual BERT model for Basque, with a vocabulary size of 50K and 124M parameters in total.
A medium-size model is also available: ElhBERTeu-medium
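For reference, here is a minimal usage sketch with the Hugging Face transformers library. It assumes the model is published on the Hugging Face Hub under an identifier like `orai-nlp/ElhBERTeu` and that the tokenizer uses `[MASK]` as its mask token; adjust the identifier if your copy lives elsewhere.

```python
# Minimal fill-mask sketch for ElhBERTeu. The Hub id "orai-nlp/ElhBERTeu"
# is an assumption; substitute a local path or another identifier if needed.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="orai-nlp/ElhBERTeu")

# A Basque example sentence with the masked word "hiri" (city);
# the pipeline returns the model's top candidates for the masked position.
for pred in fill_mask("Bilbo [MASK] handi bat da."):
    print(f"{pred['token_str']}\t{pred['score']:.3f}")
```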
ElhBERTeu was trained following the design decisions of BERTeus. The tokenizer and the hyper-parameter settings remained the same (batch_size=256); the only difference is that the full pre-training of the model (1M steps) was performed with a sequence length of 512 on a v3-8 TPU.
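The figures above pin down the architecture. As an illustration only (not the authors' training code), the following sketch shows a BertConfig matching a base-size model with a 50K vocabulary and 512-position embeddings; the exact vocabulary size and special-token details are assumptions.

```python
# Illustrative BertConfig for a base-size model as described above.
# Values follow BERT-base defaults; vocab_size=50000 is an assumption
# based on the stated 50K vocabulary.
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    vocab_size=50000,             # ~50K wordpiece vocabulary
    hidden_size=768,              # base-size model
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=512,  # full pre-training at sequence length 512
)

model = BertForMaskedLM(config)
# Should print roughly 124M, consistent with the stated parameter count.
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```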
The model was evaluated on the recently created BasqueGLUE NLU benchmark.
Model | AVG | NERC (F1) | F_intent (F1) | F_slot (F1) | BHTC (F1) | BEC (F1) | Vaxx (MF1) | QNLI (acc) | WiC (acc) | coref (acc) |
---|---|---|---|---|---|---|---|---|---|---|
BERTeus | 73.23 | 81.92 | 82.52 | 74.34 | 78.26 | 69.43 | 59.30 | 74.26 | 70.71 | 68.31 |
ElhBERTeu | 73.71 | 82.30 | 82.24 | 75.64 | 78.05 | 69.89 | 63.81 | 73.84 | 71.71 | 65.93 |
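By way of illustration, fine-tuning the model on one of the classification tasks above might look like the sketch below. The dataset identifier, column names, and label count are placeholders for illustration, not the actual BasqueGLUE loading code; consult the benchmark's release for the real ones.

```python
# Hypothetical fine-tuning sketch for a BasqueGLUE-style sentence
# classification task. The model id, the dataset path, and the fields
# ("text", "label", num_labels=3) are assumptions for illustration.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)
from datasets import load_dataset

model_id = "orai-nlp/ElhBERTeu"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)

dataset = load_dataset("path/to/basqueglue-task")  # placeholder identifier

def tokenize(batch):
    # Truncate to the model's maximum sequence length of 512.
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    data_collator=DataCollatorWithPadding(tokenizer),  # pad per batch
)
trainer.train()
```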
If you use this model, please cite the following paper:
    @InProceedings{urbizu2022basqueglue,
      author    = {Urbizu, Gorka and San Vicente, Iñaki and Saralegi, Xabier and Agerri, Rodrigo and Soroa, Aitor},
      title     = {BasqueGLUE: A Natural Language Understanding Benchmark for Basque},
      booktitle = {Proceedings of the Language Resources and Evaluation Conference},
      month     = {June},
      year      = {2022},
      address   = {Marseille, France},
      publisher = {European Language Resources Association},
      pages     = {1603--1612},
      abstract  = {Natural Language Understanding (NLU) technology has improved significantly over the last few years and multitask benchmarks such as GLUE are key to evaluate this improvement in a robust and general way. These benchmarks take into account a wide and diverse set of NLU tasks that require some form of language understanding, beyond the detection of superficial, textual clues. However, they are costly to develop and language-dependent, and therefore they are only available for a small number of languages. In this paper, we present BasqueGLUE, the first NLU benchmark for Basque, a less-resourced language, which has been elaborated from previously existing datasets and following similar criteria to those used for the construction of GLUE and SuperGLUE. We also report the evaluation of two state-of-the-art language models for Basque on BasqueGLUE, thus providing a strong baseline to compare upon. BasqueGLUE is freely available under an open license.},
      url       = {https://aclanthology.org/2022.lrec-1.172}
    }
License: CC BY 4.0