AfriBERTa large is a pretrained multilingual language model with around 126 million parameters. The model has 10 layers, 6 attention heads, 768 hidden units and a feed-forward size of 3072. It was pretrained on 11 African languages, namely Afaan Oromoo (also called Oromo), Amharic, Gahuza (a mixed language containing Kinyarwanda and Kirundi), Hausa, Igbo, Nigerian Pidgin, Somali, Swahili, Tigrinya and Yorùbá. The model has shown competitive downstream performance on text classification and named entity recognition across several African languages, including some it was not pretrained on.
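The architecture details above can be checked directly from the published checkpoint's configuration. The snippet below is a minimal sketch using the generic Transformers AutoConfig API; the attribute names are the standard Hugging Face config fields, not anything specific to this model card.

>>> from transformers import AutoConfig
>>> config = AutoConfig.from_pretrained("castorini/afriberta_large")
# per the description above, these should report 10 layers, 6 attention heads,
# 768 hidden units and a feed-forward (intermediate) size of 3072
>>> config.num_hidden_layers, config.num_attention_heads, config.hidden_size, config.intermediate_size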
You can use this model with the Transformers library for any downstream task. For example, assuming we want to fine-tune it on a token classification task, we can do the following:
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification
>>> model = AutoModelForTokenClassification.from_pretrained("castorini/afriberta_large")
>>> tokenizer = AutoTokenizer.from_pretrained("castorini/afriberta_large")
# we have to manually set the model max length because it is an imported sentencepiece model, which huggingface does not properly support right now
>>> tokenizer.model_max_length = 512
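As a quick sanity check that the model and tokenizer load and run together, we can pass a single sentence through the (not yet fine-tuned) model. This is only an illustrative sketch: the example sentence is a placeholder, and the classification head uses the default number of labels until you configure it for your task.

>>> import torch
>>> inputs = tokenizer("Lagos is the largest city in Nigeria", return_tensors="pt")  # placeholder example text
>>> with torch.no_grad():
...     outputs = model(**inputs)
>>> outputs.logits.shape  # (batch_size, sequence_length, num_labels)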
Limitations and bias
The training data for this model was aggregated from datasets from the BBC news website and Common Crawl.
For details of the training procedure, please refer to the AfriBERTa paper:
@inproceedings{ogueji-etal-2021-small,
title = "Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages",
author = "Ogueji, Kelechi and
Zhu, Yuxin and
Lin, Jimmy",
booktitle = "Proceedings of the 1st Workshop on Multilingual Representation Learning",
month = nov,
year = "2021",
address = "Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.mrl-1.11",
pages = "116--126",
}