Dataset: masakhane/masakhanews
Tasks:
Sub-tasks: topic-classification
Multilinguality: multilingual
Size: 1K<n<10K
Language creators: expert-generated
Annotation creators: expert-generated
Source datasets: original
License:
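MasakhaNEWS is the largest publicly available dataset for news topic classification in 16 languages widely spoken in Africa.
Train/validation/test splits are available for all 16 languages.
[More Information Needed]
16 languages are available (see the table of split sizes below).
An example for Yoruba is shown below: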
from datasets import load_dataset
data = load_dataset('masakhane/masakhanews', 'yor')
# Pass the language code as the second argument ('yor' selects Yoruba).
# An example data point is shown below:
{
'label': 0, 
'headline': "'The barriers to entry have gone - go for it now'", 
'text': "j Lalvani, CEO of Vitabiotics and former Dragons' Den star, shares his business advice for our CEO Secrets series.\nProduced, filmed and edited by Dougal Shaw", 
'headline_text': "'The barriers to entry have gone - go for it now' j Lalvani, CEO of Vitabiotics and former Dragons' Den star, shares his business advice for our CEO Secrets series.\nProduced, filmed and edited by Dougal Shaw", 
'url': '/news/business-61880859'
}
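In the sample record, `headline_text` appears to be the headline and the body text joined by a single space. The sketch below checks this on that one record only. The helper name `join_headline_and_text` is hypothetical, and whether the same rule holds for every record in the dataset is an assumption, not something the card states.

```python
# Sample record copied from the example data point in this card.
example = {
    "label": 0,
    "headline": "'The barriers to entry have gone - go for it now'",
    "text": (
        "j Lalvani, CEO of Vitabiotics and former Dragons' Den star, "
        "shares his business advice for our CEO Secrets series.\n"
        "Produced, filmed and edited by Dougal Shaw"
    ),
    "headline_text": (
        "'The barriers to entry have gone - go for it now' "
        "j Lalvani, CEO of Vitabiotics and former Dragons' Den star, "
        "shares his business advice for our CEO Secrets series.\n"
        "Produced, filmed and edited by Dougal Shaw"
    ),
    "url": "/news/business-61880859",
}


def join_headline_and_text(headline: str, text: str) -> str:
    """Hypothetical helper: concatenate headline and body with one space."""
    return f"{headline} {text}"


rebuilt = join_headline_and_text(example["headline"], example["text"])
# Compare the rebuilt string with the field stored in the record.
print(rebuilt == example["headline_text"])
```

If the comparison fails on other records, treat `headline_text` as an independent field rather than deriving it.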
The news topics correspond to this list:
"business", "entertainment", "health", "politics", "religion", "sports", "technology"
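Every language has three splits.
The original splits were named `train`, `dev` and `test`; they correspond to the `train`, `validation` and `test` splits, respectively.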
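The renaming can be captured in a small lookup table, sketched below. The names `ORIGINAL_TO_HUB_SPLIT` and `hub_split_name` are hypothetical; only the three pairs of split names come from the card.

```python
# Original split name -> split name used in this dataset.
ORIGINAL_TO_HUB_SPLIT = {
    "train": "train",
    "dev": "validation",
    "test": "test",
}


def hub_split_name(original_name: str) -> str:
    """Translate an original split name into the name used here."""
    try:
        return ORIGINAL_TO_HUB_SPLIT[original_name]
    except KeyError:
        raise ValueError(f"unknown split: {original_name!r}") from None


print(hub_split_name("dev"))
```

The returned name can then be passed as the `split` argument of `load_dataset` to load a single split.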
The splits have the following sizes:
| Language | train | validation | test | 
|---|---|---|---|
| Amharic | 1311 | 188 | 376 | 
| English | 3309 | 472 | 948 | 
| French | 1476 | 211 | 422 | 
| Hausa | 2219 | 317 | 637 | 
| Igbo | 1356 | 194 | 390 | 
| Lingala | 608 | 87 | 175 | 
| Luganda | 771 | 110 | 223 | 
| Oromo | 1015 | 145 | 292 | 
| Nigerian-Pidgin | 1060 | 152 | 305 | 
| Rundi | 1117 | 159 | 322 | 
| chiShona | 1288 | 185 | 369 | 
| Somali | 1021 | 148 | 294 | 
| Kiswahili | 1658 | 237 | 476 | 
| Tigrinya | 947 | 137 | 272 | 
| isiXhosa | 1032 | 147 | 297 | 
| Yoruba | 1433 | 206 | 411 | 
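The numbers in the table suggest a split of roughly 70% train, 10% validation and 20% test for each language. This ratio is inferred from the table, not stated in the card; the sketch below recomputes it from the sizes listed above.

```python
# Split sizes copied from the table above: (train, validation, test).
SPLIT_SIZES = {
    "Amharic": (1311, 188, 376),
    "English": (3309, 472, 948),
    "French": (1476, 211, 422),
    "Hausa": (2219, 317, 637),
    "Igbo": (1356, 194, 390),
    "Lingala": (608, 87, 175),
    "Luganda": (771, 110, 223),
    "Oromo": (1015, 145, 292),
    "Nigerian-Pidgin": (1060, 152, 305),
    "Rundi": (1117, 159, 322),
    "chiShona": (1288, 185, 369),
    "Somali": (1021, 148, 294),
    "Kiswahili": (1658, 237, 476),
    "Tigrinya": (947, 137, 272),
    "isiXhosa": (1032, 147, 297),
    "Yoruba": (1433, 206, 411),
}


def split_fractions(train: int, validation: int, test: int):
    """Return each split's share of the language's total example count."""
    total = train + validation + test
    return train / total, validation / total, test / total


for language, sizes in SPLIT_SIZES.items():
    train_frac, val_frac, test_frac = split_fractions(*sizes)
    print(
        f"{language:16s} total={sum(sizes):5d} "
        f"train={train_frac:.1%} validation={val_frac:.1%} test={test_frac:.1%}"
    )
```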
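The dataset was introduced to provide new resources for 20 languages that are under-served in natural language processing.
[More Information Needed]
The source data comes from the news domain. Details can be found here ****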
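Initial data collection and normalization: The articles were word-tokenized, but information on the exact pre-processing pipeline is currently unavailable.
Who are the source language producers? The source language was produced by journalists and writers employed by the news agencies and newspapers mentioned above.
Details can be found here **
Who are the annotators? The annotators come from Masakhane.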
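The data is sourced from newspapers and only contains mentions of public figures or individuals.
[More Information Needed]
[More Information Needed]
Users should keep in mind that the dataset contains only news text, which may limit how well systems developed on it apply to other domains.
The licensing status of the data is CC 4.0 Non-Commercial.
The dataset can be cited as follows: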
@article{Adelani2023MasakhaNEWS,
  title={MasakhaNEWS: News Topic Classification for African languages},
  author={David Ifeoluwa Adelani and Marek Masiak and Israel Abebe Azime and Jesujoba Oluwadara Alabi and Atnafu Lambebo Tonja and Christine Mwase and Odunayo Ogundepo and Bonaventure F. P. Dossou and Akintunde Oladipo and Doreen Nixdorf and Chris Chinenye Emezue and Sana Sabah al-azzawi and Blessing K. Sibanda and Davis David and Lolwethu Ndolela and Jonathan Mukiibi and Tunde Oluwaseyi Ajayi and Tatiana Moteu Ngoli and Brian Odhiambo and Abraham Toluwase Owodunni and Nnaemeka C. Obiefuna and Shamsuddeen Hassan Muhammad and Saheed Salahudeen Abdullahi and Mesay Gemeda Yigezu and Tajuddeen Gwadabe and Idris Abdulmumin and Mahlet Taye Bame and Oluwabusayo Olufunke Awoyomi and Iyanuoluwa Shode and Tolulope Anu Adelani and Habiba Abdulganiy Kailani and Abdul-Hakeem Omotayo and Adetola Adeeko and Afolabi Abeeb and Anuoluwapo Aremu and Olanrewaju Samuel and Clemencia Siro and Wangari Kimotho and Onyekachi Raphael Ogbu and Chinedu E. Mbonu and Chiamaka I. Chukwuneke and Samuel Fanijo and Jessica Ojo and Oyinkansola F. Awosan and Tadesse Kebede Guge and Sakayo Toadoum Sari and Pamela Nyatsine and Freedmore Sidume and Oreen Yousuf and Mardiyyah Oduwole and Ussen Kimanuka and Kanda Patrick Tshinu and Thina Diko and Siyanda Nxakama and Abdulmejid Tuni Johar and Sinodos Gebre and Muhidin Mohamed and Shafie Abdi Mohamed and Fuad Mire Hassan and Moges Ahmed Mehamed and Evrard Ngabire and Pontus Stenetorp},
  journal={ArXiv},
  year={2023},
  volume={}
}
Thanks to @dadelani for adding this dataset.