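Dataset:
Muennighoff/xP3x
xP3x (Crosslingual Public Pool of Prompts eXtended) is a collection of prompts and datasets across 277 languages and 16 NLP tasks. It contains all of xP3 and much more. It is used to train future contenders of mT0 and BLOOMZ at Project Aya @ C4AI.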
| Name | Explanation | Example models |
|---|---|---|
| 1236321 | Mixture of 17 tasks in 277 languages with English prompts | WIP - Join us at Project Aya @ 1237321 to help! |
| 1238321 | Mixture of 13 training tasks in 46 languages with English prompts | 1239321 & 12310321 |
| 12311321 | Mixture of 13 training tasks in 46 languages with prompts in 20 languages (machine-translated from English) | 12312321 & 12313321 |
| 12314321 | xP3 + evaluation datasets adding an additional 3 tasks for a total of 16 tasks in 46 languages with English prompts | |
| 12315321 | 12316321 processed version of xP3 | 1239321 |
| 12318321 | Re-preprocessed version of the English-only 12319321 with 8 training tasks | 12320321 & 12321321 |
An example looks as follows:
{
'inputs': '11月、遂にクロームはファイヤーフォックスを引き離し始めた。_はインターネットユーザーの評価が高まったのだ。\nReplace the _ in the above sentence with the correct option: \n- ファイヤーフォックス\n- クローム',
'targets': 'クローム',
'language': 'jpn_Jpan',
'split': 'test',
'template': 'Replace',
'dataset': 'Muennighoff/xwinograd',
'config': 'jp'
}
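The data fields are the same among all splits: `inputs`, `targets`, `language`, `split`, `template`, `dataset`, and `config`, as shown in the example above.
The dataset is 680 gigabytes in size and contains 530 million samples. You may want to filter it, and possibly deduplicate it, to suit your needs.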
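A minimal sketch of the deduplication step, assuming that exact duplicates on the `inputs` and `targets` fields are what you want to drop. The helper name `dedup_records` and the choice of SHA-256 digests are illustrative assumptions and not part of the dataset's own tooling. The function accepts any iterable of record dictionaries, so it should also work on a streamed split.

```python
import hashlib
from typing import Dict, Iterable, Iterator


def dedup_records(records: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Yield each record whose (inputs, targets) pair has not been seen before."""
    seen = set()
    for rec in records:
        # Hash the pair so the set stores fixed-size digests, not full prompt texts.
        key = hashlib.sha256(
            (rec["inputs"] + "\x00" + rec["targets"]).encode("utf-8")
        ).digest()
        if key not in seen:
            seen.add(key)
            yield rec


# Toy records standing in for real samples (hypothetical data).
sample = [
    {"inputs": "a", "targets": "x"},
    {"inputs": "a", "targets": "x"},  # exact duplicate, dropped
    {"inputs": "a", "targets": "y"},  # same inputs, different targets, kept
]
unique = list(dedup_records(sample))
print(len(unique))
```

The set of digests grows with the number of unique samples, so for data at this scale it is probably more practical to deduplicate one language or one source dataset at a time.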
Loading by language:
# pip install -q datasets
from datasets import load_dataset
ds = load_dataset("Muennighoff/xP3x", "zho_Hans", streaming=True) # Use streaming to not download all at once
for x in ds["train"]:
    print(x)
    break
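You can then filter on the data fields, for example to keep only specific configs or datasets. Because every dataset-config-template combination is stored as its own jsonl file, you can also decide which datasets, configs, and templates you want and download only those files. For example, to download all Japanese xwinograd samples, you could do: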
# pip install -q datasets
# pip install --upgrade huggingface-hub
from datasets import load_dataset
from huggingface_hub import HfFileSystem, hf_hub_url

# List every file under the Japanese (jpn_Jpan) folder whose name contains "xwinograd".
fs = HfFileSystem()
fps = fs.glob("datasets/Muennighoff/xP3x/data/jpn_Jpan/*xwinograd*")
# Turn each repository path into a downloadable URL.
resolved_paths = [fs.resolve_path(file) for file in fps]
data_files = [hf_hub_url(resolved_path.repo_id, resolved_path.path_in_repo, repo_type=resolved_path.repo_type) for resolved_path in resolved_paths]
# Load only those files, using 8 worker processes.
ds = load_dataset("json", data_files=data_files, num_proc=8)["train"]
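The field-based filtering mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper `filter_records` is a hypothetical name, and the toy records only mimic the shape of the example record shown earlier (the second record's dataset name is made up).

```python
from typing import Dict, Iterable, Iterator


def filter_records(
    records: Iterable[Dict[str, str]], **wanted: str
) -> Iterator[Dict[str, str]]:
    """Yield records whose fields equal every keyword argument given."""
    for rec in records:
        if all(rec.get(field) == value for field, value in wanted.items()):
            yield rec


# Toy records with the same fields as the example record (hypothetical data).
sample = [
    {"inputs": "q1", "targets": "a1", "language": "jpn_Jpan",
     "dataset": "Muennighoff/xwinograd", "config": "jp", "template": "Replace"},
    {"inputs": "q2", "targets": "a2", "language": "jpn_Jpan",
     "dataset": "example/other_dataset", "config": "default", "template": "Replace"},
]
kept = list(filter_records(sample, dataset="Muennighoff/xwinograd", config="jp"))
print(len(kept))
```

Because the helper only iterates once and yields lazily, it could in principle be applied to a streamed split such as `ds["train"]` from the language-loading snippet without downloading everything first.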
| Language | Code | Kilobytes | % of kilobytes | Samples | % of samples |
|---|---|---|---|---|---|
| Emilian | egl_Latn | 104 | 0.0 | 402 | 0.0 |
| Swiss German | gsw_Latn | 104 | 0.0 | 408 | 0.0 |
| Novial | nov_Latn | 116 | 0.0 | 432 | 0.0 |
| Ainu (Latin script) | ain_Latn | 120 | 0.0 | 410 | 0.0 |
| Chamorro | cha_Latn | 120 | 0.0 | 452 | 0.0 |
| Gothic | got_Goth | 120 | 0.0 | 402 | 0.0 |
| Prussian | prg_Latn | 120 | 0.0 | 424 | 0.0 |
| Picard | pcd_Latn | 140 | 0.0 | 530 | 0.0 |
| Northern Frisian | frr_Latn | 156 | 0.0 | 554 | 0.0 |
| Uzbek (Latin script) | uzb_Latn | 156 | 0.0 | 600 | 0.0 |
| Ottoman Turkish (Latin script) | ota_Latn | 188 | 0.0 | 632 | 0.0 |
| Swahili (macrolanguage) | swa_Latn | 212 | 0.0 | 772 | 0.0 |
| Talossan | tzl_Latn | 220 | 0.0 | 836 | 0.0 |
| Kven Finnish | fkv_Latn | 260 | 0.0 | 910 | 0.0 |
| Zaza | zza_Latn | 260 | 0.0 | 1,056 | 0.0 |
| Frisian | fry_Latn | 268 | 0.0 | 956 | 0.0 |
| Piemontese | pms_Latn | 276 | 0.0 | 998 | 0.0 |
| Kalmyk | xal_Cyrl | 288 | 0.0 | 976 | 0.0 |
| Hunsrik | hrx_Latn | 352 | 0.0 | 1,380 | 0.0 |
| Romany | rom_Latn | 364 | 0.0 | 1,410 | 0.0 |
| Ancient Greek (to 1453) | grc_Grek | 392 | 0.0 | 1,226 | 0.0 |
| Tase Naga | nst_Latn | 424 | 0.0 | 1,608 | 0.0 |
| Albanian | sqi_Latn | 596 | 0.0 | 2,216 | 0.0 |
| Guadeloupean Creole French | gcf_Latn | 608 | 0.0 | 2,326 | 0.0 |
| Yakut | sah_Cyrl | 608 | 0.0 | 1,986 | 0.0 |
| Ho (Latin script) | hoc_Latn | 632 | 0.0 | 2,634 | 0.0 |
| Khasi | kha_Latn | 676 | 0.0 | 2,664 | 0.0 |
| Algerian Arabic | arq_Arab | 688 | 0.0 | 2,278 | 0.0 |
| Lower Sorbian | dsb_Latn | 692 | 0.0 | 2,596 | 0.0 |
| Chuvash | chv_Cyrl | 716 | 0.0 | 2,446 | 0.0 |
| Old Russian | orv_Cyrl | 752 | 0.0 | 2,586 | 0.0 |
| Pampanga | pam_Latn | 784 | 0.0 | 2,984 | 0.0 |
| Kurdish (Latin script) | kur_Latn | 796 | 0.0 | 3,050 | 0.0 |
| Ottoman Turkish | ota_Arab | 832 | 0.0 | 2,772 | 0.0 |
| Kotava | avk_Latn | 864 | 0.0 | 3,118 | 0.0 |
| Upper Sorbian | hsb_Latn | 900 | 0.0 | 3,474 | 0.0 |
| Buryat | bua_Cyrl | 924 | 0.0 | 3,218 | 0.0 |
| Swabian | swg_Latn | 996 | 0.0 | 3,366 | 0.0 |
| Coastal Kadazan | kzj_Latn | 1,136 | 0.0 | 3,766 | 0.0 |
| Chavacano | cbk_Latn | 1,352 | 0.0 | 4,994 | 0.0 |
| Quechua | que_Latn | 1,704 | 0.0 | 5,312 | 0.0 |
| Lingua Franca Nova (Cyrillic script) | lfn_Cyrl | 1,740 | 0.0 | 5,458 | 0.0 |
| Gronings | gos_Latn | 1,864 | 0.0 | 7,462 | 0.0 |
| Volapük | vol_Latn | 1,948 | 0.0 | 7,712 | 0.0 |
| Yue Chinese (Simplified) | yue_Hans | 2,300 | 0.0 | 7,872 | 0.0 |
| Mari (Russia) | chm_Cyrl | 2,540 | 0.0 | 7,496 | 0.0 |
| Kadazan Dusun | dtp_Latn | 2,548 | 0.0 | 8,892 | 0.0 |
| Breton | bre_Latn | 3,048 | 0.0 | 11,868 | 0.0 |
| Ladino | lad_Latn | 3,224 | 0.0 | 11,916 | 0.0 |
| Cornish | cor_Latn | 3,492 | 0.0 | 13,880 | 0.0 |
| Interlingue | ile_Latn | 3,700 | 0.0 | 14,468 | 0.0 |
| Wu Chinese | wuu_Hans | 3,784 | 0.0 | 13,062 | 0.0 |
| Japanese (Katakana) | jpn_Kana | 4,208 | 0.0 | 13,942 | 0.0 |
| Ido | ido_Latn | 6,180 | 0.0 | 23,742 | 0.0 |
| Yiddish | yid_Hebr | 9,896 | 0.0 | 34,412 | 0.01 |
| Klingon | tlh_Latn | 11,716 | 0.0 | 46,010 | 0.01 |
| Lingua Franca Nova | lfn_Latn | 13,328 | 0.0 | 46,826 | 0.01 |
| Lojban | jbo_Latn | 17,468 | 0.0 | 66,694 | 0.01 |
| Low German | nds_Latn | 18,364 | 0.0 | 68,098 | 0.01 |
| Interlingua (International Auxiliary Language Association) | ina_Latn | 25,700 | 0.0 | 76,584 | 0.01 |
| Java | java | 25,904 | 0.0 | 13,551 | 0.0 |
| Japanese (Kanji) | jpn_Hani | 26,292 | 0.0 | 89,978 | 0.02 |
| Norwegian | nor_Latn | 26,724 | 0.0 | 93,116 | 0.02 |
| Toki Pona | toki_Latn | 26,808 | 0.0 | 97,170 | 0.02 |
| Latin | lat_Latn | 28,900 | 0.0 | 101,390 | 0.02 |
| Serbo-Croatian | hbs_Latn | 29,452 | 0.0 | 105,748 | 0.02 |
| Nigerian Pidgin | pcm_Latn | 145,872 | 0.02 | 88,992 | 0.02 |
| Azerbaijani (South or North; Latin script) | aze_Latn | 147,564 | 0.02 | 77,875 | 0.01 |
| Serbian (Latin script) | srp_Latn | 179,072 | 0.03 | 131,101 | 0.02 |
| Japanese (Hiragana) | jpn_Hira | 188,944 | 0.03 | 628,758 | 0.12 |
| Berber (Latin script) | ber_Latn | 201,464 | 0.03 | 693,602 | 0.13 |
| Jupyter Notebook | jupyter_notebook | 416,056 | 0.06 | 400,000 | 0.08 |
| Yue Chinese | yue_Hant | 613,352 | 0.09 | 1,227,429 | 0.23 |
| Haitian Creole | hat_Latn | 629,420 | 0.09 | 1,228,281 | 0.23 |
| Mossi | mos_Latn | 630,416 | 0.09 | 1,223,481 | 0.23 |
| Pangasinan | pag_Latn | 630,684 | 0.09 | 1,223,481 | 0.23 |
| Twi | twi_Latn | 631,172 | 0.09 | 1,223,481 | 0.23 |
| Bosnian | bos_Latn | 633,016 | 0.09 | 1,224,479 | 0.23 |
| Ewe | ewe_Latn | 633,292 | 0.09 | 1,223,481 | 0.23 |
| Bambara | bam_Latn | 634,520 | 0.09 | 1,223,481 | 0.23 |
| Javanese | jav_Latn | 635,248 | 0.09 | 1,224,003 | 0.23 |
| Southwestern Dinka | dik_Latn | 635,416 | 0.09 | 1,223,481 | 0.23 |
| Kabuverdianu | kea_Latn | 636,144 | 0.09 | 1,223,481 | 0.23 |
| Dyula | dyu_Latn | 636,464 | 0.09 | 1,223,481 | 0.23 |
| Venetian | vec_Latn | 637,412 | 0.09 | 1,223,481 | 0.23 |
| Chokwe | cjk_Latn | 637,532 | 0.09 | 1,223,481 | 0.23 |
| Latgalian | ltg_Latn | 637,612 | 0.09 | 1,223,481 | 0.23 |
| Sundanese | sun_Latn | 638,120 | 0.09 | 1,223,481 | 0.23 |
| Asturian | ast_Latn | 638,708 | 0.09 | 1,223,481 | 0.23 |
| Akan | aka_Latn | 639,648 | 0.09 | 1,223,481 | 0.23 |
| Mizo | lus_Latn | 639,680 | 0.09 | 1,223,481 | 0.23 |
| Guarani | grn_Latn | 641,540 | 0.09 | 1,225,647 | 0.23 |
| Limburgish | lim_Latn | 642,368 | 0.09 | 1,223,481 | 0.23 |
| Faroese | fao_Latn | 642,432 | 0.09 | 1,224,067 | 0.23 |
| Buginese | bug_Latn | 643,472 | 0.09 | 1,223,481 | 0.23 |
| Sango | sag_Latn | 643,596 | 0.09 | 1,223,481 | 0.23 |
| Luba-Kasai | lua_Latn | 643,640 | 0.09 | 1,223,481 | 0.23 |
| Papiamento | pap_Latn | 643,648 | 0.09 | 1,223,481 | 0.23 |
| Silesian | szl_Latn | 644,608 | 0.09 | 1,223,481 | 0.23 |
| Sicilian | scn_Latn | 645,636 | 0.1 | 1,223,481 | 0.23 |
| Kimbundu | kmb_Latn | 645,964 | 0.1 | 1,223,481 | 0.23 |
| Basque | eus_Latn | 646,084 | 0.1 | 1,246,877 | 0.23 |
| Balinese | ban_Latn | 646,408 | 0.1 | 1,223,481 | 0.23 |
| Norwegian Nynorsk | nno_Latn | 646,996 | 0.1 | 1,229,699 | 0.23 |
| Central Aymara | ayr_Latn | 647,236 | 0.1 | 1,223,481 | 0.23 |
| Tamasheq (Latin script) | taq_Latn | 648,656 | 0.1 | 1,223,481 | 0.23 |
| Kikongo | kon_Latn | 648,992 | 0.1 | 1,223,481 | 0.23 |
| Friulian | fur_Latn | 649,272 | 0.1 | 1,223,481 | 0.23 |
| Ayacucho Quechua | quy_Latn | 649,992 | 0.1 | 1,223,481 | 0.23 |
| Maori | mri_Latn | 650,336 | 0.1 | 1,224,211 | 0.23 |
| Icelandic | isl_Latn | 650,372 | 0.1 | 1,246,623 | 0.23 |
| Galician | glg_Latn | 652,088 | 0.1 | 1,233,291 | 0.23 |
| Catalan | cat_Latn | 652,116 | 0.1 | 1,241,381 | 0.23 |
| Lombard | lmo_Latn | 652,120 | 0.1 | 1,223,481 | 0.23 |
| Banjar (Latin script) | bjn_Latn | 652,372 | 0.1 | 1,223,481 | 0.23 |
| Fijian | fij_Latn | 652,796 | 0.1 | 1,223,481 | 0.23 |
| Crimean Tatar | crh_Latn | 653,920 | 0.1 | 1,223,895 | 0.23 |
| Northern Kurdish | kmr_Latn | 654,108 | 0.1 | 1,223,481 | 0.23 |
| Ligurian | lij_Latn | 654,432 | 0.1 | 1,223,481 | 0.23 |
| Occitan | oci_Latn | 655,676 | 0.1 | 1,227,945 | 0.23 |
| Turkmen | tuk_Latn | 658,672 | 0.1 | 1,241,205 | 0.23 |
| Luxembourgish | ltz_Latn | 658,768 | 0.1 | 1,225,339 | 0.23 |
| Cebuano | ceb_Latn | 659,124 | 0.1 | 1,226,039 | 0.23 |
| Samoan | smo_Latn | 659,704 | 0.1 | 1,223,481 | 0.23 |
| Sardinian | srd_Latn | 660,000 | 0.1 | 1,223,481 | 0.23 |
| Bemba | bem_Latn | 660,504 | 0.1 | 1,223,481 | 0.23 |
| Minangkabau (Latin script) | min_Latn | 660,672 | 0.1 | 1,223,481 | 0.23 |
| Acehnese (Latin script) | ace_Latn | 661,084 | 0.1 | 1,223,481 | 0.23 |
| Ilocano | ilo_Latn | 661,184 | 0.1 | 1,227,663 | 0.23 |
| Irish | gle_Latn | 661,660 | 0.1 | 1,227,357 | 0.23 |
| Fon | fon_Latn | 663,124 | 0.1 | 1,223,481 | 0.23 |
| Waray | war_Latn | 664,120 | 0.1 | 1,226,503 | 0.23 |
| Norwegian Bokmål | nob_Latn | 666,240 | 0.1 | 1,300,607 | 0.24 |
| Tosk Albanian | als_Latn | 666,692 | 0.1 | 1,223,481 | 0.23 |
| Standard Malay | zsm_Latn | 667,088 | 0.1 | 1,270,715 | 0.24 |
| Southern Sotho | sot_Latn | 667,728 | 0.1 | 1,223,481 | 0.23 |
| Kabyle | kab_Latn | 668,128 | 0.1 | 1,346,605 | 0.25 |
| Jingpho | kac_Latn | 669,464 | 0.1 | 1,223,481 | 0.23 |
| Lingala | lin_Latn | 670,428 | 0.1 | 1,323,481 | 0.25 |
| Wolof | wol_Latn | 670,568 | 0.1 | 1,373,481 | 0.26 |
| Central Kanuri (Latin script) | knc_Latn | 670,800 | 0.1 | 1,223,481 | 0.23 |
| Kikuyu | kik_Latn | 672,096 | 0.1 | 1,223,481 | 0.23 |
| Tok Pisin | tpi_Latn | 672,916 | 0.1 | 1,223,481 | 0.23 |
| Nuer | nus_Latn | 673,632 | 0.1 | 1,223,481 | 0.23 |
| Tagalog | tgl_Latn | 673,684 | 0.1 | 1,247,417 | 0.23 |
| Tumbuka | tum_Latn | 676,948 | 0.1 | 1,223,481 | 0.23 |
| Plateau Malagasy | plt_Latn | 677,852 | 0.1 | 1,223,481 | 0.23 |
| Afrikaans | afr_Latn | 679,164 | 0.1 | 1,337,091 | 0.25 |
| North Azerbaijani | azj_Latn | 679,820 | 0.1 | 1,223,481 | 0.23 |
| Kabiyè | kbp_Latn | 684,880 | 0.1 | 1,223,481 | 0.23 |
| Modern Standard Arabic (Romanized) | arb_Latn | 685,408 | 0.1 | 1,223,481 | 0.23 |
| Scottish Gaelic | gla_Latn | 708,620 | 0.1 | 1,243,627 | 0.23 |
| Sindhi | snd_Arab | 718,680 | 0.11 | 1,223,481 | 0.23 |
| North Levantine Arabic | apc_Arab | 720,048 | 0.11 | 1,223,481 | 0.23 |
| Tunisian Arabic | aeb_Arab | 720,360 | 0.11 | 1,223,481 | 0.23 |
| South Levantine Arabic | ajp_Arab | 720,488 | 0.11 | 1,223,481 | 0.23 |
| Dari | prs_Arab | 720,500 | 0.11 | 1,223,481 | 0.23 |
| Moroccan Arabic | ary_Arab | 722,904 | 0.11 | 1,223,481 | 0.23 |
| Egyptian Arabic | arz_Arab | 723,356 | 0.11 | 1,223,481 | 0.23 |
| Najdi Arabic | ars_Arab | 725,784 | 0.11 | 1,223,481 | 0.23 |
| Acehnese (Arabic script) | ace_Arab | 726,272 | 0.11 | 1,223,481 | 0.23 |
| Mesopotamian Arabic | acm_Arab | 728,472 | 0.11 | 1,223,481 | 0.23 |
| Ta’izzi-Adeni Arabic | acq_Arab | 734,780 | 0.11 | 1,223,481 | 0.23 |
| South Azerbaijani | azb_Arab | 735,728 | 0.11 | 1,223,481 | 0.23 |
| Central Kanuri (Arabic script) | knc_Arab | 746,936 | 0.11 | 1,223,481 | 0.23 |
| Rundi | run_Latn | 749,792 | 0.11 | 1,296,111 | 0.24 |
| Banjar (Arabic script) | bjn_Arab | 751,112 | 0.11 | 1,223,481 | 0.23 |
| Central Kurdish | ckb_Arab | 756,804 | 0.11 | 1,223,481 | 0.23 |
| Bashkir | bak_Cyrl | 758,816 | 0.11 | 1,223,481 | 0.23 |
| Kashmiri (Arabic script) | kas_Arab | 759,140 | 0.11 | 1,223,481 | 0.23 |
| Tatar | tat_Cyrl | 764,212 | 0.11 | 1,247,685 | 0.23 |
| Minangkabau (Arabic script) | min_Arab | 765,384 | 0.11 | 1,223,481 | 0.23 |
| Kazakh | kaz_Cyrl | 766,176 | 0.11 | 1,232,697 | 0.23 |
| Halh Mongolian | khk_Cyrl | 776,384 | 0.11 | 1,224,353 | 0.23 |
| Tajik | tgk_Cyrl | 780,452 | 0.11 | 1,223,481 | 0.23 |
| Eastern Yiddish | ydd_Hebr | 781,452 | 0.12 | 1,223,481 | 0.23 |
| Uyghur | uig_Arab | 785,444 | 0.12 | 1,256,999 | 0.24 |
| Armenian | hye_Armn | 789,952 | 0.12 | 1,228,171 | 0.23 |
| Hebrew | heb_Hebr | 793,144 | 0.12 | 1,604,365 | 0.3 |
| Belarusian | bel_Cyrl | 806,588 | 0.12 | 1,261,197 | 0.24 |
| Macedonian | mkd_Cyrl | 813,436 | 0.12 | 1,384,567 | 0.26 |
| Welsh | cym_Latn | 821,036 | 0.12 | 1,321,455 | 0.25 |
| Northern Uzbek | uzn_Latn | 835,560 | 0.12 | 1,273,404 | 0.24 |
| Central Atlas Tamazight | tzm_Tfng | 843,508 | 0.12 | 1,223,481 | 0.23 |
| Tamasheq (Tifinagh script) | taq_Tfng | 848,104 | 0.12 | 1,223,481 | 0.23 |
| Magahi | mag_Deva | 851,360 | 0.13 | 1,223,481 | 0.23 |
| Bhojpuri | bho_Deva | 854,848 | 0.13 | 1,223,481 | 0.23 |
| Awadhi | awa_Deva | 857,096 | 0.13 | 1,224,037 | 0.23 |
| Chhattisgarhi | hne_Deva | 859,332 | 0.13 | 1,223,481 | 0.23 |
| Kyrgyz | kir_Cyrl | 860,700 | 0.13 | 1,250,163 | 0.23 |
| Maithili | mai_Deva | 863,476 | 0.13 | 1,223,481 | 0.23 |
| Assamese | asm_Beng | 865,904 | 0.13 | 1,223,481 | 0.23 |
| Kashmiri (Devanagari script) | kas_Deva | 867,232 | 0.13 | 1,223,481 | 0.23 |
| Sanskrit | san_Deva | 879,236 | 0.13 | 1,223,481 | 0.23 |
| Lao | lao_Laoo | 888,240 | 0.13 | 1,223,481 | 0.23 |
| Odia | ory_Orya | 890,508 | 0.13 | 1,223,481 | 0.23 |
| Santali | sat_Olck | 902,300 | 0.13 | 1,223,481 | 0.23 |
| Kannada | kan_Knda | 909,260 | 0.13 | 1,223,481 | 0.23 |
| Meitei (Bengali script) | mni_Beng | 917,984 | 0.14 | 1,223,481 | 0.23 |
| Georgian | kat_Geor | 928,712 | 0.14 | 1,226,729 | 0.23 |
| Kamba | kam_Latn | 936,468 | 0.14 | 2,136,615 | 0.4 |
| Tigrinya | tir_Ethi | 949,608 | 0.14 | 1,276,536 | 0.24 |
| Swati | ssw_Latn | 950,564 | 0.14 | 2,195,002 | 0.41 |
| Malayalam | mal_Mlym | 953,984 | 0.14 | 1,225,083 | 0.23 |
| Nigerian Fulfulde | fuv_Latn | 956,328 | 0.14 | 2,126,652 | 0.4 |
| Umbundu | umb_Latn | 974,104 | 0.14 | 2,264,553 | 0.43 |
| Ganda | lug_Latn | 975,780 | 0.14 | 2,273,481 | 0.43 |
| Northern Sotho | nso_Latn | 978,484 | 0.14 | 2,250,971 | 0.42 |
| Khmer | khm_Khmr | 984,756 | 0.14 | 1,227,825 | 0.23 |
| Luo | luo_Latn | 993,068 | 0.15 | 2,249,242 | 0.42 |
| Standard Tibetan | bod_Tibt | 993,732 | 0.15 | 1,223,481 | 0.23 |
| Tswana | tsn_Latn | 1,009,328 | 0.15 | 2,323,481 | 0.44 |
| Kinyarwanda | kin_Latn | 1,010,752 | 0.15 | 2,273,481 | 0.43 |
| Sinhala | sin_Sinh | 1,012,012 | 0.15 | 1,256,582 | 0.24 |
| Xhosa | xho_Latn | 1,019,804 | 0.15 | 2,323,481 | 0.44 |
| Shona | sna_Latn | 1,026,320 | 0.15 | 2,273,481 | 0.43 |
| Esperanto | epo_Latn | 1,029,444 | 0.15 | 2,612,083 | 0.49 |
| Tsonga | tso_Latn | 1,031,856 | 0.15 | 2,323,481 | 0.44 |
| Dzongkha | dzo_Tibt | 1,033,552 | 0.15 | 1,223,481 | 0.23 |
| Zulu | zul_Latn | 1,039,296 | 0.15 | 2,323,481 | 0.44 |
| Serbian | srp_Cyrl | 1,040,024 | 0.15 | 1,362,598 | 0.26 |
| Nyanja | nya_Latn | 1,061,780 | 0.16 | 2,323,481 | 0.44 |
| Shan | shn_Mymr | 1,074,940 | 0.16 | 1,223,481 | 0.23 |
| Igbo | ibo_Latn | 1,095,300 | 0.16 | 2,282,301 | 0.43 |
| Hausa | hau_Latn | 1,112,272 | 0.16 | 2,335,738 | 0.44 |
| West Central Oromo | gaz_Latn | 1,115,600 | 0.16 | 2,343,260 | 0.44 |
| Nepali | npi_Deva | 1,144,676 | 0.17 | 1,281,430 | 0.24 |
| Yoruba | yor_Latn | 1,164,540 | 0.17 | 2,334,801 | 0.44 |
| Southern Pashto | pbt_Arab | 1,170,840 | 0.17 | 1,365,533 | 0.26 |
| Somali | som_Latn | 1,198,320 | 0.18 | 2,482,437 | 0.47 |
| Burmese | mya_Mymr | 1,228,196 | 0.18 | 1,279,882 | 0.24 |
| Amharic | amh_Ethi | 1,261,128 | 0.19 | 1,980,215 | 0.37 |
| Eastern Panjabi | pan_Guru | 1,305,636 | 0.19 | 1,307,897 | 0.25 |
| Gujarati | guj_Gujr | 1,331,780 | 0.2 | 1,317,314 | 0.25 |
| Marathi | mar_Deva | 1,494,024 | 0.22 | 1,443,950 | 0.27 |
| Bengali | ben_Beng | 1,650,272 | 0.24 | 1,411,514 | 0.27 |
| Chinese (Traditional) | zho_Hant | 1,778,736 | 0.26 | 1,956,189 | 0.37 |
| Tamil | tam_Taml | 1,833,328 | 0.27 | 1,394,473 | 0.26 |
| Swahili | swh_Latn | 1,970,784 | 0.29 | 4,185,608 | 0.79 |
| Telugu | tel_Telu | 2,224,480 | 0.33 | 1,573,325 | 0.3 |
| Ukrainian | ukr_Cyrl | 2,227,616 | 0.33 | 2,216,119 | 0.42 |
| Western Persian | pes_Arab | 2,389,340 | 0.35 | 1,811,121 | 0.34 |
| Turkish | tur_Latn | 3,106,600 | 0.46 | 4,146,153 | 0.78 |
| Urdu | urd_Arab | 3,553,960 | 0.52 | 3,513,218 | 0.66 |
| Korean | kor_Hang | 4,642,468 | 0.68 | 3,415,920 | 0.64 |
| Python | python | 4,728,504 | 0.7 | 3,142,962 | 0.59 |
| Japanese | jpn_Jpan | 5,079,788 | 0.75 | 4,193,570 | 0.79 |
| Thai | tha_Thai | 6,860,704 | 1.01 | 4,666,299 | 0.88 |
| Chinese (Simplified) | zho_Hans | 8,063,684 | 1.19 | 7,355,509 | 1.38 |
| Vietnamese | vie_Latn | 8,398,824 | 1.24 | 6,194,925 | 1.16 |
| Indonesian | ind_Latn | 9,380,144 | 1.38 | 5,301,812 | 1.0 |
| Hindi | hin_Deva | 9,914,328 | 1.46 | 5,612,176 | 1.05 |
| Croatian | hrv_Latn | 10,028,028 | 1.48 | 5,583,975 | 1.05 |
| Modern Standard Arabic | arb_Arab | 11,051,064 | 1.63 | 7,232,551 | 1.36 |
| Romanian | ron_Latn | 11,441,636 | 1.68 | 5,594,927 | 1.05 |
| Maltese | mlt_Latn | 11,614,488 | 1.71 | 5,513,885 | 1.04 |
| Slovenian | slv_Latn | 12,014,912 | 1.77 | 5,533,689 | 1.04 |
| Estonian | est_Latn | 12,126,212 | 1.79 | 5,584,057 | 1.05 |
| Lithuanian | lit_Latn | 12,253,976 | 1.8 | 5,603,047 | 1.05 |
| Slovak | slk_Latn | 12,286,300 | 1.81 | 5,513,481 | 1.04 |
| Standard Latvian | lvs_Latn | 12,298,584 | 1.81 | 5,517,287 | 1.04 |
| Polish | pol_Latn | 12,409,684 | 1.83 | 5,868,631 | 1.1 |
| Hungarian | hun_Latn | 12,607,420 | 1.86 | 6,086,621 | 1.14 |
| Russian | rus_Cyrl | 13,110,908 | 1.93 | 8,798,927 | 1.65 |
| Czech | ces_Latn | 14,316,052 | 2.11 | 6,418,462 | 1.21 |
| Bulgarian | bul_Cyrl | 14,615,468 | 2.15 | 7,265,885 | 1.37 |
| Swedish | swe_Latn | 14,646,656 | 2.16 | 5,634,363 | 1.06 |
| Finnish | fin_Latn | 15,011,464 | 2.21 | 6,077,501 | 1.14 |
| Danish | dan_Latn | 16,136,612 | 2.38 | 5,831,109 | 1.1 |
| Dutch | nld_Latn | 22,387,020 | 3.3 | 8,992,864 | 1.69 |
| Greek | ell_Grek | 23,144,296 | 3.41 | 7,224,001 | 1.36 |
| Italian | ita_Latn | 23,952,824 | 3.53 | 9,967,738 | 1.87 |
| Portuguese | por_Latn | 27,297,252 | 4.02 | 11,242,808 | 2.11 |
| German | deu_Latn | 27,909,808 | 4.11 | 15,806,969 | 2.97 |
| French | fra_Latn | 28,428,608 | 4.18 | 16,365,984 | 3.08 |
| Spanish | spa_Latn | 30,969,580 | 4.56 | 16,315,928 | 3.07 |
| English | eng_Latn | 69,530,384 | 10.24 | 53,015,690 | 9.96 |
| Total | - | 679,318,704 | 100 | 532,107,156 | 100 |
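As a worked check of how the percentage columns relate to the totals, the sketch below recomputes the English row from the table's own figures. The helper name `share_percent` is illustrative, and the gigabyte conversion assumes 1 kilobyte = 1,000 bytes, which the table does not state.

```python
# Figures copied from the English and Total rows of the table above.
ENGLISH_KB, TOTAL_KB = 69_530_384, 679_318_704
ENGLISH_SAMPLES, TOTAL_SAMPLES = 53_015_690, 532_107_156


def share_percent(part: int, total: int) -> float:
    """Return part/total as a percentage rounded to two decimals."""
    return round(100 * part / total, 2)


kb_share = share_percent(ENGLISH_KB, TOTAL_KB)
sample_share = share_percent(ENGLISH_SAMPLES, TOTAL_SAMPLES)

# Assumption: decimal kilobytes (1 kB = 1,000 bytes), so 1 GB = 1,000,000 kB.
total_gb = TOTAL_KB / 1_000_000

print(kb_share, sample_share, round(total_gb, 1))
```

Under that assumption the total comes to roughly 680 gigabytes, which is consistent with the size quoted earlier in this card.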
The dataset collection is released under Apache 2.0. Note that individual datasets may have different licenses.
@article{muennighoff2022crosslingual,
title={Crosslingual generalization through multitask finetuning},
author={Muennighoff, Niklas and Wang, Thomas and Sutawika, Lintang and Roberts, Adam and Biderman, Stella and Scao, Teven Le and Bari, M Saiful and Shen, Sheng and Yong, Zheng-Xin and Schoelkopf, Hailey and others},
journal={arXiv preprint arXiv:2211.01786},
year={2022}
}
Thanks to the contributors of promptsource for adding many of the prompts used in this dataset. Thanks to the Aya team @ C4AI 🧡