Dataset: wikiann
Tasks:
Multilinguality: multilingual
Size: n<1K
Language creators: crowdsourced
Annotation creators: machine-generated
Source datasets: original
Preprint: arxiv:1902.00193
License:
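WikiANN (sometimes called PAN-X) is a multilingual named entity recognition dataset consisting of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organisation) tags in the IOB2 format. This version corresponds to the balanced train, dev, and test splits of Rahimi et al. (2019), which support 176 of the 282 languages in the original WikiANN corpus.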
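As an illustration of the IOB2 scheme mentioned above, the following minimal sketch checks whether a sequence of string tags is well formed: every `I-X` tag must continue a span opened by `B-X` (or a preceding `I-X`) of the same type. The function name `is_valid_iob2` is hypothetical and not part of the dataset or any library.

```python
from typing import List


def is_valid_iob2(tags: List[str]) -> bool:
    """Return True if `tags` is a well-formed IOB2 sequence."""
    open_type = None  # entity type of the span currently open, if any
    for tag in tags:
        if tag == "O":
            open_type = None  # outside any entity: close the open span
        elif tag.startswith("B-"):
            open_type = tag[2:]  # a B- tag always starts a new span
        elif tag.startswith("I-"):
            # An inside tag must continue a span of the same type.
            if open_type != tag[2:]:
                return False
        else:
            return False  # not an IOB2-shaped tag
    return True


print(is_valid_iob2(["O", "B-PER", "I-PER", "O"]))  # True
print(is_valid_iob2(["O", "I-PER", "O"]))  # False
```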
The dataset contains 176 languages, one per configuration subset. The corresponding BCP 47 language tags are:
| Language code | BCP 47 language tag |
|---|---|
| ace | ace | 
| af | af | 
| als | als | 
| am | am | 
| an | an | 
| ang | ang | 
| ar | ar | 
| arc | arc | 
| arz | arz | 
| as | as | 
| ast | ast | 
| ay | ay | 
| az | az | 
| ba | ba | 
| bar | bar | 
| be | be | 
| bg | bg | 
| bh | bh | 
| bn | bn | 
| bo | bo | 
| br | br | 
| bs | bs | 
| ca | ca | 
| cdo | cdo | 
| ce | ce | 
| ceb | ceb | 
| ckb | ckb | 
| co | co | 
| crh | crh | 
| cs | cs | 
| csb | csb | 
| cv | cv | 
| cy | cy | 
| da | da | 
| de | de | 
| diq | diq | 
| dv | dv | 
| el | el | 
| en | en | 
| eo | eo | 
| es | es | 
| et | et | 
| eu | eu | 
| ext | ext | 
| fa | fa | 
| fi | fi | 
| fo | fo | 
| fr | fr | 
| frr | frr | 
| fur | fur | 
| fy | fy | 
| ga | ga | 
| gan | gan | 
| gd | gd | 
| gl | gl | 
| gn | gn | 
| gu | gu | 
| hak | hak | 
| he | he | 
| hi | hi | 
| hr | hr | 
| hsb | hsb | 
| hu | hu | 
| hy | hy | 
| ia | ia | 
| id | id | 
| ig | ig | 
| ilo | ilo | 
| io | io | 
| is | is | 
| it | it | 
| ja | ja | 
| jbo | jbo | 
| jv | jv | 
| ka | ka | 
| kk | kk | 
| km | km | 
| kn | kn | 
| ko | ko | 
| ksh | ksh | 
| ku | ku | 
| ky | ky | 
| la | la | 
| lb | lb | 
| li | li | 
| lij | lij | 
| lmo | lmo | 
| ln | ln | 
| lt | lt | 
| lv | lv | 
| mg | mg | 
| mhr | mhr | 
| mi | mi | 
| min | min | 
| mk | mk | 
| ml | ml | 
| mn | mn | 
| mr | mr | 
| ms | ms | 
| mt | mt | 
| mwl | mwl | 
| my | my | 
| mzn | mzn | 
| nap | nap | 
| nds | nds | 
| ne | ne | 
| nl | nl | 
| nn | nn | 
| no | no | 
| nov | nov | 
| oc | oc | 
| or | or | 
| os | os | 
| other-bat-smg | sgs | 
| other-be-x-old | be-tarask | 
| other-cbk-zam | cbk | 
| other-eml | eml | 
| other-fiu-vro | vro | 
| other-map-bms | jv-x-bms | 
| other-simple | en-basiceng | 
| other-zh-classical | lzh | 
| other-zh-min-nan | nan | 
| other-zh-yue | yue | 
| pa | pa | 
| pdc | pdc | 
| pl | pl | 
| pms | pms | 
| pnb | pnb | 
| ps | ps | 
| pt | pt | 
| qu | qu | 
| rm | rm | 
| ro | ro | 
| ru | ru | 
| rw | rw | 
| sa | sa | 
| sah | sah | 
| scn | scn | 
| sco | sco | 
| sd | sd | 
| sh | sh | 
| si | si | 
| sk | sk | 
| sl | sl | 
| so | so | 
| sq | sq | 
| sr | sr | 
| su | su | 
| sv | sv | 
| sw | sw | 
| szl | szl | 
| ta | ta | 
| te | te | 
| tg | tg | 
| th | th | 
| tk | tk | 
| tl | tl | 
| tr | tr | 
| tt | tt | 
| ug | ug | 
| uk | uk | 
| ur | ur | 
| uz | uz | 
| vec | vec | 
| vep | vep | 
| vi | vi | 
| vls | vls | 
| vo | vo | 
| wa | wa | 
| war | war | 
| wuu | wuu | 
| xmf | xmf | 
| yi | yi | 
| yo | yo | 
| zea | zea | 
| zh | zh | 
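The mapping in the table above is the identity for all but ten codes, so it can be sketched as a small override dictionary. The helper names `to_bcp47` and `to_subset_name` are hypothetical. The second helper rests on an assumption: that the subset names used in the split table (for example `bat-smg`) are the listed codes with the `other-` prefix removed.

```python
# The ten exceptions copied from the table; every other code maps to itself.
BCP47_OVERRIDES = {
    "other-bat-smg": "sgs",
    "other-be-x-old": "be-tarask",
    "other-cbk-zam": "cbk",
    "other-eml": "eml",
    "other-fiu-vro": "vro",
    "other-map-bms": "jv-x-bms",
    "other-simple": "en-basiceng",
    "other-zh-classical": "lzh",
    "other-zh-min-nan": "nan",
    "other-zh-yue": "yue",
}


def to_bcp47(code: str) -> str:
    """Map a listed language code to its BCP 47 tag."""
    return BCP47_OVERRIDES.get(code, code)


def to_subset_name(code: str) -> str:
    """Assumption: the subset name is the code without the 'other-' prefix."""
    prefix = "other-"
    return code[len(prefix):] if code.startswith(prefix) else code


print(to_bcp47("other-zh-yue"), to_subset_name("other-zh-yue"))
print(to_bcp47("af"), to_subset_name("af"))
```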
The following is an example from the "train" split of the "af" (Afrikaans) configuration subset:
{
  'tokens': ['Sy', 'ander', 'seun', ',', 'Swjatopolk', ',', 'was', 'die', 'resultaat', 'van', '’n', 'buite-egtelike', 'verhouding', '.'],
  'ner_tags': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  'langs': ['af', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'af'],
  'spans': ['PER: Swjatopolk']
}
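The relationship between `ner_tags` and `spans` in the example above can be sketched as follows. The label order in `LABELS` is an assumption, not stated in this card; it is chosen so that tag id 1 corresponds to `B-PER`, which is consistent with the example. Joining multi-token entities with a single space is also an assumption, and `extract_spans` is a hypothetical helper.

```python
from typing import List

# Assumed id-to-label order (not stated in the card).
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]


def extract_spans(tokens: List[str], ner_tags: List[int],
                  labels: List[str] = LABELS) -> List[str]:
    """Rebuild 'TYPE: text' span strings from integer IOB2 tags."""
    spans = []
    current_type = None   # entity type of the span being built
    current_tokens = []   # tokens collected for that span
    for token, tag_id in zip(tokens, ner_tags):
        label = labels[tag_id]
        if label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)  # continue the open span
            continue
        # Any other label closes the open span, if there is one.
        if current_type is not None:
            spans.append(f"{current_type}: {' '.join(current_tokens)}")
        if label == "O":
            current_type, current_tokens = None, []
        else:
            # A B- tag (or a stray I- tag) starts a new span.
            current_type, current_tokens = label[2:], [token]
    if current_type is not None:
        spans.append(f"{current_type}: {' '.join(current_tokens)}")
    return spans


tokens = ["Sy", "ander", "seun", ",", "Swjatopolk", ",", "was"]
ner_tags = [0, 0, 0, 0, 1, 0, 0]
print(extract_spans(tokens, ner_tags))  # ['PER: Swjatopolk']
```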
For each configuration subset, the data is split into "train", "validation", and "test" sets, each containing the following number of examples:
| Subset | Train | Validation | Test |
|---|---|---|---|
| ace | 100 | 100 | 100 | 
| af | 5000 | 1000 | 1000 | 
| als | 100 | 100 | 100 | 
| am | 100 | 100 | 100 | 
| an | 1000 | 1000 | 1000 | 
| ang | 100 | 100 | 100 | 
| ar | 20000 | 10000 | 10000 | 
| arc | 100 | 100 | 100 | 
| arz | 100 | 100 | 100 | 
| as | 100 | 100 | 100 | 
| ast | 1000 | 1000 | 1000 | 
| ay | 100 | 100 | 100 | 
| az | 10000 | 1000 | 1000 | 
| ba | 100 | 100 | 100 | 
| bar | 100 | 100 | 100 | 
| bat-smg | 100 | 100 | 100 | 
| be | 15000 | 1000 | 1000 | 
| be-x-old | 5000 | 1000 | 1000 | 
| bg | 20000 | 10000 | 10000 | 
| bh | 100 | 100 | 100 | 
| bn | 10000 | 1000 | 1000 | 
| bo | 100 | 100 | 100 | 
| br | 1000 | 1000 | 1000 | 
| bs | 15000 | 1000 | 1000 | 
| ca | 20000 | 10000 | 10000 | 
| cbk-zam | 100 | 100 | 100 | 
| cdo | 100 | 100 | 100 | 
| ce | 100 | 100 | 100 | 
| ceb | 100 | 100 | 100 | 
| ckb | 1000 | 1000 | 1000 | 
| co | 100 | 100 | 100 | 
| crh | 100 | 100 | 100 | 
| cs | 20000 | 10000 | 10000 | 
| csb | 100 | 100 | 100 | 
| cv | 100 | 100 | 100 | 
| cy | 10000 | 1000 | 1000 | 
| da | 20000 | 10000 | 10000 | 
| de | 20000 | 10000 | 10000 | 
| diq | 100 | 100 | 100 | 
| dv | 100 | 100 | 100 | 
| el | 20000 | 10000 | 10000 | 
| eml | 100 | 100 | 100 | 
| en | 20000 | 10000 | 10000 | 
| eo | 15000 | 10000 | 10000 | 
| es | 20000 | 10000 | 10000 | 
| et | 15000 | 10000 | 10000 | 
| eu | 10000 | 10000 | 10000 | 
| ext | 100 | 100 | 100 | 
| fa | 20000 | 10000 | 10000 | 
| fi | 20000 | 10000 | 10000 | 
| fiu-vro | 100 | 100 | 100 | 
| fo | 100 | 100 | 100 | 
| fr | 20000 | 10000 | 10000 | 
| frr | 100 | 100 | 100 | 
| fur | 100 | 100 | 100 | 
| fy | 1000 | 1000 | 1000 | 
| ga | 1000 | 1000 | 1000 | 
| gan | 100 | 100 | 100 | 
| gd | 100 | 100 | 100 | 
| gl | 15000 | 10000 | 10000 | 
| gn | 100 | 100 | 100 | 
| gu | 100 | 100 | 100 | 
| hak | 100 | 100 | 100 | 
| he | 20000 | 10000 | 10000 | 
| hi | 5000 | 1000 | 1000 | 
| hr | 20000 | 10000 | 10000 | 
| hsb | 100 | 100 | 100 | 
| hu | 20000 | 10000 | 10000 | 
| hy | 15000 | 1000 | 1000 | 
| ia | 100 | 100 | 100 | 
| id | 20000 | 10000 | 10000 | 
| ig | 100 | 100 | 100 | 
| ilo | 100 | 100 | 100 | 
| io | 100 | 100 | 100 | 
| is | 1000 | 1000 | 1000 | 
| it | 20000 | 10000 | 10000 | 
| ja | 20000 | 10000 | 10000 | 
| jbo | 100 | 100 | 100 | 
| jv | 100 | 100 | 100 | 
| ka | 10000 | 10000 | 10000 | 
| kk | 1000 | 1000 | 1000 | 
| km | 100 | 100 | 100 | 
| kn | 100 | 100 | 100 | 
| ko | 20000 | 10000 | 10000 | 
| ksh | 100 | 100 | 100 | 
| ku | 100 | 100 | 100 | 
| ky | 100 | 100 | 100 | 
| la | 5000 | 1000 | 1000 | 
| lb | 5000 | 1000 | 1000 | 
| li | 100 | 100 | 100 | 
| lij | 100 | 100 | 100 | 
| lmo | 100 | 100 | 100 | 
| ln | 100 | 100 | 100 | 
| lt | 10000 | 10000 | 10000 | 
| lv | 10000 | 10000 | 10000 | 
| map-bms | 100 | 100 | 100 | 
| mg | 100 | 100 | 100 | 
| mhr | 100 | 100 | 100 | 
| mi | 100 | 100 | 100 | 
| min | 100 | 100 | 100 | 
| mk | 10000 | 1000 | 1000 | 
| ml | 10000 | 1000 | 1000 | 
| mn | 100 | 100 | 100 | 
| mr | 5000 | 1000 | 1000 | 
| ms | 20000 | 1000 | 1000 | 
| mt | 100 | 100 | 100 | 
| mwl | 100 | 100 | 100 | 
| my | 100 | 100 | 100 | 
| mzn | 100 | 100 | 100 | 
| nap | 100 | 100 | 100 | 
| nds | 100 | 100 | 100 | 
| ne | 100 | 100 | 100 | 
| nl | 20000 | 10000 | 10000 | 
| nn | 20000 | 1000 | 1000 | 
| no | 20000 | 10000 | 10000 | 
| nov | 100 | 100 | 100 | 
| oc | 100 | 100 | 100 | 
| or | 100 | 100 | 100 | 
| os | 100 | 100 | 100 | 
| pa | 100 | 100 | 100 | 
| pdc | 100 | 100 | 100 | 
| pl | 20000 | 10000 | 10000 | 
| pms | 100 | 100 | 100 | 
| pnb | 100 | 100 | 100 | 
| ps | 100 | 100 | 100 | 
| pt | 20000 | 10000 | 10000 | 
| qu | 100 | 100 | 100 | 
| rm | 100 | 100 | 100 | 
| ro | 20000 | 10000 | 10000 | 
| ru | 20000 | 10000 | 10000 | 
| rw | 100 | 100 | 100 | 
| sa | 100 | 100 | 100 | 
| sah | 100 | 100 | 100 | 
| scn | 100 | 100 | 100 | 
| sco | 100 | 100 | 100 | 
| sd | 100 | 100 | 100 | 
| sh | 20000 | 10000 | 10000 | 
| si | 100 | 100 | 100 | 
| simple | 20000 | 1000 | 1000 | 
| sk | 20000 | 10000 | 10000 | 
| sl | 15000 | 10000 | 10000 | 
| so | 100 | 100 | 100 | 
| sq | 5000 | 1000 | 1000 | 
| sr | 20000 | 10000 | 10000 | 
| su | 100 | 100 | 100 | 
| sv | 20000 | 10000 | 10000 | 
| sw | 1000 | 1000 | 1000 | 
| szl | 100 | 100 | 100 | 
| ta | 15000 | 1000 | 1000 | 
| te | 1000 | 1000 | 1000 | 
| tg | 100 | 100 | 100 | 
| th | 20000 | 10000 | 10000 | 
| tk | 100 | 100 | 100 | 
| tl | 10000 | 1000 | 1000 | 
| tr | 20000 | 10000 | 10000 | 
| tt | 1000 | 1000 | 1000 | 
| ug | 100 | 100 | 100 | 
| uk | 20000 | 10000 | 10000 | 
| ur | 20000 | 1000 | 1000 | 
| uz | 1000 | 1000 | 1000 | 
| vec | 100 | 100 | 100 | 
| vep | 100 | 100 | 100 | 
| vi | 20000 | 10000 | 10000 | 
| vls | 100 | 100 | 100 | 
| vo | 100 | 100 | 100 | 
| wa | 100 | 100 | 100 | 
| war | 100 | 100 | 100 | 
| wuu | 100 | 100 | 100 | 
| xmf | 100 | 100 | 100 | 
| yi | 100 | 100 | 100 | 
| yo | 100 | 100 | 100 | 
| zea | 100 | 100 | 100 | 
| zh | 20000 | 10000 | 10000 | 
| zh-classical | 100 | 100 | 100 | 
| zh-min-nan | 100 | 100 | 100 | 
| zh-yue | 20000 | 10000 | 10000 | 
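A quick sanity check against the table above might look like the sketch below. Only three rows are copied into `EXPECTED_SIZES` for brevity, and the helper names are hypothetical. The check accepts any mapping from split name to a sized collection, so it runs here on placeholder lists rather than on downloaded data.

```python
from typing import Dict, Mapping, Sized

# A few rows copied from the table above (subset -> split -> example count).
EXPECTED_SIZES = {
    "ace": {"train": 100, "validation": 100, "test": 100},
    "af": {"train": 5000, "validation": 1000, "test": 1000},
    "en": {"train": 20000, "validation": 10000, "test": 10000},
}


def split_sizes(splits: Mapping[str, Sized]) -> Dict[str, int]:
    """Count the examples in each split."""
    return {name: len(rows) for name, rows in splits.items()}


def matches_card(subset: str, splits: Mapping[str, Sized]) -> bool:
    """Compare observed split sizes with the numbers listed in the table."""
    return split_sizes(splits) == EXPECTED_SIZES[subset]


# Placeholder data standing in for a loaded subset.
fake_ace = {name: [None] * 100 for name in ("train", "validation", "test")}
print(matches_card("ace", fake_ace))  # True
```

If the Hugging Face `datasets` library is installed, the object returned by something like `load_dataset("wikiann", "af")` could presumably be passed as `splits`; that call and the dataset identifier are assumptions and are not verified here.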
Who are the source language producers? [More Information Needed]
Who are the annotators? [More Information Needed]
The original datasets, covering 282 languages, are associated with the following article:
@inproceedings{pan-etal-2017-cross,
    title = "Cross-lingual Name Tagging and Linking for 282 Languages",
    author = "Pan, Xiaoman  and
      Zhang, Boliang  and
      May, Jonathan  and
      Nothman, Joel  and
      Knight, Kevin  and
      Ji, Heng",
    booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2017",
    address = "Vancouver, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P17-1178",
    doi = "10.18653/v1/P17-1178",
    pages = "1946--1958",
    abstract = "The ambitious goal of this work is to develop a cross-lingual name tagging and linking framework for 282 languages that exist in Wikipedia. Given a document in any of these languages, our framework is able to identify name mentions, assign a coarse-grained or fine-grained type to each mention, and link it to an English Knowledge Base (KB) if it is linkable. We achieve this goal by performing a series of new KB mining methods: generating {``}silver-standard{''} annotations by transferring annotations from English to other languages through cross-lingual links and KB properties, refining annotations through self-training and topic selection, deriving language-specific morphology features from anchor links, and mining word translation pairs from cross-lingual links. Both name tagging and linking results for 282 languages are promising on Wikipedia data and non-Wikipedia data.",
}
The 176 languages supported in this version are associated with the following article:
@inproceedings{rahimi-etal-2019-massively,
    title = "Massively Multilingual Transfer for {NER}",
    author = "Rahimi, Afshin  and
      Li, Yuan  and
      Cohn, Trevor",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1015",
    pages = "151--164",
}