Dataset: tner/wikiann
Multilinguality: multilingual
This is the WikiAnn NER dataset, formatted as part of the TNER project.
An example from the Japanese training set looks as follows.
{
  'tokens': ['#', '#', 'ユ', 'リ', 'ウ', 'ス', '・', 'ベ', 'ー', 'リ', 'ッ', 'ク', '#', '1', '9', '9', '9'],
  'tags': [6, 6, 2, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6]
}
The label-to-ID mapping dictionary is as follows.
{
    "B-LOC": 0,
    "B-ORG": 1,
    "B-PER": 2,
    "I-LOC": 3,
    "I-ORG": 4,
    "I-PER": 5,
    "O": 6
}
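The mapping above can be inverted to turn the integer `tags` of an example back into label strings and then into entity spans. The following is a minimal sketch rather than part of the TNER tooling: the helper names `decode_tags` and `extract_entities` are hypothetical, and it assumes the tags follow the usual B-/I-/O convention. Tokens are joined without a separator because the Japanese example is tokenized per character; for space-delimited languages, passing `sep=" "` would be the more natural choice.

```python
# Label-to-ID mapping copied from the dataset card.
label2id = {
    "B-LOC": 0, "B-ORG": 1, "B-PER": 2,
    "I-LOC": 3, "I-ORG": 4, "I-PER": 5,
    "O": 6,
}
# Inverse mapping: integer ID -> label string.
id2label = {i: label for label, i in label2id.items()}

# The Japanese training example from the dataset card.
example = {
    "tokens": ["#", "#", "ユ", "リ", "ウ", "ス", "・", "ベ", "ー", "リ", "ッ", "ク",
               "#", "1", "9", "9", "9"],
    "tags": [6, 6, 2, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6],
}


def decode_tags(tags):
    """Hypothetical helper: map integer tag IDs to label strings."""
    return [id2label[t] for t in tags]


def extract_entities(tokens, tags, sep=""):
    """Hypothetical helper: group B-/I- tagged tokens into (type, text) spans."""
    entities = []
    current_type, current_tokens = None, []
    for token, label in zip(tokens, decode_tags(tags)):
        starts_span = label.startswith("B-") or (
            # Treat an I- tag whose type differs from the open span as a new span.
            label.startswith("I-") and label[2:] != current_type
        )
        if starts_span:
            if current_tokens:
                entities.append((current_type, sep.join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-"):
            current_tokens.append(token)
        else:  # "O" closes any open span
            if current_tokens:
                entities.append((current_type, sep.join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, sep.join(current_tokens)))
    return entities


labels = decode_tags(example["tags"])
entities = extract_entities(example["tokens"], example["tags"])
print(entities)
```

For the example above, the only span found is a single `PER` entity covering the tokens from `ユ` through `ク`.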
| language | train | validation | test |
|---|---|---|---|
| ace | 100 | 100 | 100 | 
| bg | 20000 | 10000 | 10000 | 
| da | 20000 | 10000 | 10000 | 
| fur | 100 | 100 | 100 | 
| ilo | 100 | 100 | 100 | 
| lij | 100 | 100 | 100 | 
| mzn | 100 | 100 | 100 | 
| qu | 100 | 100 | 100 | 
| su | 100 | 100 | 100 | 
| vi | 20000 | 10000 | 10000 | 
| af | 5000 | 1000 | 1000 | 
| bh | 100 | 100 | 100 | 
| de | 20000 | 10000 | 10000 | 
| fy | 1000 | 1000 | 1000 | 
| io | 100 | 100 | 100 | 
| lmo | 100 | 100 | 100 | 
| nap | 100 | 100 | 100 | 
| rm | 100 | 100 | 100 | 
| sv | 20000 | 10000 | 10000 | 
| vls | 100 | 100 | 100 | 
| als | 100 | 100 | 100 | 
| bn | 10000 | 1000 | 1000 | 
| diq | 100 | 100 | 100 | 
| ga | 1000 | 1000 | 1000 | 
| is | 1000 | 1000 | 1000 | 
| ln | 100 | 100 | 100 | 
| nds | 100 | 100 | 100 | 
| ro | 20000 | 10000 | 10000 | 
| sw | 1000 | 1000 | 1000 | 
| vo | 100 | 100 | 100 | 
| am | 100 | 100 | 100 | 
| bo | 100 | 100 | 100 | 
| dv | 100 | 100 | 100 | 
| gan | 100 | 100 | 100 | 
| it | 20000 | 10000 | 10000 | 
| lt | 10000 | 10000 | 10000 | 
| ne | 100 | 100 | 100 | 
| ru | 20000 | 10000 | 10000 | 
| szl | 100 | 100 | 100 | 
| wa | 100 | 100 | 100 | 
| an | 1000 | 1000 | 1000 | 
| br | 1000 | 1000 | 1000 | 
| el | 20000 | 10000 | 10000 | 
| gd | 100 | 100 | 100 | 
| ja | 20000 | 10000 | 10000 | 
| lv | 10000 | 10000 | 10000 | 
| nl | 20000 | 10000 | 10000 | 
| rw | 100 | 100 | 100 | 
| ta | 15000 | 1000 | 1000 | 
| war | 100 | 100 | 100 | 
| ang | 100 | 100 | 100 | 
| bs | 15000 | 1000 | 1000 | 
| eml | 100 | 100 | 100 | 
| gl | 15000 | 10000 | 10000 | 
| jbo | 100 | 100 | 100 | 
| map-bms | 100 | 100 | 100 | 
| nn | 20000 | 1000 | 1000 | 
| sa | 100 | 100 | 100 | 
| te | 1000 | 1000 | 1000 | 
| wuu | 100 | 100 | 100 | 
| ar | 20000 | 10000 | 10000 | 
| ca | 20000 | 10000 | 10000 | 
| en | 20000 | 10000 | 10000 | 
| gn | 100 | 100 | 100 | 
| jv | 100 | 100 | 100 | 
| mg | 100 | 100 | 100 | 
| no | 20000 | 10000 | 10000 | 
| sah | 100 | 100 | 100 | 
| tg | 100 | 100 | 100 | 
| xmf | 100 | 100 | 100 | 
| arc | 100 | 100 | 100 | 
| cbk-zam | 100 | 100 | 100 | 
| eo | 15000 | 10000 | 10000 | 
| gu | 100 | 100 | 100 | 
| ka | 10000 | 10000 | 10000 | 
| mhr | 100 | 100 | 100 | 
| nov | 100 | 100 | 100 | 
| scn | 100 | 100 | 100 | 
| th | 20000 | 10000 | 10000 | 
| yi | 100 | 100 | 100 | 
| arz | 100 | 100 | 100 | 
| cdo | 100 | 100 | 100 | 
| es | 20000 | 10000 | 10000 | 
| hak | 100 | 100 | 100 | 
| kk | 1000 | 1000 | 1000 | 
| mi | 100 | 100 | 100 | 
| oc | 100 | 100 | 100 | 
| sco | 100 | 100 | 100 | 
| tk | 100 | 100 | 100 | 
| yo | 100 | 100 | 100 | 
| as | 100 | 100 | 100 | 
| ce | 100 | 100 | 100 | 
| et | 15000 | 10000 | 10000 | 
| he | 20000 | 10000 | 10000 | 
| km | 100 | 100 | 100 | 
| min | 100 | 100 | 100 | 
| or | 100 | 100 | 100 | 
| sd | 100 | 100 | 100 | 
| tl | 10000 | 1000 | 1000 | 
| zea | 100 | 100 | 100 | 
| ast | 1000 | 1000 | 1000 | 
| ceb | 100 | 100 | 100 | 
| eu | 10000 | 10000 | 10000 | 
| hi | 5000 | 1000 | 1000 | 
| kn | 100 | 100 | 100 | 
| mk | 10000 | 1000 | 1000 | 
| os | 100 | 100 | 100 | 
| sh | 20000 | 10000 | 10000 | 
| tr | 20000 | 10000 | 10000 | 
| zh-classical | 100 | 100 | 100 | 
| ay | 100 | 100 | 100 | 
| ckb | 1000 | 1000 | 1000 | 
| ext | 100 | 100 | 100 | 
| hr | 20000 | 10000 | 10000 | 
| ko | 20000 | 10000 | 10000 | 
| ml | 10000 | 1000 | 1000 | 
| pa | 100 | 100 | 100 | 
| si | 100 | 100 | 100 | 
| tt | 1000 | 1000 | 1000 | 
| zh-min-nan | 100 | 100 | 100 | 
| az | 10000 | 1000 | 1000 | 
| co | 100 | 100 | 100 | 
| fa | 20000 | 10000 | 10000 | 
| hsb | 100 | 100 | 100 | 
| ksh | 100 | 100 | 100 | 
| mn | 100 | 100 | 100 | 
| pdc | 100 | 100 | 100 | 
| simple | 20000 | 1000 | 1000 | 
| ug | 100 | 100 | 100 | 
| zh-yue | 20000 | 10000 | 10000 | 
| ba | 100 | 100 | 100 | 
| crh | 100 | 100 | 100 | 
| fi | 20000 | 10000 | 10000 | 
| hu | 20000 | 10000 | 10000 | 
| ku | 100 | 100 | 100 | 
| mr | 5000 | 1000 | 1000 | 
| pl | 20000 | 10000 | 10000 | 
| sk | 20000 | 10000 | 10000 | 
| uk | 20000 | 10000 | 10000 | 
| zh | 20000 | 10000 | 10000 | 
| bar | 100 | 100 | 100 | 
| cs | 20000 | 10000 | 10000 | 
| fiu-vro | 100 | 100 | 100 | 
| hy | 15000 | 1000 | 1000 | 
| ky | 100 | 100 | 100 | 
| ms | 20000 | 1000 | 1000 | 
| pms | 100 | 100 | 100 | 
| sl | 15000 | 10000 | 10000 | 
| ur | 20000 | 1000 | 1000 | 
| bat-smg | 100 | 100 | 100 | 
| csb | 100 | 100 | 100 | 
| fo | 100 | 100 | 100 | 
| ia | 100 | 100 | 100 | 
| la | 5000 | 1000 | 1000 | 
| mt | 100 | 100 | 100 | 
| pnb | 100 | 100 | 100 | 
| so | 100 | 100 | 100 | 
| uz | 1000 | 1000 | 1000 | 
| be-x-old | 5000 | 1000 | 1000 | 
| cv | 100 | 100 | 100 | 
| fr | 20000 | 10000 | 10000 | 
| id | 20000 | 10000 | 10000 | 
| lb | 5000 | 1000 | 1000 | 
| mwl | 100 | 100 | 100 | 
| ps | 100 | 100 | 100 | 
| sq | 5000 | 1000 | 1000 | 
| vec | 100 | 100 | 100 | 
| be | 15000 | 1000 | 1000 | 
| cy | 10000 | 1000 | 1000 | 
| frr | 100 | 100 | 100 | 
| ig | 100 | 100 | 100 | 
| li | 100 | 100 | 100 | 
| my | 100 | 100 | 100 | 
| pt | 20000 | 10000 | 10000 | 
| sr | 20000 | 10000 | 10000 | 
| vep | 100 | 100 | 100 | 
@inproceedings{pan-etal-2017-cross,
    title = "Cross-lingual Name Tagging and Linking for 282 Languages",
    author = "Pan, Xiaoman  and
      Zhang, Boliang  and
      May, Jonathan  and
      Nothman, Joel  and
      Knight, Kevin  and
      Ji, Heng",
    booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2017",
    address = "Vancouver, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/P17-1178",
    doi = "10.18653/v1/P17-1178",
    pages = "1946--1958",
    abstract = "The ambitious goal of this work is to develop a cross-lingual name tagging and linking framework for 282 languages that exist in Wikipedia. Given a document in any of these languages, our framework is able to identify name mentions, assign a coarse-grained or fine-grained type to each mention, and link it to an English Knowledge Base (KB) if it is linkable. We achieve this goal by performing a series of new KB mining methods: generating {``}silver-standard{''} annotations by transferring annotations from English to other languages through cross-lingual links and KB properties, refining annotations through self-training and topic selection, deriving language-specific morphology features from anchor links, and mining word translation pairs from cross-lingual links. Both name tagging and linking results for 282 languages are promising on Wikipedia data and non-Wikipedia data.",
}