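Dataset Card for WikiANN

Dataset Summary

WikiANN (sometimes called PAN-X) is a multilingual named entity recognition dataset consisting of Wikipedia articles annotated in the IOB2 format with LOC (location), PER (person), and ORG (organization) tags. This version corresponds to the balanced train, development, and test splits of Rahimi et al. (2019) and supports 176 of the 282 languages in the original WikiANN corpus.

Supported Tasks and Leaderboards

  • Named entity recognition: the dataset can be used to train named entity recognition models for many languages, or to evaluate the zero-shot cross-lingual capabilities of multilingual models.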
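
One common way to score such an evaluation is entity-level F1 over IOB2 tag sequences. The card does not name a metric, so this choice is an assumption, and `entity_spans` and `span_f1` are hypothetical helper names used only for illustration. A minimal sketch:

```python
from typing import List, Set, Tuple


def entity_spans(tags: List[str]) -> Set[Tuple[str, int, int]]:
    """Return (type, start, end) triples for one IOB2 sequence; end is exclusive.

    Assumption: an I- tag that does not continue an open entity of the
    same type is ignored rather than treated as the start of an entity.
    """
    spans = set()
    start, kind = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing entity
        continues = tag.startswith("I-") and kind == tag[2:]
        if continues:
            continue
        if kind is not None:
            spans.add((kind, start, i))
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
        else:
            start, kind = None, None
    return spans


def span_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exact-match entity spans."""
    tp = n_gold = n_pred = 0
    for gold_tags, pred_tags in zip(gold, pred):
        gold_spans = entity_spans(gold_tags)
        pred_spans = entity_spans(pred_tags)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For zero-shot evaluation, `pred` would come from a model fine-tuned on one language subset and applied to the test split of another.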

Languages

The dataset contains 176 languages, one per configuration subset. The corresponding BCP 47 language tags are:

Subset Language tag
ace ace
af af
als als
am am
an an
ang ang
ar ar
arc arc
arz arz
as as
ast ast
ay ay
az az
ba ba
bar bar
be be
bg bg
bh bh
bn bn
bo bo
br br
bs bs
ca ca
cdo cdo
ce ce
ceb ceb
ckb ckb
co co
crh crh
cs cs
csb csb
cv cv
cy cy
da da
de de
diq diq
dv dv
el el
en en
eo eo
es es
et et
eu eu
ext ext
fa fa
fi fi
fo fo
fr fr
frr frr
fur fur
fy fy
ga ga
gan gan
gd gd
gl gl
gn gn
gu gu
hak hak
he he
hi hi
hr hr
hsb hsb
hu hu
hy hy
ia ia
id id
ig ig
ilo ilo
io io
is is
it it
ja ja
jbo jbo
jv jv
ka ka
kk kk
km km
kn kn
ko ko
ksh ksh
ku ku
ky ky
la la
lb lb
li li
lij lij
lmo lmo
ln ln
lt lt
lv lv
mg mg
mhr mhr
mi mi
min min
mk mk
ml ml
mn mn
mr mr
ms ms
mt mt
mwl mwl
my my
mzn mzn
nap nap
nds nds
ne ne
nl nl
nn nn
no no
nov nov
oc oc
or or
os os
other-bat-smg sgs
other-be-x-old be-tarask
other-cbk-zam cbk
other-eml eml
other-fiu-vro vro
other-map-bms jv-x-bms
other-simple en-basiceng
other-zh-classical lzh
other-zh-min-nan nan
other-zh-yue yue
pa pa
pdc pdc
pl pl
pms pms
pnb pnb
ps ps
pt pt
qu qu
rm rm
ro ro
ru ru
rw rw
sa sa
sah sah
scn scn
sco sco
sd sd
sh sh
si si
sk sk
sl sl
so so
sq sq
sr sr
su su
sv sv
sw sw
szl szl
ta ta
te te
tg tg
th th
tk tk
tl tl
tr tr
tt tt
ug ug
uk uk
ur ur
uz uz
vec vec
vep vep
vi vi
vls vls
vo vo
wa wa
war war
wuu wuu
xmf xmf
yi yi
yo yo
zea zea
zh zh
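
For most subsets the BCP 47 tag equals the subset name, so a lookup only needs the handful of exceptions from the table above. The sketch below assumes subset names may appear either with the "other-" prefix (as in this table) or without it (as in the split-size table); `IRREGULAR_BCP47` and `bcp47_tag` are hypothetical names, not part of the dataset.

```python
# Subsets whose BCP 47 tag differs from the subset name, copied from the table above.
IRREGULAR_BCP47 = {
    "bat-smg": "sgs",
    "be-x-old": "be-tarask",
    "cbk-zam": "cbk",
    "fiu-vro": "vro",
    "map-bms": "jv-x-bms",
    "simple": "en-basiceng",
    "zh-classical": "lzh",
    "zh-min-nan": "nan",
    "zh-yue": "yue",
}


def bcp47_tag(subset: str) -> str:
    """Map a subset name to its BCP 47 tag.

    Assumption: a leading "other-" is only a prefix on the subset name
    and can be dropped before the lookup.
    """
    prefix = "other-"
    name = subset[len(prefix):] if subset.startswith(prefix) else subset
    # Every other subset in the table uses its own name as the tag.
    return IRREGULAR_BCP47.get(name, name)
```

The function does not check that the subset exists in the dataset; unknown names are returned unchanged.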

Dataset Structure

Data Instances

This is an example from the "train" split of the "af" (Afrikaans) configuration subset:

{
  'tokens': ['Sy', 'ander', 'seun', ',', 'Swjatopolk', ',', 'was', 'die', 'resultaat', 'van', '’n', 'buite-egtelike', 'verhouding', '.'],
  'ner_tags': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  'langs': ['af', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'af'],
  'spans': ['PER: Swjatopolk']
}
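
The integer ids in ner_tags can be mapped back to label names. A minimal sketch, assuming the id order listed under the data fields (O = 0 through I-LOC = 6); `LABEL_NAMES` and `decode_tags` are illustrative names, not part of the dataset:

```python
# Label names indexed by id, in the order given for the ner_tags field.
LABEL_NAMES = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

# The "af" example above, reduced to the two fields needed here.
example = {
    "tokens": ["Sy", "ander", "seun", ",", "Swjatopolk", ",", "was", "die",
               "resultaat", "van", "’n", "buite-egtelike", "verhouding", "."],
    "ner_tags": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}


def decode_tags(tag_ids):
    """Turn a list of integer tag ids into IOB2 label strings."""
    return [LABEL_NAMES[i] for i in tag_ids]


labels = decode_tags(example["ner_tags"])
# Keep only the tokens that are part of an entity.
entities = [(tok, lab) for tok, lab in zip(example["tokens"], labels) if lab != "O"]
print(entities)  # [('Swjatopolk', 'B-PER')]
```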

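Data Fields

  • tokens: a list of string features.
  • langs: a list of string features giving the language of each token.
  • ner_tags: a list of classification labels, with possible values O (0), B-PER (1), I-PER (2), B-ORG (3), I-ORG (4), B-LOC (5), and I-LOC (6).
  • spans: a list of string features listing the named entities in the input text, each formatted as the entity tag, a colon, and the mention (for example, 'PER: Swjatopolk').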
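
The spans field can in principle be rebuilt from tokens and ner_tags. The sketch below is one way to do that under two assumptions: the tag ids follow the order listed for ner_tags, and the tokens of a multi-token mention are joined with a single space (which may not match the stored strings for languages written without spaces). `build_spans` is a hypothetical helper.

```python
LABEL_NAMES = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]


def build_spans(tokens, ner_tags):
    """Rebuild 'TAG: mention' strings from parallel token and tag-id lists."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, ner_tags):
        label = LABEL_NAMES[tag_id]
        # An I- tag of the same type extends the open entity.
        if label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
            continue
        # Anything else closes the open entity, if there is one.
        if current_type is not None:
            spans.append(f"{current_type}: {' '.join(current_tokens)}")
        if label.startswith("B-"):
            current_type, current_tokens = label[2:], [token]
        else:
            # "O", or an I- tag with no matching open entity (ignored here).
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append(f"{current_type}: {' '.join(current_tokens)}")
    return spans
```

Applied to the first seven tokens of the "af" example, `build_spans(["Sy", "ander", "seun", ",", "Swjatopolk", ",", "was"], [0, 0, 0, 0, 1, 0, 0])` returns `['PER: Swjatopolk']`, matching the spans value shown there.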

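Data Splits

For each configuration subset, the data is split into "train", "validation", and "test" sets containing the following numbers of examples: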

Subset Train Validation Test
ace 100 100 100
af 5000 1000 1000
als 100 100 100
am 100 100 100
an 1000 1000 1000
ang 100 100 100
ar 20000 10000 10000
arc 100 100 100
arz 100 100 100
as 100 100 100
ast 1000 1000 1000
ay 100 100 100
az 10000 1000 1000
ba 100 100 100
bar 100 100 100
bat-smg 100 100 100
be 15000 1000 1000
be-x-old 5000 1000 1000
bg 20000 10000 10000
bh 100 100 100
bn 10000 1000 1000
bo 100 100 100
br 1000 1000 1000
bs 15000 1000 1000
ca 20000 10000 10000
cbk-zam 100 100 100
cdo 100 100 100
ce 100 100 100
ceb 100 100 100
ckb 1000 1000 1000
co 100 100 100
crh 100 100 100
cs 20000 10000 10000
csb 100 100 100
cv 100 100 100
cy 10000 1000 1000
da 20000 10000 10000
de 20000 10000 10000
diq 100 100 100
dv 100 100 100
el 20000 10000 10000
eml 100 100 100
en 20000 10000 10000
eo 15000 10000 10000
es 20000 10000 10000
et 15000 10000 10000
eu 10000 10000 10000
ext 100 100 100
fa 20000 10000 10000
fi 20000 10000 10000
fiu-vro 100 100 100
fo 100 100 100
fr 20000 10000 10000
frr 100 100 100
fur 100 100 100
fy 1000 1000 1000
ga 1000 1000 1000
gan 100 100 100
gd 100 100 100
gl 15000 10000 10000
gn 100 100 100
gu 100 100 100
hak 100 100 100
he 20000 10000 10000
hi 5000 1000 1000
hr 20000 10000 10000
hsb 100 100 100
hu 20000 10000 10000
hy 15000 1000 1000
ia 100 100 100
id 20000 10000 10000
ig 100 100 100
ilo 100 100 100
io 100 100 100
is 1000 1000 1000
it 20000 10000 10000
ja 20000 10000 10000
jbo 100 100 100
jv 100 100 100
ka 10000 10000 10000
kk 1000 1000 1000
km 100 100 100
kn 100 100 100
ko 20000 10000 10000
ksh 100 100 100
ku 100 100 100
ky 100 100 100
la 5000 1000 1000
lb 5000 1000 1000
li 100 100 100
lij 100 100 100
lmo 100 100 100
ln 100 100 100
lt 10000 10000 10000
lv 10000 10000 10000
map-bms 100 100 100
mg 100 100 100
mhr 100 100 100
mi 100 100 100
min 100 100 100
mk 10000 1000 1000
ml 10000 1000 1000
mn 100 100 100
mr 5000 1000 1000
ms 20000 1000 1000
mt 100 100 100
mwl 100 100 100
my 100 100 100
mzn 100 100 100
nap 100 100 100
nds 100 100 100
ne 100 100 100
nl 20000 10000 10000
nn 20000 1000 1000
no 20000 10000 10000
nov 100 100 100
oc 100 100 100
or 100 100 100
os 100 100 100
pa 100 100 100
pdc 100 100 100
pl 20000 10000 10000
pms 100 100 100
pnb 100 100 100
ps 100 100 100
pt 20000 10000 10000
qu 100 100 100
rm 100 100 100
ro 20000 10000 10000
ru 20000 10000 10000
rw 100 100 100
sa 100 100 100
sah 100 100 100
scn 100 100 100
sco 100 100 100
sd 100 100 100
sh 20000 10000 10000
si 100 100 100
simple 20000 1000 1000
sk 20000 10000 10000
sl 15000 10000 10000
so 100 100 100
sq 5000 1000 1000
sr 20000 10000 10000
su 100 100 100
sv 20000 10000 10000
sw 1000 1000 1000
szl 100 100 100
ta 15000 1000 1000
te 1000 1000 1000
tg 100 100 100
th 20000 10000 10000
tk 100 100 100
tl 10000 1000 1000
tr 20000 10000 10000
tt 1000 1000 1000
ug 100 100 100
uk 20000 10000 10000
ur 20000 1000 1000
uz 1000 1000 1000
vec 100 100 100
vep 100 100 100
vi 20000 10000 10000
vls 100 100 100
vo 100 100 100
wa 100 100 100
war 100 100 100
wuu 100 100 100
xmf 100 100 100
yi 100 100 100
yo 100 100 100
zea 100 100 100
zh 20000 10000 10000
zh-classical 100 100 100
zh-min-nan 100 100 100
zh-yue 20000 10000 10000

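Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

The original dataset covering 282 languages is described in this article: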

@inproceedings{pan-etal-2017-cross,
    title = "Cross-lingual Name Tagging and Linking for 282 Languages",
    author = "Pan, Xiaoman  and
      Zhang, Boliang  and
      May, Jonathan  and
      Nothman, Joel  and
      Knight, Kevin  and
      Ji, Heng",
    booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2017",
    address = "Vancouver, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P17-1178",
    doi = "10.18653/v1/P17-1178",
    pages = "1946--1958",
    abstract = "The ambitious goal of this work is to develop a cross-lingual name tagging and linking framework for 282 languages that exist in Wikipedia. Given a document in any of these languages, our framework is able to identify name mentions, assign a coarse-grained or fine-grained type to each mention, and link it to an English Knowledge Base (KB) if it is linkable. We achieve this goal by performing a series of new KB mining methods: generating {``}silver-standard{''} annotations by transferring annotations from English to other languages through cross-lingual links and KB properties, refining annotations through self-training and topic selection, deriving language-specific morphology features from anchor links, and mining word translation pairs from cross-lingual links. Both name tagging and linking results for 282 languages are promising on Wikipedia data and on non-Wikipedia data.",
}

The 176 languages supported in this version are described in the following article:

@inproceedings{rahimi-etal-2019-massively,
    title = "Massively Multilingual Transfer for {NER}",
    author = "Rahimi, Afshin  and
      Li, Yuan  and
      Cohn, Trevor",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1015",
    pages = "151--164",
}

Contributions

Thanks to @lewtun and @rabeehk for adding this dataset.