Dataset: swedish_ner_corpus
Tasks:
Languages:
Multilinguality: monolingual
Size: 1K<n<10K
Language creators: found
Annotation creators: expert-generated
Source datasets: original
License:
Webbnyheter 2012 from Spraakbanken, semi-manually annotated and adapted for CoreNLP Swedish NER. "Semi-manually" here means that the annotations were bootstrapped from Swedish gazetteers and then manually corrected and reviewed by two independent native Swedish-speaking annotators. No inter-annotator agreement was calculated.
Swedish
A sample dataset instance is provided below:
{'id': '3',
'ner_tags': [4, 4, 0, 0, 0, 0, 0, 0, 3, 3, 0],
'tokens': ['Margaretha',
'Fahlgren',
',',
'professor',
'i',
'litteraturvetenskap',
',',
'vice-rektor',
'Uppsala',
'universitet',
'.']}
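To make the integer tags in the sample easier to read, the following is a minimal sketch that maps each tag id to its class name and merges adjacent tokens that share a label. The class names are copied from the "Full fields" listing in this card; the helper `group_entities` is hypothetical and not part of the dataset or any library. Because the tags carry no B-/I- prefixes, the sketch assumes that a run of identical labels forms one entity, which would merge two directly adjacent entities of the same type.

```python
# Class names copied from this card's field listing;
# the list index is the integer tag id.
CLASS_NAMES = ["0", "LOC", "MISC", "ORG", "PER"]

sample = {
    "id": "3",
    "ner_tags": [4, 4, 0, 0, 0, 0, 0, 0, 3, 3, 0],
    "tokens": ["Margaretha", "Fahlgren", ",", "professor", "i",
               "litteraturvetenskap", ",", "vice-rektor", "Uppsala",
               "universitet", "."],
}


def group_entities(tokens, tag_ids, class_names=CLASS_NAMES, outside="0"):
    """Merge runs of adjacent tokens that share the same non-outside label.

    Hypothetical helper: assumes one run of identical labels is one entity.
    """
    spans = []
    current_label, current_tokens = None, []
    for token, tag_id in zip(tokens, tag_ids):
        label = class_names[tag_id]
        if label == current_label and label != outside:
            # Same entity continues.
            current_tokens.append(token)
            continue
        if current_tokens:
            # The label changed, so close the entity collected so far.
            spans.append((current_label, " ".join(current_tokens)))
        if label != outside:
            current_label, current_tokens = label, [token]
        else:
            current_label, current_tokens = None, []
    if current_tokens:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


entities = group_entities(sample["tokens"], sample["ner_tags"])
print(entities)
```

For the sample instance this yields one PER span and one ORG span; all other tokens carry the outside label "0".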
Full fields:
{
  "id": {
    "feature_type": "Value",
    "dtype": "string"
  },
  "tokens": {
    "feature_type": "Sequence",
    "feature": {
      "feature_type": "Value",
      "dtype": "string"
    }
  },
  "ner_tags": {
    "feature_type": "Sequence",
    "dtype": "int32",
    "feature": {
      "feature_type": "ClassLabel",
      "dtype": "int32",
      "class_names": ["0", "LOC", "MISC", "ORG", "PER"]
    }
  }
}
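As a rough illustration of what the field listing above implies for a single record, the sketch below checks an instance against it in plain Python. The function `validate_instance` is hypothetical and not part of the dataset or of any library. The rule that `tokens` and `ner_tags` have the same length is an assumption based on the tags being token-level; the listing itself does not state it.

```python
# Class names copied from the field listing; index = integer tag id.
CLASS_NAMES = ["0", "LOC", "MISC", "ORG", "PER"]


def validate_instance(instance, class_names=CLASS_NAMES):
    """Return a list of problems; an empty list means the instance fits.

    Hypothetical helper mirroring the three fields: id, tokens, ner_tags.
    """
    problems = []
    if not isinstance(instance.get("id"), str):
        problems.append("id must be a string")

    tokens = instance.get("tokens", [])
    tags = instance.get("ner_tags", [])

    if not all(isinstance(token, str) for token in tokens):
        problems.append("tokens must all be strings")
    if not all(isinstance(tag, int) and 0 <= tag < len(class_names)
               for tag in tags):
        problems.append("ner_tags must be integer ids of known classes")
    # Assumption: one tag per token.
    if len(tokens) != len(tags):
        problems.append("tokens and ner_tags must have the same length")
    return problems


good = {"id": "3", "tokens": ["Uppsala", "universitet"], "ner_tags": [3, 3]}
bad = {"id": 3, "tokens": ["Uppsala"], "ner_tags": [7, 0]}
print(validate_instance(good))
print(validate_instance(bad))
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a record at once.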
Initial Data Collection and Normalization: [More Information Needed]
Who are the source language producers? [More Information Needed]
Annotation process: [More Information Needed]
Who are the annotators? [More Information Needed]
The original dataset, which consists of news from Swedish newspapers' websites, was provided by Språkbanken.
https://github.com/klintan/swedish-ner-corpus/blob/master/LICENSE
Thanks to @abhishekkrthakur for adding this dataset.