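Datasets:

DFKI-SLT/few-nerd

Tasks:

Token Classification

Sub-tasks:

named-entity-recognition

Languages:

English

Multilinguality:

monolingual

Size:

100K<n<1M

Language creators:

found

Annotations creators:

expert-generated

Source datasets:

extended|wikipedia

Other:

structure-prediction

License:

cc-by-sa-4.0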

Dataset Card for "Few-NERD"

Dataset Summary

This script loads the Few-NERD dataset from https://ningding97.github.io/fewnerd/.
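
A minimal loading sketch, assuming the Hugging Face `datasets` library is installed. The configuration names `supervised`, `intra` and `inter` are an assumption (the size listing in this card uses `super`), and `load_few_nerd` is a hypothetical helper that only wraps `datasets.load_dataset`.

```python
# Assumed configuration names on the Hugging Face hub; check the dataset
# repository if one of them is rejected.
ASSUMED_CONFIGS = ("supervised", "intra", "inter")


def load_few_nerd(config: str = "supervised"):
    """Load one Few-NERD configuration (hypothetical helper)."""
    if config not in ASSUMED_CONFIGS:
        raise ValueError(
            f"unknown config {config!r}; expected one of {ASSUMED_CONFIGS}"
        )
    # Imported lazily so this module can be imported without `datasets`.
    from datasets import load_dataset

    return load_dataset("DFKI-SLT/few-nerd", config)


# Usage (needs network access on the first call):
# dataset = load_few_nerd("supervised")
# print(dataset["train"][0])
```

Depending on the installed `datasets` version, script-based datasets may need additional arguments; treat the call above as a starting point rather than a guaranteed invocation.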

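Few-NERD is a large-scale, fine-grained, manually annotated named entity recognition dataset. It contains 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities and 4,601,223 tokens. Three benchmark tasks are built on it: one supervised (Few-NERD (SUP)) and two few-shot (Few-NERD (INTRA) and Few-NERD (INTER)).

NER tags use the IO tagging scheme. The original data uses a two-column CoNLL-style format, with empty lines separating sentences. DOCSTART information is not provided because the sentences are randomly ordered.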
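
The two-column layout described above can be read with a few lines of Python. This is a sketch under assumptions: the column separator is taken to be whitespace with the label in the last field, and the sample text and its labels are made up for illustration rather than copied from the raw files.

```python
from typing import Iterable, List, Tuple

Sentence = Tuple[List[str], List[str]]  # (tokens, IO labels)


def parse_conll(lines: Iterable[str]) -> List[Sentence]:
    """Parse two-column CoNLL-style lines into (tokens, labels) pairs."""
    sentences: List[Sentence] = []
    tokens: List[str] = []
    labels: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # An empty line closes the current sentence.
            if tokens:
                sentences.append((tokens, labels))
                tokens, labels = [], []
            continue
        # Assumption: the label is the last whitespace-separated field.
        token, label = line.rsplit(maxsplit=1)
        tokens.append(token)
        labels.append(label)
    if tokens:  # the input may not end with an empty line
        sentences.append((tokens, labels))
    return sentences


# Made-up sample: two sentences separated by an empty line.
sample = "It\tO\nstarred\tO\nHicks\tperson\n\nParis\tlocation\n"
parsed = parse_conll(sample.splitlines())
print(parsed)
```

Because the scheme is IO rather than BIO, labels carry no `B-`/`I-` prefix, so the parser keeps each label string unchanged.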

For more details, see https://ningding97.github.io/fewnerd/ and https://aclanthology.org/2021.acl-long.248/ .

Supported Tasks and Leaderboards

Tasks: Named Entity Recognition, Few-shot NER
Leaderboards:
- https://ningding97.github.io/fewnerd/
- Named entity recognition: https://paperswithcode.com/sota/named-entity-recognition-on-few-nerd-sup
- Few-shot NER: https://paperswithcode.com/sota/few-shot-ner-on-few-nerd-intra
- Few-shot NER: https://paperswithcode.com/sota/few-shot-ner-on-few-nerd-inter

Languages

English

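Dataset Structure

Data Instances

Size of downloaded dataset files:
- super: 14.6 MB
- intra: 11.4 MB
- inter: 11.5 MB
Size of the generated dataset:
- super: 116.9 MB
- intra: 106.2 MB
- inter: 106.2 MB
Total amount of disk used: 366.8 MB

An example from 'train' looks as follows.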

{
    'id': '1', 
    'tokens': ['It', 'starred', 'Hicks', "'s", 'wife', ',', 'Ellaline', 'Terriss', 'and', 'Edmund', 'Payne', '.'], 
    'ner_tags': [0, 0, 7, 0, 0, 0, 7, 7, 0, 7, 7, 0], 
    'fine_ner_tags': [0, 0, 51, 0, 0, 0, 50, 50, 0, 50, 50, 0]
}

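Data Fields

The data fields are the same across all splits.

id: a string feature.
tokens: a list of string features.
ner_tags: a list of classification labels, with possible values O (0), art (1), building (2), event (3), location (4), organization (5), other (6), person (7), product (8).
fine_ner_tags: a list of fine-grained classification labels, with possible values including O (0), art-broadcastprogram (1), art-film (2), and so on.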
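
As an illustration of how the integer `ner_tags` map back to the coarse label names listed above, the sketch below decodes the 'train' example from this card and groups consecutive tokens with the same non-O tag into entity spans. The helper names `decode_coarse` and `io_spans` are hypothetical, and only the coarse label list is used because the full fine-grained list is not given here.

```python
# Coarse label names in index order, as listed in the data fields.
COARSE_LABELS = [
    "O", "art", "building", "event", "location",
    "organization", "other", "person", "product",
]

example = {
    "id": "1",
    "tokens": ["It", "starred", "Hicks", "'s", "wife", ",",
               "Ellaline", "Terriss", "and", "Edmund", "Payne", "."],
    "ner_tags": [0, 0, 7, 0, 0, 0, 7, 7, 0, 7, 7, 0],
}


def decode_coarse(tags):
    """Map integer coarse tags to their label names."""
    return [COARSE_LABELS[tag] for tag in tags]


def io_spans(tokens, tags):
    """Group runs of identical non-O tags into (text, label) spans."""
    spans = []
    current_tokens, current_tag = [], 0
    for token, tag in zip(tokens, tags):
        if tag != current_tag and current_tokens:
            # The tag changed, so the running span is complete.
            spans.append((" ".join(current_tokens), COARSE_LABELS[current_tag]))
            current_tokens = []
        if tag != 0:
            current_tokens.append(token)
        current_tag = tag
    if current_tokens:  # sentence ended inside an entity
        spans.append((" ".join(current_tokens), COARSE_LABELS[current_tag]))
    return spans


names = decode_coarse(example["ner_tags"])
spans = io_spans(example["tokens"], example["ner_tags"])
print(spans)
```

With the IO scheme, two adjacent entities of the same type cannot be told apart, so a grouping like this would merge them into one span.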

Data Splits

Task	Train	Dev	Test
SUP	131767	18824	37648
INTRA	99519	19358	44059
INTER	130112	18817	14007
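
To make the relative split sizes easier to compare, the short sketch below computes the total and the per-split fractions from the counts in the table. The dictionary layout and the helper `split_fractions` are illustrative choices, not part of the dataset loader.

```python
# Counts copied from the data splits table.
SPLITS = {
    "SUP": {"train": 131767, "dev": 18824, "test": 37648},
    "INTRA": {"train": 99519, "dev": 19358, "test": 44059},
    "INTER": {"train": 130112, "dev": 18817, "test": 14007},
}


def split_fractions(counts):
    """Return each split's share of the task total."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


for task, counts in SPLITS.items():
    fractions = split_fractions(counts)
    shares = ", ".join(f"{name}={share:.1%}" for name, share in fractions.items())
    print(f"{task}: total={sum(counts.values())} ({shares})")
```

For SUP this works out to roughly 70% / 10% / 20% for train / dev / test.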

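Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Additional Information

Dataset Curators

More Information Needed

Licensing Information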

CC BY-SA 4.0 license

Citation Information

@inproceedings{ding-etal-2021-nerd,
    title = "Few-{NERD}: A Few-shot Named Entity Recognition Dataset",
    author = "Ding, Ning  and
      Xu, Guangwei  and
      Chen, Yulin  and
      Wang, Xiaobin  and
      Han, Xu  and
      Xie, Pengjun  and
      Zheng, Haitao  and
      Liu, Zhiyuan",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.248",
    doi = "10.18653/v1/2021.acl-long.248",
    pages = "3198--3213",
}

Contributions

Author:

DFKI-SLT

Dataset size:

36.78 MB
