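Datasets:

gap

Tasks:

Token Classification

Sub-tasks:

coreference-resolution

Languages:

Multilinguality:

monolingual

Size Categories:

1K<n<10K

Language Creators:

found

Annotations Creators:

crowdsourced

Source Datasets:

original

ArXiv:

arxiv:1810.05201

License:

license:unknown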

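English

Dataset Card for "gap"

Dataset Summary

GAP is a gender-balanced dataset containing 8,908 coreference-labeled pairs of (ambiguous pronoun, antecedent name), sampled from Wikipedia and released by Google AI Language for the evaluation of coreference resolution in practical applications.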

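Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure

Data Instances

default

Size of downloaded dataset files: 2.40 MB
Size of the generated dataset: 2.43 MB
Total amount of disk used: 4.83 MB

An example of 'validation' looks as follows.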

{
    "A": "aliquam ultrices sagittis",
    "A-coref": false,
    "A-offset": 208,
    "B": "elementum curabitur vitae",
    "B-coref": false,
    "B-offset": 435,
    "ID": "validation-1",
    "Pronoun": "condimentum mattis pellentesque",
    "Pronoun-offset": 948,
    "Text": "Lorem ipsum dolor",
    "URL": "sem fringilla ut"
}
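
The two boolean flags are often easier to work with as a single label. The following is a minimal sketch, not part of the dataset itself: the function name `coref_label` and the "NEITHER" label are illustrative choices, and the code assumes that `A-coref` and `B-coref` are never both true for the same example.

```python
from typing import Mapping


def coref_label(example: Mapping[str, object]) -> str:
    """Collapse the two boolean flags into "A", "B", or "NEITHER"."""
    a = bool(example["A-coref"])
    b = bool(example["B-coref"])
    if a and b:
        # Assumption: the pronoun does not resolve to both candidates at once.
        raise ValueError("both A-coref and B-coref are true")
    if a:
        return "A"
    if b:
        return "B"
    return "NEITHER"


# Flags copied from the validation example above.
example = {"ID": "validation-1", "A-coref": False, "B-coref": False}
print(coref_label(example))  # NEITHER
```

Treating the task as a three-way choice keeps the case where the pronoun refers to neither candidate explicit instead of leaving it implied by two false flags.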

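Data Fields

The data fields are the same among all splits.

default

ID: a string feature.
Text: a string feature.
Pronoun: a string feature.
Pronoun-offset: an int32 feature.
A: a string feature.
A-offset: an int32 feature.
A-coref: a bool feature.
B: a string feature.
B-offset: an int32 feature.
B-coref: a bool feature.
URL: a string feature.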
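
The card does not say how the offset fields relate to Text. The sketch below assumes they are zero-based character offsets into Text, which is how the original GAP release describes them. The helper names `span_matches` and `check_offsets` and the toy record are hypothetical and are not taken from the dataset. The example record shown earlier in this card appears to use placeholder text, so this check would not pass on it.

```python
from typing import Dict, Mapping


def span_matches(text: str, mention: str, offset: int) -> bool:
    """Return True if `mention` occurs in `text` exactly at character `offset`."""
    return text[offset:offset + len(mention)] == mention


def check_offsets(example: Mapping[str, object]) -> Dict[str, bool]:
    """Check the Pronoun, A, and B mentions against their offset fields."""
    text = str(example["Text"])
    return {
        name: span_matches(text, str(example[name]), int(example[f"{name}-offset"]))
        for name in ("Pronoun", "A", "B")
    }


# Hypothetical record written for illustration; not taken from GAP.
toy = {
    "ID": "toy-1",
    "Text": "Alice met Carol before she left.",
    "Pronoun": "she",
    "Pronoun-offset": 23,
    "A": "Alice",
    "A-offset": 0,
    "A-coref": False,
    "B": "Carol",
    "B-offset": 10,
    "B-coref": True,
    "URL": "https://example.org/toy",
}

print(check_offsets(toy))
```

Running a check like this over a split before training is a cheap way to catch text normalization steps that silently shift the offsets.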

Data Splits

name	train	validation	test
default	2000	454	2000
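
The split sizes are consistent with the 8,908 pairs quoted in the summary, since each example carries two candidate names (A and B). A short arithmetic check, with the split sizes copied from the table above:

```python
# Split sizes as listed in the table above.
splits = {"train": 2000, "validation": 454, "test": 2000}

# Each example pairs one pronoun with two candidate names, A and B.
CANDIDATES_PER_EXAMPLE = 2

total_examples = sum(splits.values())
total_pairs = total_examples * CANDIDATES_PER_EXAMPLE

print(total_examples, total_pairs)  # 4454 8908
```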

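Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Additional Information

Dataset Curators

More Information Needed

Licensing Information

More Information Needed

Citation Information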

@article{webster-etal-2018-mind,
    title = "Mind the {GAP}: A Balanced Corpus of Gendered Ambiguous Pronouns",
    author = "Webster, Kellie  and
      Recasens, Marta  and
      Axelrod, Vera  and
      Baldridge, Jason",
    journal = "Transactions of the Association for Computational Linguistics",
    volume = "6",
    year = "2018",
    address = "Cambridge, MA",
    publisher = "MIT Press",
    url = "https://aclanthology.org/Q18-1042",
    doi = "10.1162/tacl_a_00240",
    pages = "605--617",
}

Contributions

Thanks to @thomwolf, @patrickvonplaten, @otakumesi, @lewtun for adding this dataset.

Author:

Anonymous

Dataset size:

15.41 KB
