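- Dataset: docred
- Tasks:
- Languages:
- Multilinguality: monolingual
- Size categories: 100K<n<1M
- Language creators: crowdsourced
- Annotations creators: expert-generated
- Source datasets: original
- ArXiv: arxiv:1906.06127
- License: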
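Multiple entities in a document generally exhibit complex inter-sentence relations, which existing relation extraction (RE) methods handle poorly because they typically focus on intra-sentence relations for single entity pairs. To accelerate research on document-level RE, we introduce DocRED, a dataset constructed from Wikipedia and Wikidata with three features:

- DocRED annotates both named entities and relations, and is the largest human-annotated dataset for document-level RE from plain text.
- DocRED requires reading multiple sentences in a document to extract entities and infer their relations by synthesizing all information in the document.
- Along with the human-annotated data, we also offer large-scale distantly supervised data, which enables DocRED to be used in both supervised and weakly supervised scenarios.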
An example from 'train_annotated' looks as follows.
{
"labels": {
"evidence": [[0]],
"head": [0],
"relation_id": ["P1"],
"relation_text": ["is_a"],
"tail": [0]
},
"sents": [["This", "is", "a", "sentence"], ["This", "is", "another", "sentence"]],
"title": "Title of the document",
"vertexSet": [[{
"name": "sentence",
"pos": [3],
"sent_id": 0,
"type": "NN"
}, {
"name": "sentence",
"pos": [3],
"sent_id": 1,
"type": "NN"
}], [{
"name": "This",
"pos": [0],
"sent_id": 0,
"type": "NN"
}]]
}
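To make the record layout concrete, here is a minimal sketch of how the parallel lists under `labels` could be joined with `vertexSet` to recover readable relation triples. It assumes that `head` and `tail` are indices into `vertexSet`, that each entity is a list of mentions, and that the first mention's `name` is an acceptable display name. The helper `relation_triples` is a hypothetical name introduced for illustration and is not part of the dataset or any library.

```python
from typing import Any, Dict, List, Tuple

Triple = Tuple[str, str, str, List[int]]


def relation_triples(record: Dict[str, Any]) -> List[Triple]:
    """Return (head name, relation id, tail name, evidence sentences) tuples.

    Hypothetical helper: assumes `head`/`tail` index into `vertexSet` and
    uses the first mention of each entity as its display name.
    """
    labels = record["labels"]
    entities = record["vertexSet"]
    triples: List[Triple] = []
    # The lists under `labels` are parallel: position i describes relation i.
    for head, tail, rel_id, evidence in zip(
        labels["head"], labels["tail"], labels["relation_id"], labels["evidence"]
    ):
        head_name = entities[head][0]["name"]
        tail_name = entities[tail][0]["name"]
        triples.append((head_name, rel_id, tail_name, evidence))
    return triples


# The example record shown above, copied verbatim.
record = {
    "labels": {
        "evidence": [[0]],
        "head": [0],
        "relation_id": ["P1"],
        "relation_text": ["is_a"],
        "tail": [0],
    },
    "sents": [["This", "is", "a", "sentence"], ["This", "is", "another", "sentence"]],
    "title": "Title of the document",
    "vertexSet": [
        [
            {"name": "sentence", "pos": [3], "sent_id": 0, "type": "NN"},
            {"name": "sentence", "pos": [3], "sent_id": 1, "type": "NN"},
        ],
        [{"name": "This", "pos": [0], "sent_id": 0, "type": "NN"}],
    ],
}

print(relation_triples(record))
```

In this toy record both `head` and `tail` point at entity 0, so the single triple links the entity "sentence" to itself; real records would normally connect distinct entities.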
The data fields are the same among all splits.
default

| name | train_annotated | train_distant | validation | test |
|---|---|---|---|---|
| default | 3053 | 101873 | 998 | 1000 |
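
As a quick illustration of how the human-annotated and distantly supervised training splits compare in size, the short sketch below works out the arithmetic from the table. The dictionary `split_sizes` is simply the table row retyped by hand, not something read from the dataset itself.

```python
# Split sizes copied from the table above (number of documents per split).
split_sizes = {
    "train_annotated": 3053,
    "train_distant": 101873,
    "validation": 998,
    "test": 1000,
}

# Total training documents across both supervision regimes.
train_total = split_sizes["train_annotated"] + split_sizes["train_distant"]

# Fraction of training documents that carry human annotations.
annotated_share = split_sizes["train_annotated"] / train_total

print(f"training documents: {train_total}")
print(f"human-annotated share: {annotated_share:.2%}")
```

The human-annotated split is only a small fraction of the available training documents, which is why the distantly supervised split matters for weakly supervised setups.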
@inproceedings{yao-etal-2019-docred,
title = "{D}oc{RED}: A Large-Scale Document-Level Relation Extraction Dataset",
author = "Yao, Yuan and
Ye, Deming and
Li, Peng and
Han, Xu and
Lin, Yankai and
Liu, Zhenghao and
Liu, Zhiyuan and
Huang, Lixin and
Zhou, Jie and
Sun, Maosong",
booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2019",
address = "Florence, Italy",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/P19-1074",
doi = "10.18653/v1/P19-1074",
pages = "764--777",
}
Thanks to @ghomasHudson, @thomwolf, @lhoestq for adding this dataset.