Dataset: DFKI-SLT/tacred
Multilinguality: monolingual
Size: 100K<n<1M
Language creators: found
Source datasets: extended|other
arXiv: 2104.08398
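The TAC Relation Extraction Dataset (TACRED) is a large-scale relation extraction dataset with 106,264 examples, built over the newswire and web text corpus used in the yearly TAC Knowledge Base Population (TAC KBP) challenges. Examples in TACRED cover the 41 relation types used in the TAC KBP challenges (e.g., per:schools_attended and org:members), or are labeled no_relation if no defined relation holds. The examples were created by combining available human annotations from the TAC KBP challenges with crowdsourcing. See Stanford's EMNLP paper or their EMNLP slides for details.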
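Note:
This repository provides all three versions of the dataset as BuilderConfigs: 'original', 'revisited', and 're-tacred'. To select a specific version, set the name parameter of the load_dataset method. The original TACRED is loaded by default.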
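As a minimal sketch of selecting a version, the helper below passes one of the three config names to `load_dataset`. The `load_tacred` wrapper is hypothetical, and the `data_dir` argument is an assumption: because TACRED is distributed through the LDC, the loader presumably needs the path to a locally downloaded copy.

```python
# Config names as listed in this dataset card.
TACRED_CONFIGS = ("original", "revisited", "re-tacred")


def load_tacred(version="original", data_dir=None):
    """Load one TACRED version (hypothetical helper, not part of the repository)."""
    if version not in TACRED_CONFIGS:
        raise ValueError(
            f"unknown version {version!r}; expected one of {TACRED_CONFIGS}"
        )
    # Imported lazily so the name check above works without `datasets` installed.
    from datasets import load_dataset

    # Assumption: data_dir points to a local copy of the LDC release.
    return load_dataset("DFKI-SLT/tacred", name=version, data_dir=data_dir)
```

A call such as `load_tacred("re-tacred", data_dir="path/to/tacred")` would then request the relabeled version; the path shown is a placeholder.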
The language in the dataset is English.
An example of 'train' looks as follows:
{
"id": "61b3a5c8c9a882dcfcd2",
"docid": "AFP_ENG_20070218.0019.LDC2009T13",
"relation": "org:founded_by",
"token": ["Tom", "Thabane", "resigned", "in", "October", "last", "year", "to", "form", "the", "All", "Basotho", "Convention", "-LRB-", "ABC", "-RRB-", ",", "crossing", "the", "floor", "with", "17", "members", "of", "parliament", ",", "causing", "constitutional", "monarch", "King", "Letsie", "III", "to", "dissolve", "parliament", "and", "call", "the", "snap", "election", "."],
"subj_start": 10,
"subj_end": 13,
"obj_start": 0,
"obj_end": 2,
"subj_type": "ORGANIZATION",
"obj_type": "PERSON",
"stanford_pos": ["NNP", "NNP", "VBD", "IN", "NNP", "JJ", "NN", "TO", "VB", "DT", "DT", "NNP", "NNP", "-LRB-", "NNP", "-RRB-", ",", "VBG", "DT", "NN", "IN", "CD", "NNS", "IN", "NN", ",", "VBG", "JJ", "NN", "NNP", "NNP", "NNP", "TO", "VB", "NN", "CC", "VB", "DT", "NN", "NN", "."],
"stanford_ner": ["PERSON", "PERSON", "O", "O", "DATE", "DATE", "DATE", "O", "O", "O", "O", "O", "O", "O", "ORGANIZATION", "O", "O", "O", "O", "O", "O", "NUMBER", "O", "O", "O", "O", "O", "O", "O", "O", "PERSON", "PERSON", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
"stanford_head": [2, 3, 0, 5, 3, 7, 3, 9, 3, 13, 13, 13, 9, 15, 13, 15, 3, 3, 20, 18, 23, 23, 18, 25, 23, 3, 3, 32, 32, 32, 32, 27, 34, 27, 34, 34, 34, 40, 40, 37, 3],
"stanford_deprel": ["compound", "nsubj", "ROOT", "case", "nmod", "amod", "nmod:tmod", "mark", "xcomp", "det", "compound", "compound", "dobj", "punct", "appos", "punct", "punct", "xcomp", "det", "dobj", "case", "nummod", "nmod", "case", "nmod", "punct", "xcomp", "amod", "compound", "compound", "compound", "dobj", "mark", "xcomp", "dobj", "cc", "conj", "det", "compound", "dobj", "punct"]
}
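To illustrate how the span fields relate to the token list, here is a small sketch that recovers the subject and object mentions from a record. It assumes that the end indices are exclusive, which is what the sample above suggests (subj_start 10 and subj_end 13 cover the three tokens of "All Basotho Convention"). The helper name `entity_mentions` is hypothetical.

```python
def entity_mentions(example):
    """Return the subject and object mention strings of one example."""
    tokens = example["token"]
    # Assumption: end indices are exclusive, as in the sample record,
    # where subj_start=10 and subj_end=13 cover three tokens.
    subj = " ".join(tokens[example["subj_start"]:example["subj_end"]])
    obj = " ".join(tokens[example["obj_start"]:example["obj_end"]])
    return subj, obj


# First 13 tokens of the sample record; the rest is omitted for brevity.
sample = {
    "token": ["Tom", "Thabane", "resigned", "in", "October", "last", "year",
              "to", "form", "the", "All", "Basotho", "Convention"],
    "subj_start": 10, "subj_end": 13,
    "obj_start": 0, "obj_end": 2,
    "relation": "org:founded_by",
}

subj, obj = entity_mentions(sample)
print(f"{subj} --{sample['relation']}--> {obj}")
```

For the sample record this prints `All Basotho Convention --org:founded_by--> Tom Thabane`, matching the subj_type ORGANIZATION and obj_type PERSON fields.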
The data fields are the same among all splits.
To minimize dataset bias, TACRED is stratified across the years in which the TAC KBP challenge was run:
| | Train | Dev | Test |
|---|---|---|---|
| TACRED | 68,124 (TAC KBP 2009-2012) | 22,631 (TAC KBP 2013) | 15,509 (TAC KBP 2014) |
| Re-TACRED | 58,465 (TAC KBP 2009-2012) | 19,584 (TAC KBP 2013) | 13,418 (TAC KBP 2014) |
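As a quick sanity check on the table, the sketch below sums the split sizes and computes each split's share. The TACRED splits add up to the 106,264 examples stated in the description. The helper name `split_shares` is hypothetical.

```python
# Split sizes copied from the table above.
SPLITS = {
    "TACRED": {"train": 68_124, "dev": 22_631, "test": 15_509},
    "Re-TACRED": {"train": 58_465, "dev": 19_584, "test": 13_418},
}


def split_shares(sizes):
    """Return the total number of examples and each split's fraction of it."""
    total = sum(sizes.values())
    return total, {name: n / total for name, n in sizes.items()}


for version, sizes in SPLITS.items():
    total, shares = split_shares(sizes)
    summary = ", ".join(f"{name} {share:.1%}" for name, share in shares.items())
    print(f"{version}: {total:,} examples ({summary})")
```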
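[More Information Needed]
Initial Data Collection and Normalization
[More Information Needed]
Who are the source language producers? [More Information Needed]
Please see the Stanford paper and the TACRED Revisited paper, plus their appendices.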
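To ensure that models trained on TACRED are not biased towards predicting false positives on real-world text, all sampled sentences in which no relation was found between the mention pair were fully annotated as negative examples. As a result, 79.5% of the examples are labeled no_relation.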
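Given this imbalance, it can be useful to inspect label frequencies before training. The following is a minimal sketch of measuring each relation's share in a list of `relation` values. The helper `relation_shares` is hypothetical, and the labels are toy values for illustration, not real TACRED data.

```python
from collections import Counter


def relation_shares(relations):
    """Map each relation label to its fraction of all examples."""
    counts = Counter(relations)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders labels from most to least frequent.
    return {label: n / total for label, n in counts.most_common()}


# Toy labels for illustration only, not real TACRED data.
toy_relations = ["no_relation"] * 4 + ["org:founded_by"]
shares = relation_shares(toy_relations)
print(f"no_relation share: {shares['no_relation']:.1%}")
```

On a loaded split, the same function could be applied to the values of the `relation` field.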
Who are the annotators? [More Information Needed]
[More Information Needed]
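To respect the copyright of the underlying TAC KBP corpus, TACRED is released via the Linguistic Data Consortium (LDC License). You can download TACRED from the LDC TACRED webpage. Access is free for LDC members; otherwise, an access fee of $25 applies.
The original dataset: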
@inproceedings{zhang2017tacred,
author = {Zhang, Yuhao and Zhong, Victor and Chen, Danqi and Angeli, Gabor and Manning, Christopher D.},
booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017)},
title = {Position-aware Attention and Supervised Data Improve Slot Filling},
url = {https://nlp.stanford.edu/pubs/zhang2017tacred.pdf},
pages = {35--45},
year = {2017}
}
For the revised version ("revisited"), please also cite:
@inproceedings{alt-etal-2020-tacred,
title = "{TACRED} Revisited: A Thorough Evaluation of the {TACRED} Relation Extraction Task",
author = "Alt, Christoph and
Gabryszak, Aleksandra and
Hennig, Leonhard",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.142",
doi = "10.18653/v1/2020.acl-main.142",
pages = "1558--1569",
}
For the relabeled version ("re-tacred"), please also cite:
@inproceedings{DBLP:conf/aaai/StoicaPP21,
author = {George Stoica and
Emmanouil Antonios Platanios and
Barnab{\'{a}}s P{\'{o}}czos},
title = {Re-TACRED: Addressing Shortcomings of the {TACRED} Dataset},
booktitle = {Thirty-Fifth {AAAI} Conference on Artificial Intelligence, {AAAI}
2021, Thirty-Third Conference on Innovative Applications of Artificial
Intelligence, {IAAI} 2021, The Eleventh Symposium on Educational Advances
in Artificial Intelligence, {EAAI} 2021, Virtual Event, February 2-9,
2021},
pages = {13843--13850},
publisher = {{AAAI} Press},
year = {2021},
url = {https://ojs.aaai.org/index.php/AAAI/article/view/17631},
}