数据集:
conll2012_ontonotesv5
任务:
计算机处理:
multilingual大小:
10K<n<100K语言创建人:
found批注创建人:
expert-generated源数据集:
original许可:
OntoNotes v5.0是OntoNotes语料库的最终版本,是一个手动注释的大规模、多类型、多语言语料库,包含句法、语义和话语信息。
这个数据集是OntoNotes v5.0的扩展版本,用于CoNLL-2012共享任务。它包括英语/中文/阿拉伯语的v4训练/开发和v9测试数据,以及修正版本v12训练/开发/测试数据(仅英语)。
数据的来源是Mendeley Data repo ontonotes-conll2012 ,它似乎与官方数据相同,但用户使用此数据集需自行负责。
另请参阅paperwithcode OntoNotes 5.0 和 CoNLL-2012 的摘要。
对于数据集的更详细信息,如注释、标签集等,可以参考上述Mendeley repo中的文档。
阿拉伯语、中文、英语的V4数据,以及英语的V12数据
{
  {'document_id': 'nw/wsj/23/wsj_2311',
 'sentences': [{'part_id': 0,
                'words': ['CONCORDE', 'trans-Atlantic', 'flights', 'are', '$', '2, 'to', 'Paris', 'and', '$', '3, 'to', 'London', '.']},
                'pos_tags': [25, 18, 27, 43, 2, 12, 17, 25, 11, 2, 12, 17, 25, 7],
                'parse_tree': '(TOP(S(NP (NNP CONCORDE)  (JJ trans-Atlantic)  (NNS flights) )(VP (VBP are) (NP(NP(NP ($ $)  (CD 2,400) )(PP (IN to) (NP (NNP Paris) ))) (CC and) (NP(NP ($ $)  (CD 3,200) )(PP (IN to) (NP (NNP London) ))))) (. .) ))',
                'predicate_lemmas': [None, None, None, 'be', None, None, None, None, None, None, None, None, None, None],
                'predicate_framenet_ids': [None, None, None, '01', None, None, None, None, None, None, None, None, None, None],
                'word_senses': [None, None, None, None, None, None, None, None, None, None, None, None, None, None],
                'speaker': None,
                'named_entities': [7, 6, 0, 0, 0, 15, 0, 5, 0, 0, 15, 0, 5, 0],
                'srl_frames': [{'frames': ['B-ARG1', 'I-ARG1', 'I-ARG1', 'B-V', 'B-ARG2', 'I-ARG2', 'I-ARG2', 'I-ARG2', 'I-ARG2', 'I-ARG2', 'I-ARG2', 'I-ARG2', 'I-ARG2', 'O'],
                                'verb': 'are'}],
                'coref_spans': [],
               {'part_id': 0,
                'words': ['In', 'a', 'Centennial', 'Journal', 'article', 'Oct.', '5', ',', 'the', 'fares', 'were', 'reversed', '.']}]}
                'pos_tags': [17, 13, 25, 25, 24, 25, 12, 4, 13, 27, 40, 42, 7],
                'parse_tree': '(TOP(S(PP (IN In) (NP (DT a) (NML (NNP Centennial)  (NNP Journal) ) (NN article) ))(NP (NNP Oct.)  (CD 5) ) (, ,) (NP (DT the)  (NNS fares) )(VP (VBD were) (VP (VBN reversed) )) (. .) ))',
                'predicate_lemmas': [None, None, None, None, None, None, None, None, None, None, None, 'reverse', None],
                'predicate_framenet_ids': [None, None, None, None, None, None, None, None, None, None, None, '01', None],
                'word_senses': [None, None, None, None, None, None, None, None, None, None, None, None, None],
                'speaker': None,
                'named_entities': [0, 0, 4, 22, 0, 12, 30, 0, 0, 0, 0, 0, 0],
                'srl_frames': [{'frames': ['B-ARGM-LOC', 'I-ARGM-LOC', 'I-ARGM-LOC', 'I-ARGM-LOC', 'I-ARGM-LOC', 'B-ARGM-TMP', 'I-ARGM-TMP', 'O', 'B-ARG1', 'I-ARG1', 'O', 'B-V', 'O'],
                                'verb': 'reversed'}],
                'coref_spans': [],
}
 sentences中的每个元素都是一个由以下数据字段组成的字典:
每个数据集(arabic_v4,chinese_v4,english_v4,english_v12)都有3个拆分: 训练、验证和测试
[需要更多信息]
[需要更多信息]
谁是源语言的生产者?[需要更多信息]
[需要更多信息]
谁是注释者?[需要更多信息]
[需要更多信息]
[需要更多信息]
[需要更多信息]
[需要更多信息]
[需要更多信息]
[需要更多信息]
@inproceedings{pradhan-etal-2013-towards,
    title = "Towards Robust Linguistic Analysis using {O}nto{N}otes",
    author = {Pradhan, Sameer  and
      Moschitti, Alessandro  and
      Xue, Nianwen  and
      Ng, Hwee Tou  and
      Bj{\"o}rkelund, Anders  and
      Uryupina, Olga  and
      Zhang, Yuchen  and
      Zhong, Zhi},
    booktitle = "Proceedings of the Seventeenth Conference on Computational Natural Language Learning",
    month = aug,
    year = "2013",
    address = "Sofia, Bulgaria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W13-3516",
    pages = "143--152",
}
 感谢 @richarddwang 添加了这个数据集。