
Dataset Card for BLURB

Dataset Summary

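BLURB is a collection of resources for biomedical natural language processing. In general domains such as newswire and the Web, comprehensive benchmarks and leaderboards such as GLUE have greatly accelerated progress in open-domain NLP. In biomedicine, however, such resources are noticeably scarce. There have been many biomedical NLP shared tasks in the past, such as BioCreative, the BioNLP Shared Tasks, SemEval, and BioASQ, to name just a few. These efforts have played a significant role in fueling interest and progress in the research community, but they typically focus on individual tasks. The advent of neural language models such as BERT provides a unifying foundation for leveraging transfer learning from unlabeled text to support a wide range of NLP applications. To accelerate progress in biomedical pretraining strategies and task-specific methods, it is therefore necessary to create a broad-coverage benchmark encompassing diverse biomedical tasks.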

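Inspired by prior efforts in this direction (e.g., BLUE), we have created BLURB (short for Biomedical Language Understanding and Reasoning Benchmark). BLURB comprises a comprehensive benchmark for PubMed-based biomedical NLP applications, as well as a leaderboard for tracking progress by the community. BLURB includes thirteen publicly available datasets in six diverse tasks. To avoid placing undue emphasis on tasks with many available datasets, such as named entity recognition (NER), BLURB reports the macro average across all tasks as the main score. The BLURB leaderboard is model-agnostic. Any system capable of producing the test predictions using the same training and development data can participate. The main goal of BLURB is to lower the entry barrier in biomedical NLP and help accelerate progress in this vitally important field for positive societal and human impact.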
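The scoring rule described above can be sketched as follows. This is a minimal illustration, assuming that each task's score is the plain mean of its dataset scores and that the main score is the unweighted mean of the task scores. The function name `blurb_macro_score` and all numbers are hypothetical and are not leaderboard results.

```python
from statistics import mean


def blurb_macro_score(scores_by_task):
    """Average dataset scores within each task, then average across tasks.

    scores_by_task maps a task name to a dict of {dataset name: score}.
    Returns (overall macro average, per-task averages).
    """
    task_means = {
        task: mean(list(dataset_scores.values()))
        for task, dataset_scores in scores_by_task.items()
    }
    return mean(list(task_means.values())), task_means


# Made-up scores, for illustration only (not real results).
example_scores = {
    "NER": {
        "BC5-chem": 90.0,
        "BC5-disease": 80.0,
        "NCBI-disease": 85.0,
        "BC2GM": 85.0,
        "JNLPBA": 80.0,
    },
    "Sentence Similarity": {"BIOSSES": 90.0},
    "Question Answering": {"PubMedQA": 60.0, "BioASQ": 80.0},
}

overall, per_task = blurb_macro_score(example_scores)
print(f"macro average over tasks: {overall:.2f}")
for task, score in per_task.items():
    print(f"  {task}: {score:.2f}")
```

Averaging per task first means the five NER datasets together count as much as the single sentence-similarity dataset, which is the effect the paragraph above describes.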

BC5-chem

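The corpus consists of three separate sets of articles annotated with diseases, chemicals, and their relations. The training set (500 articles) and development set (500 articles) were released to task participants in advance to support the development of text-mining methods. The test set (500 articles) was used for the final evaluation of system performance.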

BC5-disease

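The corpus consists of three separate sets of articles annotated with diseases, chemicals, and their relations. The training set (500 articles) and development set (500 articles) were released to task participants in advance to support the development of text-mining methods. The test set (500 articles) was used for the final evaluation of system performance.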

BC2GM

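This is the BioCreative II Gene Mention task. The training corpus for the current task consists mainly of the training and testing corpora (text collections) from the BCI task, and the testing corpus for the current task consists of an additional 5,000 sentences that were held in reserve from the previous task. In the current corpus, tokenization is not provided; instead, participants are asked to identify a gene mention in a sentence by giving its start and end characters. As before, the training set consists of a set of sentences and, for each sentence, a set of gene mentions (GENE annotations).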

NCBI Disease

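The NCBI disease corpus is fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community.

Corpus characteristics:
  • 793 PubMed abstracts
  • 6,892 disease mentions
  • 790 unique disease concepts
    • Medical Subject Headings (MeSH®)
    • Online Mendelian Inheritance in Man (OMIM®)
  • 91% of the mentions map to a single disease concept
  • Divided into training, development, and testing sets

Corpus annotation:
  • Fourteen annotators
  • Two annotators per document (randomly paired)
  • Three annotation phases
  • Checked for corpus-wide consistency of annotations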

JNLPBA

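The BioNLP / JNLPBA Shared Task 2004 involves the identification and classification of technical terms referring to concepts of interest to biologists in the domain of molecular biology. The task was organized by the GENIA Project and is based on the annotations of the GENIA Term corpus (version 3.02). Corpus format: the JNLPBA corpus is distributed in IOB format, with each line containing a single token and its tag, separated by a tab character. Sentences are separated by blank lines.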
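The corpus format described above can be read with a few lines of code. The following is a minimal sketch, assuming exactly one tab between token and tag and blank lines between sentences; the function name `parse_iob` and the sample text are made up for illustration and are not taken from the corpus files.

```python
def parse_iob(text):
    """Parse IOB-formatted text into a list of sentences.

    Each sentence is a list of (token, tag) tuples. A blank line ends
    the current sentence.
    """
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # Blank line: close the sentence collected so far, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last tab so a token containing a tab stays intact.
        token, tag = line.rsplit("\t", 1)
        current.append((token, tag))
    if current:
        # The file may not end with a blank line.
        sentences.append(current)
    return sentences


# Hypothetical sample in the described format (not real corpus content).
sample = (
    "IL-2\tB-DNA\n"
    "gene\tI-DNA\n"
    "expression\tO\n"
    "\n"
    "T\tB-cell_type\n"
    "cells\tI-cell_type\n"
)

parsed = parse_iob(sample)
for sentence in parsed:
    print(sentence)
```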

EBM PICO
  • Homepage:
  • Repository:
  • Paper:
  • Leaderboard:
ChemProt
  • Homepage:
  • Repository:
  • Paper:
DDI
  • Homepage:
  • Repository:
  • Paper:
GAD
  • Homepage:
  • Repository:
  • Paper:
BIOSSES

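BIOSSES is a benchmark dataset for biomedical sentence similarity estimation. The dataset comprises 100 sentence pairs, in which each sentence was selected from the TAC (Text Analysis Conference) Biomedical Summarization Track Training Dataset, a collection of articles from the biomedical domain. The sentence pairs in BIOSSES were selected from citing sentences, i.e. sentences that contain a citation to a reference article. The sentence pairs were evaluated by five different human experts, who judged their similarity and gave scores ranging from 0 (no relation) to 4 (equivalent). In the original paper, the mean of the scores assigned by the five human annotators was taken as the gold standard. The Pearson correlation between the gold standard scores and the scores estimated by the models was used as the evaluation metric. The strength of correlation can be assessed using the general guideline proposed by Evans (1996), as follows:

  • very strong: 0.80 to 1.00
  • strong: 0.60 to 0.79
  • moderate: 0.40 to 0.59
  • weak: 0.20 to 0.39
  • very weak: 0.00 to 0.19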
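The evaluation procedure described above (mean of five annotator scores as gold standard, Pearson correlation as metric) can be sketched in plain Python. The annotator scores and model predictions below are invented for illustration, and the helper name `pearson` is an assumption; in practice a library routine such as `scipy.stats.pearsonr` would serve the same purpose.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores from five annotators for four sentence pairs (0 to 4).
annotator_scores = [
    [4, 4, 3, 4, 4],
    [0, 1, 0, 0, 1],
    [2, 3, 2, 2, 2],
    [1, 1, 2, 1, 1],
]

# Gold standard: the mean of the five annotator scores for each pair.
gold = [mean(scores) for scores in annotator_scores]

# Hypothetical model estimates for the same four pairs.
predicted = [3.5, 0.9, 2.0, 1.8]

r = pearson(gold, predicted)
print(f"Pearson r = {r:.3f}")
```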

HoC
  • Homepage:
  • Repository:
  • Paper:
  • Leaderboard:
  • Point of Contact:
PubMedQA

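We introduce PubMedQA, a novel biomedical question answering (QA) dataset collected from PubMed abstracts. The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?) using the corresponding abstracts. PubMedQA has 1k expert-annotated, 61.2k unlabeled, and 211.3k artificially generated QA instances. Each PubMedQA instance is composed of (1) a question, which is either an existing research article title or derived from one, (2) a context, which is the corresponding abstract without its conclusion, (3) a long answer, which is the conclusion of the abstract and presumably answers the research question, and (4) a yes/no/maybe answer, which summarizes the conclusion. PubMedQA is the first QA dataset where reasoning over biomedical research texts, especially their quantitative contents, is required to answer the questions. Our best-performing model, multi-phase fine-tuning of BioBERT with long-answer bag-of-words statistics as additional supervision, achieves 68.1% accuracy, compared to single-human performance of 78.0% accuracy and a majority baseline of 55.2% accuracy, leaving much room for improvement. PubMedQA is publicly available at this https URL.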
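To make the accuracy and majority-baseline figures above concrete, here is a minimal sketch of how the two quantities can be computed for yes/no/maybe labels. The labels and predictions are invented, and the helper names are hypothetical; this is not the official PubMedQA evaluation script.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Fraction of predictions that exactly match the gold labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def majority_baseline(gold):
    """Accuracy obtained by always predicting the most frequent gold label."""
    label, _count = Counter(gold).most_common(1)[0]
    return label, accuracy(gold, [label] * len(gold))


# Invented labels and predictions, for illustration only.
gold_labels = ["yes", "no", "yes", "maybe", "yes", "no"]
model_preds = ["yes", "no", "no", "maybe", "yes", "yes"]

model_acc = accuracy(gold_labels, model_preds)
majority_label, majority_acc = majority_baseline(gold_labels)
print(f"model accuracy: {model_acc:.3f}")
print(f"majority baseline ({majority_label!r}): {majority_acc:.3f}")
```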

BioASQ

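Task 7b uses benchmark datasets containing training and test biomedical questions, in English, along with gold standard (reference) answers. Participants have to respond to each test question with relevant concepts (from designated terminologies and ontologies), relevant articles (in English, from designated article repositories), relevant snippets (from the relevant articles), relevant RDF triples (from designated ontologies), exact answers (e.g., named entities in the case of factoid questions), and "ideal" answers (English paragraph-sized summaries). 2,747 training questions (previously used as dry-run or test questions) are already available, along with their gold standard answers (relevant concepts, articles, snippets, exact answers, summaries).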

Supported Tasks and Leaderboards

| Dataset | Task | Train | Dev | Test | Evaluation Metrics | Added |
|---|---|---|---|---|---|---|
| BC5-chem | NER | 5203 | 5347 | 5385 | F1 entity-level | Yes |
| BC5-disease | NER | 4182 | 4244 | 4424 | F1 entity-level | Yes |
| NCBI-disease | NER | 5134 | 787 | 960 | F1 entity-level | Yes |
| BC2GM | NER | 15197 | 3061 | 6325 | F1 entity-level | Yes |
| JNLPBA | NER | 46750 | 4551 | 8662 | F1 entity-level | Yes |
| EBM PICO | PICO | 339167 | 85321 | 16364 | Macro F1 word-level | No |
| ChemProt | Relation Extraction | 18035 | 11268 | 15745 | Micro F1 | No |
| DDI | Relation Extraction | 25296 | 2496 | 5716 | Micro F1 | No |
| GAD | Relation Extraction | 4261 | 535 | 534 | Micro F1 | No |
| BIOSSES | Sentence Similarity | 64 | 16 | 20 | Pearson | Yes |
| HoC | Document Classification | 1295 | 186 | 371 | Average Micro F1 | No |
| PubMedQA | Question Answering | 450 | 50 | 500 | Accuracy | Yes |
| BioASQ | Question Answering | 670 | 75 | 140 | Accuracy | No |

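Datasets used in the BLURB biomedical NLP benchmark. The Train, Dev, and Test splits might not be exactly identical to those proposed in BLURB; this still needs to be verified.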
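The NER datasets in the table are scored with entity-level F1. The sketch below shows one common way to compute such a score, assuming string BIO tags (for example `B-Disease`, `I-Disease`, `O`) and exact matching of span boundaries and type. The tag names and helper functions are assumptions for illustration, and the official BLURB scorer may differ in its details.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans in one BIO tag sequence.

    end is exclusive. An I- tag that does not continue a span of the
    same type is ignored (a strict reading of the BIO scheme).
    """
    entities, start, etype = set(), None, None
    # A trailing "O" sentinel closes any span that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            entities.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def entity_f1(gold_seqs, pred_seqs):
    """Entity-level F1 over parallel lists of BIO tag sequences."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = extract_entities(gold), extract_entities(pred)
        tp += len(g & p)  # only exact span and type matches count
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Invented example: the second gold entity is missed by the prediction.
gold_tags = [["B-Disease", "I-Disease", "O", "B-Disease"]]
pred_tags = [["B-Disease", "I-Disease", "O", "O"]]
f1 = entity_f1(gold_tags, pred_tags)
print(f"entity-level F1 = {f1:.3f}")
```

Because matching is done on whole spans, a prediction that covers only part of an entity earns no credit, which is what distinguishes entity-level scoring from token-level scoring.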

Languages

English, from biomedical texts

Dataset Structure

Data Instances

  • NER

    {
      'id': 0,
      'tokens': [ "DPP6", "as", "a", "candidate", "gene", "for", "neuroleptic", "-", "induced", "tardive", "dyskinesia", "." ],
      'ner_tags': [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
    }
    
  • PICO

    {
      'TBD'
    }
    
  • Relation Extraction

    {
      'TBD'
    }
    
  • Sentence Similarity

    {'sentence 1': 'Here, looking for agents that could specifically kill KRAS mutant cells, they found that knockdown of GATA2 was synthetically lethal with KRAS mutation',
     'sentence 2': 'Not surprisingly, GATA2 knockdown in KRAS mutant cells resulted in a striking reduction of active GTP-bound RHO proteins, including the downstream ROCK kinase',
     'score': 2.2}
    
  • Document Classification

    {
      'TBD'
    }
    
  • Question Answering

    • PubMedQA
      {'context': {'contexts': ['Programmed cell death (PCD) is the regulated death of cells within an organism. The lace plant (Aponogeton madagascariensis) produces perforations in its leaves through PCD. The leaves of the plant consist of a latticework of longitudinal and transverse veins enclosing areoles. PCD occurs in the cells at the center of these areoles and progresses outwards, stopping approximately five cells from the vasculature. The role of mitochondria during PCD has been recognized in animals; however, it has been less studied during PCD in plants.',
         'The following paper elucidates the role of mitochondrial dynamics during developmentally regulated PCD in vivo in A. madagascariensis. A single areole within a window stage leaf (PCD is occurring) was divided into three areas based on the progression of PCD; cells that will not undergo PCD (NPCD), cells in early stages of PCD (EPCD), and cells in late stages of PCD (LPCD). Window stage leaves were stained with the mitochondrial dye MitoTracker Red CMXRos and examined. Mitochondrial dynamics were delineated into four categories (M1-M4) based on characteristics including distribution, motility, and membrane potential (ΔΨm). A TUNEL assay showed fragmented nDNA in a gradient over these mitochondrial stages. Chloroplasts and transvacuolar strands were also examined using live cell imaging. The possible importance of mitochondrial permeability transition pore (PTP) formation during PCD was indirectly examined via in vivo cyclosporine A (CsA) treatment. This treatment resulted in lace plant leaves with a significantly lower number of perforations compared to controls, and that displayed mitochondrial dynamics similar to that of non-PCD cells.'],
        'labels': ['BACKGROUND', 'RESULTS'],
        'meshes': ['Alismataceae',
         'Apoptosis',
         'Cell Differentiation',
         'Mitochondria',
         'Plant Leaves'],
        'reasoning_free_pred': ['y', 'e', 's'],
        'reasoning_required_pred': ['y', 'e', 's']},
       'final_decision': 'yes',
       'long_answer': 'Results depicted mitochondrial dynamics in vivo as PCD progresses within the lace plant, and highlight the correlation of this organelle with other organelles during developmental PCD. To the best of our knowledge, this is the first report of mitochondria and chloroplasts moving on transvacuolar strands to form a ring structure surrounding the nucleus during developmental PCD. Also, for the first time, we have shown the feasibility for the use of CsA in a whole plant system. Overall, our findings implicate the mitochondria as playing a critical and early role in developmentally regulated PCD in the lace plant.',
       'pubid': 21645374,
       'question': 'Do mitochondria play a role in remodelling lace plant leaves during programmed cell death?'}
    

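Data Fields

  • NER
    • id : string
    • ner_tags : Sequence[ClassLabel]
    • tokens : Sequence[String]
  • PICO
    • To be added
  • Relation Extraction
    • To be added
  • Sentence Similarity
    • sentence 1 : string
    • sentence 2 : string
    • score : float, ranging from 0 (no relation) to 4 (equivalent)
  • Document Classification
    • To be added
  • Question Answering
    • PubMedQA
      • pubid : integer
      • question : string
      • context : sequences of strings under the keys [ contexts , labels , meshes , reasoning_required_pred , reasoning_free_pred ]
      • long_answer : string
      • final_decision : string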
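The PubMedQA fields listed above can be unpacked as in the following sketch. It assumes an instance shaped like the example in the Data Instances section, where the two prediction fields appear as lists of single characters; the context strings here are abbreviated from that example, and the helper name `summarize_pubmedqa` is hypothetical.

```python
def summarize_pubmedqa(instance):
    """Flatten one PubMedQA instance into a simpler dictionary.

    Pairs each context passage with its section label and joins the
    character lists of the two prediction fields back into strings.
    """
    ctx = instance["context"]
    return {
        "pubid": instance["pubid"],
        "question": instance["question"],
        "sections": list(zip(ctx["labels"], ctx["contexts"])),
        "meshes": list(ctx["meshes"]),
        "reasoning_required_pred": "".join(ctx["reasoning_required_pred"]),
        "reasoning_free_pred": "".join(ctx["reasoning_free_pred"]),
        "final_decision": instance["final_decision"],
    }


# Abbreviated copy of the example instance (long strings shortened here).
instance = {
    "context": {
        "contexts": [
            "Programmed cell death (PCD) is the regulated death of cells within an organism.",
            "The following paper elucidates the role of mitochondrial dynamics during developmentally regulated PCD.",
        ],
        "labels": ["BACKGROUND", "RESULTS"],
        "meshes": ["Alismataceae", "Apoptosis", "Cell Differentiation", "Mitochondria", "Plant Leaves"],
        "reasoning_free_pred": ["y", "e", "s"],
        "reasoning_required_pred": ["y", "e", "s"],
    },
    "final_decision": "yes",
    "long_answer": "Results depicted mitochondrial dynamics in vivo as PCD progresses within the lace plant.",
    "pubid": 21645374,
    "question": "Do mitochondria play a role in remodelling lace plant leaves during programmed cell death?",
}

summary = summarize_pubmedqa(instance)
for label, passage in summary["sections"]:
    print(f"[{label}] {passage}")
print("final decision:", summary["final_decision"])
```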
    

Data Splits

See the table in Supported Tasks and Leaderboards.

Dataset Creation

Curation Rationale

  • BC5-chem
  • BC5-disease
  • BC2GM
  • JNLPBA
  • EBM PICO
  • ChemProt
  • DDI
  • GAD
  • BIOSSES
  • HoC
  • PubMedQA
  • BioASQ

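Source Data

[More Information Needed]

Annotations

All the datasets have been obtained and annotated by experts in the biomedical domain. Check the individual citations for further details.

Annotation process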
  • BC5-chem
  • BC5-disease
  • BC2GM
  • JNLPBA
  • EBM PICO
  • ChemProt
  • DDI
  • GAD
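  • BIOSSES - The sentence pairs were evaluated by five different human experts, who judged their similarity and gave scores ranging from 0 (no relation) to 4 (equivalent). The score range was described following the guidelines of SemEval 2012 Task 6 on STS (Agirre et al., 2012). Besides the annotation instructions, the annotators were given example sentences from the biomedical literature for each of the similarity degrees.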
  • HoC
  • PubMedQA
  • BioASQ

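Dataset Curators

All the datasets have been obtained and annotated by experts in the biomedical domain. Check the individual citations for further details.

Licensing Information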

  • BC5-chem
  • BC5-disease
  • BC2GM
  • JNLPBA
  • EBM PICO
  • ChemProt
  • DDI
  • GAD
  • BIOSSES - BIOSSES is made available under the terms of The GNU Common Public License v.3.0.
  • HoC
  • PubMedQA - MIT License, Copyright (c) 2019 pubmedqa
  • BioASQ

Citation Information

  • BC5-chem & BC5-disease
@article{article,
           author = {Li, Jiao and Sun, Yueping and Johnson, Robin and Sciaky, Daniela and Wei, Chih-Hsuan and Leaman, Robert and Davis, Allan Peter and Mattingly, Carolyn and Wiegers, Thomas and Lu, Zhiyong},
           year = {2016},
           month = {05},
           pages = {baw068},
           title = {BioCreative V CDR task corpus: a resource for chemical disease relation extraction},
           volume = {2016},
           journal = {Database},
           doi = {10.1093/database/baw068}
           }
  • BC2GM
 @article{article,
             author = {Smith, Larry and Tanabe, Lorraine and Ando, Rie and Kuo, Cheng-Ju and Chung, I-Fang and Hsu, Chun-Nan and Lin, Yu-Shi and Klinger, Roman and Friedrich, Christoph and Ganchev, Kuzman and Torii, Manabu and Liu, Hongfang and Haddow, Barry and Struble, Craig and Povinelli, Richard and Vlachos, Andreas and Baumgartner Jr, William and Hunter, Lawrence and Carpenter, Bob and Wilbur, W.},
             year = {2008},
             month = {09},
             pages = {S2},
             title = {Overview of BioCreative II gene mention recognition},
             volume = {9 Suppl 2},
             journal = {Genome biology},
             doi = {10.1186/gb-2008-9-s2-s2}
             }
  • JNLPBA
   @inproceedings{collier-kim-2004-introduction,
               title = "Introduction to the Bio-entity Recognition Task at {JNLPBA}",
               author = "Collier, Nigel  and
                 Kim, Jin-Dong",
               booktitle = "Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications ({NLPBA}/{B}io{NLP})",
               month = aug # " 28th and 29th",
               year = "2004",
               address = "Geneva, Switzerland",
               publisher = "COLING",
               url = "https://aclanthology.org/W04-1213",
               pages = "73--78",
               }
  • NCBI Disease
   @article{10.5555/2772763.2772800,
               author = {Dogan, Rezarta Islamaj and Leaman, Robert and Lu, Zhiyong},
               title = {NCBI Disease Corpus},
               year = {2014},
               issue_date = {February 2014},
               publisher = {Elsevier Science},
               address = {San Diego, CA, USA},
               volume = {47},
               number = {C},
               issn = {1532-0464},
               abstract = {NCBI disease corpus is built as a gold-standard resource for disease recognition. 793 PubMed abstracts are annotated with disease mentions and concepts (MeSH/OMIM). 14 annotators produced high consistency level and inter-annotator agreement. Normalization benchmark results demonstrate the utility of the corpus. The corpus is publicly available to the community. Information encoded in natural language in biomedical literature publications is only useful if efficient and reliable ways of accessing and analyzing that information are available. Natural language processing and text mining tools are therefore essential for extracting valuable information, however, the development of powerful, highly effective tools to automatically detect central biomedical concepts such as diseases is conditional on the availability of annotated corpora. This paper presents the disease name and concept annotations of the NCBI disease corpus, a collection of 793 PubMed abstracts fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community. Each PubMed abstract was manually annotated by two annotators with disease mentions and their corresponding concepts in Medical Subject Headings (MeSH) or Online Mendelian Inheritance in Man (OMIM). Manual curation was performed using PubTator, which allowed the use of pre-annotations as a pre-step to manual annotations. Fourteen annotators were randomly paired and differing annotations were discussed for reaching a consensus in two annotation phases. In this setting, a high inter-annotator agreement was observed. Finally, all results were checked against annotations of the rest of the corpus to assure corpus-wide consistency. The public release of the NCBI disease corpus contains 6892 disease mentions, which are mapped to 790 unique disease concepts. Of these, 88% link to a MeSH identifier, while the rest contain an OMIM identifier. We were able to link 91% of the mentions to a single disease concept, while the rest are described as a combination of concepts. In order to help researchers use the corpus to design and test disease identification methods, we have prepared the corpus as training, testing and development sets. To demonstrate its utility, we conducted a benchmarking experiment where we compared three different knowledge-based disease normalization methods with a best performance in F-measure of 63.7%. These results show that the NCBI disease corpus has the potential to significantly improve the state-of-the-art in disease name recognition and normalization research, by providing a high-quality gold standard thus enabling the development of machine-learning based approaches for such tasks. The NCBI disease corpus, guidelines and other associated resources are available at: http://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/.},
               journal = {J. of Biomedical Informatics},
               month = {feb},
               pages = {1--10},
               numpages = {10}}
  • EBM PICO
  • ChemProt
  • DDI
  • GAD
  • BIOSSES
   @article{souganciouglu2017biosses,
         title={BIOSSES: a semantic sentence similarity estimation system for the biomedical domain},
         author={So{\u{g}}anc{\i}o{\u{g}}lu, Gizem and {\"O}zt{\"u}rk, Hakime and {\"O}zg{\"u}r, Arzucan},
         journal={Bioinformatics},
         volume={33},
         number={14},
         pages={i49--i58},
         year={2017},
         publisher={Oxford University Press}
       }
  • BioASQ
  • PubMedQA
 @inproceedings{jin2019pubmedqa,
                         title={PubMedQA: A Dataset for Biomedical Research Question Answering},
                         author={Jin, Qiao and Dhingra, Bhuwan and Liu, Zhengping and Cohen, William and Lu, Xinghua},
                         booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
                         pages={2567--2577},
                         year={2019}
                       }
  • HoC
   @article{10.1093/bioinformatics/btv585,
       author = {Baker, Simon and Silins, Ilona and Guo, Yufan and Ali, Imran and Högberg, Johan and Stenius, Ulla and Korhonen, Anna},
       title = "{Automatic semantic classification of scientific literature according to the hallmarks of cancer}",
       journal = {Bioinformatics},
       volume = {32},
       number = {3},
       pages = {432-440},
       year = {2015},
       month = {10},
       abstract = "{Motivation: The hallmarks of cancer have become highly influential in cancer research. They reduce the complexity of cancer into 10 principles (e.g. resisting cell death and sustaining proliferative signaling) that explain the biological capabilities acquired during the development of human tumors. Since new research depends crucially on existing knowledge, technology for semantic classification of scientific literature according to the hallmarks of cancer could greatly support literature review, knowledge discovery and applications in cancer research.Results: We present the first step toward the development of such technology. We introduce a corpus of 1499 PubMed abstracts annotated according to the scientific evidence they provide for the 10 currently known hallmarks of cancer. We use this corpus to train a system that classifies PubMed literature according to the hallmarks. The system uses supervised machine learning and rich features largely based on biomedical text mining. We report good performance in both intrinsic and extrinsic evaluations, demonstrating both the accuracy of the methodology and its potential in supporting practical cancer research. We discuss how this approach could be developed and applied further in the future.Availability and implementation: The corpus of hallmark-annotated PubMed abstracts and the software for classification are available at: http://www.cl.cam.ac.uk/∼sb895/HoC.html .Contact:simon.baker@cl.cam.ac.uk}",
       issn = {1367-4803},
       doi = {10.1093/bioinformatics/btv585},
       url = {https://doi.org/10.1093/bioinformatics/btv585},
       eprint = {https://academic.oup.com/bioinformatics/article-pdf/32/3/432/19568147/btv585.pdf},
   }  

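Contributions

  • This dataset has been uploaded and generated by Dr. Jorge Abreu Vicente.
  • Thanks to @GamalC for uploading the NER datasets to GitHub, from where I got them.
  • I am not part of the team that created BLURB. This dataset is intended to help researchers use the BLURB benchmark for biomedical NLP.
  • Thanks to @bwang482 for uploading the BIOSSES dataset to the Hub; we added it to this BLURB benchmark from there.
  • Thanks to @tuner007 for adding this dataset to the Hugging Face Hub.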