Dataset Card for GGPONC2

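The GGPONC project aims to provide a freely distributable corpus of German medical text for natural language processing researchers. Clinical guidelines are particularly well suited to building such a corpus because they contain no protected health information (PHI), which distinguishes them from other kinds of medical text.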

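The second version of the corpus (GGPONC 2.0) consists of 30 German oncology guidelines with 1.87 million tokens. Seven medical students annotated it fully manually at the entity level using the INCEpTION platform, over a period of 6 months and more than 1,200 hours of work. This makes GGPONC 2.0 the largest annotated, freely distributable corpus of German medical text at the moment.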

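The annotated entities are Findings (Diagnosis / Pathology, Other Finding), Substances (Clinical Drug, Nutrients / Body Substances, External Substances), and Procedures (Therapeutic, Diagnostic), as well as Specifications for these entities. In total, the annotators created more than 200,000 entity annotations. In addition, fragment relationships were annotated to explicitly represent elliptical coordinated noun phrases, which are common in German text.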
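The two-level entity scheme described above can be pictured as a small lookup table. The following is only a minimal sketch: the label strings are illustrative English names taken from the description, not necessarily the label strings used in the corpus files, and `ENTITY_TAXONOMY` and `coarse_class` are hypothetical names introduced here for illustration.

```python
# Illustrative sketch of the two-level entity scheme.
# Assumption: label strings are paraphrased from the card text and may
# differ from the actual labels stored in the corpus.
ENTITY_TAXONOMY = {
    "Finding": ["Diagnosis or Pathology", "Other Finding"],
    "Substance": ["Clinical Drug", "Nutrient or Body Substance", "External Substance"],
    "Procedure": ["Therapeutic", "Diagnostic"],
}


def coarse_class(fine_label: str) -> str:
    """Return the top-level class that a fine-grained label belongs to."""
    for coarse, fine_labels in ENTITY_TAXONOMY.items():
        if fine_label in fine_labels:
            return coarse
    raise KeyError(f"unknown label: {fine_label}")


if __name__ == "__main__":
    print(coarse_class("Clinical Drug"))
```

A mapping like this may be convenient when collapsing fine-grained annotations into the three top-level classes, for example to compare a tagger at both levels of granularity.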

Citation Information

@inproceedings{borchert-etal-2022-ggponc,
    title = "{GGPONC} 2.0 - The {G}erman Clinical Guideline Corpus for Oncology: Curation Workflow, Annotation Policy, Baseline {NER} Taggers",
    author = "Borchert, Florian  and
      Lohr, Christina  and
      Modersohn, Luise  and
      Witt, Jonas  and
      Langer, Thomas  and
      Follmann, Markus  and
      Gietzelt, Matthias  and
      Arnrich, Bert  and
      Hahn, Udo  and
      Schapranow, Matthieu-P.",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.389",
    pages = "3650--3660",
}

Authors:

bigbio

Dataset size:

32.61 KB