HoC: Hallmarks of Cancer Corpus for Text Classification
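The Hallmarks of Cancer (HoC) corpus consists of 1852 PubMed publication abstracts that experts annotated manually according to a taxonomy. The taxonomy comprises 37 classes arranged in a hierarchy. Each sentence in the corpus can be assigned one or more class labels. The labels are found under the "labels" directory, and the tokenized text under the "text" directory. Each filename is the corresponding PubMed ID (PMID).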
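The directory layout described above suggests pairing each text file with its label file through the shared PMID. The following is a minimal sketch under stated assumptions: the function name `pair_by_pmid` is hypothetical, the directories are assumed to be named `text` and `labels`, and a file's stem (its name without any extension) is assumed to be the PMID. The internal format of the files is not described here, so the sketch does not parse them.

```python
from pathlib import Path


def pair_by_pmid(corpus_root):
    """Map each PMID to its (text file, label file) pair.

    Assumptions: the directories are called "text" and "labels", and the
    stem of every file name is the PMID.
    """
    root = Path(corpus_root)
    text_files = {p.stem: p for p in (root / "text").iterdir() if p.is_file()}
    label_files = {p.stem: p for p in (root / "labels").iterdir() if p.is_file()}
    # Keep only PMIDs that have both a text file and a label file.
    shared = sorted(text_files.keys() & label_files.keys())
    return {pmid: (text_files[pmid], label_files[pmid]) for pmid in shared}
```

Matching on the stem rather than the full file name keeps the sketch independent of whatever extension the files carry.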
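In addition to the HoC corpus, we also have a dataset that classifies the whole of PubMed according to the HoC taxonomy (Cancer Hallmarks Analytics Tool).
The dataset can be used to train a model for multi-class classification.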
The corpus contains only English-language PubMed articles. It can be loaded with the Hugging Face `datasets` library:
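```python
from datasets import load_dataset

# Download the corpus from the Hugging Face Hub.
dataset = load_dataset("qanastek/HoC")
# Inspect the first instance of the validation split.
validation = dataset["validation"]
print("First element of the validation set : ", validation[0])
```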
```json
{
    "document_id": "12634122_5",
    "text": "Genes that were overexpressed in OM3 included oncogenes , cell cycle regulators , and those involved in signal transduction , whereas genes for DNA repair enzymes and inhibitors of transformation and metastasis were suppressed .",
    "label": [9, 5, 0, 6]
}
```
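- `document_id`: Unique identifier of the document.
- `text`: Raw text of the PubMed abstract.
- `label`: The hallmark(s) of cancer assigned to the text, given as a list of integer class indices. The taxonomy covers the 10 currently known hallmarks of cancer.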
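Because `label` is a list, a record usually has to be converted to a fixed-length vector before it can be fed to a classifier. The sketch below works on the example record shown earlier and makes two assumptions that the card does not confirm: that the suffix of `document_id` is the position of the sentence within the abstract, and that there are 10 class indices (one per hallmark). The helper names `split_document_id` and `to_multi_hot` are hypothetical; the actual number of classes is best read from the features of the loaded dataset.

```python
def split_document_id(document_id):
    """Split an ID such as "12634122_5" at its last underscore.

    Assumption: the prefix is the PMID and the suffix is the position of
    the sentence within the abstract.
    """
    pmid, _, sentence_index = document_id.rpartition("_")
    return pmid, int(sentence_index)


def to_multi_hot(label_ids, num_classes):
    """Turn a list of class indices into a 0/1 vector of length num_classes."""
    vector = [0] * num_classes
    for class_id in label_ids:
        vector[class_id] = 1
    return vector


# Fields copied from the example record.
record = {"document_id": "12634122_5", "label": [9, 5, 0, 6]}
NUM_CLASSES = 10  # assumption: one index per hallmark

print(split_document_id(record["document_id"]))    # ('12634122', 5)
print(to_multi_hot(record["label"], NUM_CLASSES))  # [1, 0, 0, 0, 0, 1, 1, 0, 0, 1]
```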
| Hallmark | Search terms |
|---|---|
| 1. Sustaining proliferative signaling (PS) | Proliferation Receptor Cancer; 'Growth factor' Cancer; 'Cell cycle' Cancer |
| 2. Evading growth suppressors (GS) | 'Cell cycle' Cancer; 'Contact inhibition' |
| 3. Resisting cell death (CD) | Apoptosis Cancer; Necrosis Cancer; Autophagy Cancer |
| 4. Enabling replicative immortality (RI) | Senescence Cancer; Immortalization Cancer |
| 5. Inducing angiogenesis (A) | Angiogenesis Cancer; 'Angiogenic factor' |
| 6. Activating invasion & metastasis (IM) | Metastasis Invasion Cancer |
| 7. Genome instability & mutation (GI) | Mutation Cancer; 'DNA repair' Cancer; Adducts Cancer; 'Strand breaks' Cancer; 'DNA damage' Cancer |
| 8. Tumor-promoting inflammation (TPI) | Inflammation Cancer; 'Oxidative stress' Cancer; Inflammation 'Immune response' Cancer |
| 9. Deregulating cellular energetics (CE) | Glycolysis Cancer; 'Warburg effect' Cancer |
| 10. Avoiding immune destruction (ID) | 'Immune system' Cancer; Immunosuppression Cancer |
Distribution of the data across the 10 hallmarks of cancer:
| Hallmark | No. abstracts | No. sentences |
|---|---|---|
| 1. PS | 462 | 993 |
| 2. GS | 242 | 468 |
| 3. CD | 430 | 883 |
| 4. RI | 115 | 295 |
| 5. A | 143 | 357 |
| 6. IM | 291 | 667 |
| 7. GI | 333 | 771 |
| 8. TPI | 194 | 437 |
| 9. CE | 105 | 213 |
| 10. ID | 108 | 226 |
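The table shows a clear imbalance: PS has more than four times as many annotated sentences as CE. One common way to compensate during training is inverse-frequency class weighting. The sketch below is an illustrative heuristic, not a method taken from the corpus authors; the function name `inverse_frequency_weights` is hypothetical. Since a sentence can carry several labels, the total used here is the number of label assignments rather than the number of distinct sentences.

```python
# Sentence counts per hallmark, copied from the table above.
SENTENCE_COUNTS = {
    "PS": 993, "GS": 468, "CD": 883, "RI": 295, "A": 357,
    "IM": 667, "GI": 771, "TPI": 437, "CE": 213, "ID": 226,
}


def inverse_frequency_weights(counts):
    """Weight each class by total / (num_classes * count).

    Rare classes receive weights above 1, frequent classes below 1.
    """
    total = sum(counts.values())
    num_classes = len(counts)
    return {name: total / (num_classes * count) for name, count in counts.items()}


weights = inverse_frequency_weights(SENTENCE_COUNTS)
# Print the classes from the lowest weight (most frequent) to the highest.
for name, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{name}: {weight:.2f}")
```

The resulting dictionary can be passed, in class-index order, to any loss function that accepts per-class weights.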
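The corpus was produced and uploaded by Baker Simon, Silins Ilona, Guo Yufan, Ali Imran, Hogberg Johan, Stenius Ulla and Korhonen Anna.
The corpus does not contain personal or sensitive information.
- HoC: Baker Simon, Silins Ilona, Guo Yufan, Ali Imran, Hogberg Johan, Stenius Ulla and Korhonen Anna
- Hugging Face: Labrak Yanis (not affiliated with the original corpus)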
GNU General Public License v3.0
- Permissions: Commercial use, Modification, Distribution, Patent use, Private use
- Limitations: Liability, Warranty
- Conditions: License and copyright notice, State changes, Disclose source, Same license
We would very much appreciate it if you cite our publications:
Automatic semantic classification of scientific literature according to the hallmarks of cancer

```bibtex
@article{baker2015automatic,
  title={Automatic semantic classification of scientific literature according to the hallmarks of cancer},
  author={Baker, Simon and Silins, Ilona and Guo, Yufan and Ali, Imran and H{\"o}gberg, Johan and Stenius, Ulla and Korhonen, Anna},
  journal={Bioinformatics},
  volume={32},
  number={3},
  pages={432--440},
  year={2015},
  publisher={Oxford University Press}
}
```
Cancer Hallmarks Analytics Tool (CHAT): a text mining approach to organize and evaluate scientific literature on cancer

```bibtex
@article{baker2017cancer,
  title={Cancer Hallmarks Analytics Tool (CHAT): a text mining approach to organize and evaluate scientific literature on cancer},
  author={Baker, Simon and Ali, Imran and Silins, Ilona and Pyysalo, Sampo and Guo, Yufan and H{\"o}gberg, Johan and Stenius, Ulla and Korhonen, Anna},
  journal={Bioinformatics},
  volume={33},
  number={24},
  pages={3973--3981},
  year={2017},
  publisher={Oxford University Press}
}
```
Cancer hallmark text classification using convolutional neural networks

```bibtex
@article{baker2016cancer,
  title={Cancer hallmark text classification using convolutional neural networks},
  author={Baker, Simon and Korhonen, Anna-Leena and Pyysalo, Sampo},
  year={2016}
}
```
Initializing neural networks for hierarchical multi-label text classification

```bibtex
@article{baker2017initializing,
  title={Initializing neural networks for hierarchical multi-label text classification},
  author={Baker, Simon and Korhonen, Anna},
  journal={BioNLP 2017},
  pages={307--315},
  year={2017}
}
```