- Dataset: scientific_papers
- Task:
- Language:
- Multilinguality: monolingual
- Size: 100K<n<1M
- Language creators: found
- Annotation creators: found
- Source datasets: original
- Paper: arxiv:1804.05685
- License:
The Scientific Papers dataset contains two sets of long, structured documents, collected from the ArXiv and PubMed OpenAccess repositories.

Both the "arxiv" and "pubmed" configurations share the same features: "article", "abstract", and "section_names".

An example from the arxiv "train" split looks as follows (the example was too long and was cropped):
```json
{
    "abstract": "\" we have studied the leptonic decay @xmath0 , via the decay channel @xmath1 , using a sample of tagged @xmath2 decays collected...",
    "article": "\"the leptonic decays of a charged pseudoscalar meson @xmath7 are processes of the type @xmath8 , where @xmath9 , @xmath10 , or @...",
    "section_names": "[sec:introduction]introduction\n[sec:detector]data and the cleo- detector\n[sec:analysys]analysis method\n[sec:conclusion]summary"
}
```
An example from the pubmed "validation" split looks as follows (the example was too long and was cropped):
```json
{
    "abstract": "\" background and aim : there is lack of substantial indian data on venous thromboembolism ( vte ) . \\n the aim of this study was...",
    "article": "\"approximately , one - third of patients with symptomatic vte manifests pe , whereas two - thirds manifest dvt alone .\\nboth dvt...",
    "section_names": "\"Introduction\\nSubjects and Methods\\nResults\\nDemographics and characteristics of venous thromboembolism patients\\nRisk factors ..."
}
```
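Since both configurations share the same record layout, one snippet works for either. A minimal sketch of pulling a record apart, using a shortened stand-in record (the real "article" and "abstract" strings are far longer); note that "section_names" is a single newline-separated string:

```python
# A shortened stand-in record with the same fields as a real arxiv example;
# the actual "article" and "abstract" values are full-length documents.
record = {
    "abstract": "we have studied the leptonic decay ...",
    "article": "the leptonic decays of a charged pseudoscalar meson ...",
    "section_names": "[sec:introduction]introduction\n"
                     "[sec:detector]data and the cleo- detector\n"
                     "[sec:analysys]analysis method\n"
                     "[sec:conclusion]summary",
}

# Split the newline-separated string to recover the list of section names.
sections = record["section_names"].split("\n")
print(len(sections))  # 4 sections in this example
```

The same split works for the pubmed records, whose "section_names" values are also newline-separated.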
The data fields are the same for all splits.
| name | train | validation | test |
|---|---|---|---|
| arxiv | 203037 | 6436 | 6440 | 
| pubmed | 119924 | 6633 | 6658 | 
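The counts above can be checked programmatically. A minimal sketch, assuming the Hugging Face `datasets` library is installed and the data has been downloaded (the download call is therefore commented out); the helper itself only needs a mapping from split names to sized splits:

```python
# Loading one configuration with the Hugging Face `datasets` library would
# look like this (requires the library and a network download):
# from datasets import load_dataset
# ds = load_dataset("scientific_papers", "arxiv")

def split_sizes(dataset_dict):
    """Return the number of examples per split for a DatasetDict-like mapping."""
    return {name: len(split) for name, split in dataset_dict.items()}

# For the real "arxiv" configuration, split_sizes(ds) should match the table:
# {"train": 203037, "validation": 6436, "test": 6440}
```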
```
@article{Cohan_2018,
   title={A Discourse-Aware Attention Model for Abstractive Summarization of
            Long Documents},
   url={http://dx.doi.org/10.18653/v1/n18-2097},
   DOI={10.18653/v1/n18-2097},
   journal={Proceedings of the 2018 Conference of the North American Chapter of
          the Association for Computational Linguistics: Human Language
          Technologies, Volume 2 (Short Papers)},
   publisher={Association for Computational Linguistics},
   author={Cohan, Arman and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Kim, Seokhwan and Chang, Walter and Goharian, Nazli},
   year={2018}
}
```
Thanks to @thomwolf, @jplu, @lewtun and @patrickvonplaten for adding this dataset.