Dataset:
ds4sd/DocLayNet
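DocLayNet provides page-by-page layout segmentation ground truth for 80,863 unique pages, using bounding boxes for 11 distinct class labels. Compared with related work such as PubLayNet or DocBank, it offers several unique features:
We organized the ICDAR 2023 competition based on the DocLayNet dataset. For more information, see https://ds4sd.github.io/icdar23-doclaynet/ .
DocLayNet provides four types of data assets:
An example definition of a COCO image record is shown below: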
...
{
"id": 1,
"width": 1025,
"height": 1025,
"file_name": "132a855ee8b23533d8ae69af0049c038171a06ddfcac892c3c6d7e6b4091c642.png",
// Custom fields:
"doc_category": "financial_reports", // high-level document category
"collection": "ann_reports_00_04_fancy", // sub-collection name
"doc_name": "NASDAQ_FFIN_2002.pdf", // original document filename
"page_no": 9, // page number in original document
"precedence": 0, // Annotation order, non-zero in case of redundant double- or triple-annotation
},
...
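The `precedence` field in the record above can be used to separate primary annotations from redundant ones. The following is a minimal sketch, not the dataset's official tooling. It assumes that `precedence` 0 marks the annotation to treat as primary and that redundant annotations of the same page appear as separate image records. The sample records and file names are hypothetical; only the field names come from the example.

```python
from collections import defaultdict

# Hypothetical image records; only the field names come from the example record.
images = [
    {"id": 1, "file_name": "page_a.png", "doc_name": "NASDAQ_FFIN_2002.pdf",
     "page_no": 9, "precedence": 0},
    {"id": 2, "file_name": "page_b.png", "doc_name": "NASDAQ_FFIN_2002.pdf",
     "page_no": 9, "precedence": 1},
    {"id": 3, "file_name": "page_c.png", "doc_name": "NASDAQ_FFIN_2002.pdf",
     "page_no": 10, "precedence": 0},
]


def primary_images(records):
    """Return records whose precedence is 0 (assumed here to be the primary annotation)."""
    return [r for r in records if r["precedence"] == 0]


def group_by_page(records):
    """Group records by (doc_name, page_no) so redundant annotations sit together."""
    pages = defaultdict(list)
    for r in records:
        pages[(r["doc_name"], r["page_no"])].append(r)
    return dict(pages)


primary = primary_images(images)
# Pages that carry more than one annotation record.
redundant = {page: recs for page, recs in group_by_page(images).items() if len(recs) > 1}
```

Keeping only the primary records avoids counting a page twice during training, while the grouped view may be useful for studying inter-annotator agreement.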
The doc_category field uses one of the following constants:
financial_reports, scientific_articles, laws_and_regulations, government_tenders, manuals, patents
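As an illustration of how these constants might be used, the sketch below counts image records per category and rejects values outside the listed set. The helper name `count_by_category` and the sample records are hypothetical; the six constants are the ones listed above.

```python
from collections import Counter

# The six doc_category constants listed in this card.
DOC_CATEGORIES = frozenset({
    "financial_reports",
    "scientific_articles",
    "laws_and_regulations",
    "government_tenders",
    "manuals",
    "patents",
})


def count_by_category(records):
    """Count records per doc_category, raising ValueError on an unknown value."""
    counts = Counter()
    for record in records:
        category = record["doc_category"]
        if category not in DOC_CATEGORIES:
            raise ValueError(f"unknown doc_category: {category!r}")
        counts[category] += 1
    return counts


# Hypothetical sample records for demonstration only.
sample = [
    {"id": 1, "doc_category": "financial_reports"},
    {"id": 2, "doc_category": "patents"},
    {"id": 3, "doc_category": "financial_reports"},
]
counts = count_by_category(sample)
```

Failing loudly on an unknown category is a design choice for this sketch; a more lenient loader could log and skip such records instead.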
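The dataset provides three splits:
The labeling guide used to train the annotation experts is available in DocLayNet_Labeling_Guide_Public.pdf.
Who are the annotators? The annotations were crowdsourced.
The dataset is curated by the Deep Search team at IBM Research. You can contact us at deepsearch-core@zurich.ibm.com.
Curators:
License: CDLA-Permissive-1.0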
@article{doclaynet2022,
title = {DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation},
doi = {10.1145/3534678.3539043},
url = {https://doi.org/10.1145/3534678.3539043},
author = {Pfitzmann, Birgit and Auer, Christoph and Dolfi, Michele and Nassar, Ahmed S and Staar, Peter W J},
year = {2022},
isbn = {9781450393850},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
booktitle = {Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
pages = {3743–3751},
numpages = {9},
location = {Washington DC, USA},
series = {KDD '22}
}
Thanks to @dolfim-ibm and @cau-git for adding this dataset.