Document Image Transformer (large-sized model)

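The Document Image Transformer (DiT) model was pre-trained on IIT-CDIP (Lewis et al., 2006), a dataset of 42 million document images, and fine-tuned on RVL-CDIP, a dataset of 400,000 grayscale images in 16 classes, with 25,000 images per class. It was introduced in a paper by Li et al. and first released in this repository. Note that the architecture of DiT is identical to that of BEiT.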

Disclaimer: The team releasing DiT did not write a model card for this model, so this model card has been written by the Hugging Face team.

Model description

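The Document Image Transformer (DiT) is a Transformer encoder model (BERT-like) pre-trained on a large collection of images in a self-supervised fashion. The pre-training objective is to predict, for each masked patch, the visual token that the encoder of a discrete VAE (dVAE) assigns to that patch.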
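
The objective above can be illustrated with a toy NumPy sketch. This is not the DiT training code: the patch count (196), vocabulary size (8192), and 40% masking ratio are assumptions chosen for illustration, the dVAE tokens and model logits are random stand-ins, and `masked_token_loss` is a hypothetical helper.

```python
import numpy as np

def masked_token_loss(logits, target_tokens, mask):
    """Mean cross-entropy over masked patches only.

    logits:        (num_patches, vocab_size) model scores per patch
    target_tokens: (num_patches,) visual token ids from the dVAE tokenizer
    mask:          (num_patches,) bool, True where the patch was masked
    """
    # numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # log-probability assigned to the correct visual token of each patch
    picked = log_probs[np.arange(len(target_tokens)), target_tokens]
    # only masked positions contribute to the objective
    return float(-picked[mask].mean())

rng = np.random.default_rng(0)
num_patches, vocab_size = 196, 8192  # assumed sizes, for illustration only
target_tokens = rng.integers(0, vocab_size, size=num_patches)  # stand-in for dVAE output
mask = np.zeros(num_patches, dtype=bool)
masked_ids = rng.choice(num_patches, size=int(0.4 * num_patches), replace=False)
mask[masked_ids] = True  # assumed masking ratio
logits = rng.normal(size=(num_patches, vocab_size))  # stand-in for model predictions

loss = masked_token_loss(logits, target_tokens, mask)
```

Unmasked patches are excluded from the loss, so the model is only scored on content it could not see directly.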

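Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. Absolute position embeddings are also added before the sequence is fed to the layers of the Transformer encoder.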
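
A minimal NumPy sketch of this input pipeline follows. The 224x224 input resolution and the hidden size of 1024 are assumptions not stated above, the projection matrix and position embeddings are random stand-ins for learned parameters, and `patchify` is a hypothetical helper.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (gh * gw, P * P * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))  # assumed input resolution
hidden_size = 1024                 # assumed embedding width

patches = patchify(image)          # one row per 16x16 patch
projection = rng.normal(scale=0.02, size=(patches.shape[1], hidden_size))
position_embeddings = rng.normal(scale=0.02, size=(patches.shape[0], hidden_size))

# linear embedding of each patch, plus one absolute position vector per patch
tokens = patches @ projection + position_embeddings
```

Under these assumptions the image yields 14 x 14 = 196 patches of 16 x 16 x 3 = 768 values each, and `tokens` is the sequence the encoder layers would receive.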

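Through pre-training, the model learns an inner representation of images that can then be used to extract features useful for downstream tasks. For instance, if you have a dataset of labeled document images, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder.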
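
The linear-classifier idea can be sketched with NumPy as follows. Random vectors stand in for features from a frozen pre-trained encoder; the feature width, sample count, learning rate, and the `train_linear_probe` helper are all illustrative assumptions, and a real setup would more likely use PyTorch with actual encoder outputs.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_linear_probe(features, labels, num_classes, lr=0.1, steps=200):
    """Fit a single linear layer on fixed features with gradient descent."""
    n, d = features.shape
    weights = np.zeros((d, num_classes))
    bias = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]
    losses = []
    for _ in range(steps):
        probs = softmax(features @ weights + bias)
        losses.append(float(-np.log(probs[np.arange(n), labels]).mean()))
        grad_logits = (probs - one_hot) / n  # gradient of the mean cross-entropy
        weights -= lr * features.T @ grad_logits
        bias -= lr * grad_logits.sum(axis=0)
    return weights, bias, losses

rng = np.random.default_rng(0)
features = rng.normal(size=(64, 32))   # stand-in for frozen encoder outputs
labels = rng.integers(0, 16, size=64)  # 16 classes, as in RVL-CDIP
weights, bias, losses = train_linear_probe(features, labels, num_classes=16)
```

Only `weights` and `bias` are updated here; the encoder that would produce `features` stays untouched, which is what distinguishes this setup from full fine-tuning.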

Intended uses & limitations

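You can use the raw model to encode document images into a vector space, but it is mostly meant to be fine-tuned on tasks such as document image classification, table detection, or document layout analysis. See the model hub to look for fine-tuned versions on a task that interests you.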

How to use

Here is how to use this model in PyTorch:

from transformers import AutoImageProcessor, AutoModelForImageClassification
import torch
from PIL import Image

image = Image.open('path_to_your_document_image').convert('RGB')

processor = AutoImageProcessor.from_pretrained("microsoft/dit-large-finetuned-rvlcdip")
model = AutoModelForImageClassification.from_pretrained("microsoft/dit-large-finetuned-rvlcdip")

inputs = processor(images=image, return_tensors="pt")
# inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits

# model predicts one of the 16 RVL-CDIP classes
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
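
To inspect more than the single best prediction, the logits can be converted to probabilities with a softmax. The sketch below uses only the standard library; `top_k_predictions` is a hypothetical helper, and the example logits and labels are placeholders rather than the model's actual output or `id2label` mapping.

```python
import math

def top_k_predictions(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs."""
    # numerically stable softmax over a flat list of logits
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # indices sorted from most to least probable
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]

# placeholder inputs; with the model above one could pass
# logits[0].tolist() and model.config.id2label instead
example = top_k_predictions([2.0, 0.5, 1.0], {0: "letter", 1: "form", 2: "email"}, k=2)
```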

BibTeX entry and citation info

@article{Lewis2006BuildingAT,
  title={Building a test collection for complex document information processing},
  author={David D. Lewis and Gady Agam and Shlomo Engelson Argamon and Ophir Frieder and David A. Grossman and Jefferson Heard},
  journal={Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval},
  year={2006}
}

Author:

Microsoft

Dataset size:

1.13 GB