Document Image Transformer (large-sized model)

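The Document Image Transformer (DiT) model was pre-trained on IIT-CDIP (Lewis et al., 2006), a dataset that includes 42 million document images. It was introduced in the paper DiT: Self-supervised Pre-training for Document Image Transformer by Li et al. and first released in this repository. Note that DiT's architecture is identical to that of BEiT.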

Disclaimer: The team releasing DiT did not write a model card for this model, so this model card has been written by the Hugging Face team.

Model description

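The Document Image Transformer (DiT) is a transformer encoder model (BERT-like) pre-trained on a large collection of images in a self-supervised fashion. The pre-training objective is to predict visual tokens from the encoder of a discrete VAE (dVAE), based on masked patches.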
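To make the objective concrete, here is a minimal sketch in plain Python of a masked visual-token loss: cross-entropy between the model's per-patch scores and the tokenizer's visual tokens, averaged over the masked patches only. The function name `masked_token_loss`, the toy vocabulary of 3 tokens, and the 4-patch input are assumptions for illustration; they are not the released training code.

```python
import math
import random


def masked_token_loss(logits, target_tokens, masked):
    """Mean cross-entropy over masked patches only.

    logits: one list of vocabulary scores per patch
    target_tokens: one visual-token id per patch (from the dVAE encoder)
    masked: one bool per patch, True where the patch was masked
    """
    losses = []
    for scores, target, is_masked in zip(logits, target_tokens, masked):
        if not is_masked:
            continue  # unmasked patches do not contribute to the loss
        # Numerically stable log-sum-exp, then negative log-probability
        # of the target token.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        losses.append(log_norm - scores[target])
    if not losses:
        raise ValueError("at least one patch must be masked")
    return sum(losses) / len(losses)


# Toy setup (hypothetical sizes): 4 patches, vocabulary of 3 visual tokens.
random.seed(0)
vocab_size, num_patches = 3, 4
logits = [[random.gauss(0, 1) for _ in range(vocab_size)] for _ in range(num_patches)]
targets = [0, 2, 1, 1]
masked = [True, False, True, False]
print(masked_token_loss(logits, targets, masked))
```

With uniform scores the loss equals the log of the vocabulary size, which is a handy sanity check when wiring up a real training loop.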

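Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. Absolute position embeddings are also added before the sequence is fed to the layers of the Transformer encoder.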
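As a quick illustration of the patch arithmetic, the sketch below computes how many 16x16 patches a square input yields. The helper name `patch_grid` and the 224-pixel input resolution are assumptions for the example; the actual resolution should be read from `model.config.image_size`.

```python
def patch_grid(image_size: int, patch_size: int = 16) -> tuple[int, int]:
    """Return (patches_per_side, total_patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


# 224 is an assumed input resolution, not a value stated in this card.
per_side, total = patch_grid(224, 16)
print(per_side, total)  # 14 196
```

Each of these patches becomes one position in the input sequence, so the sequence length grows quadratically with the image side.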

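Through pre-training, the model learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled document images, for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder.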
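One way to set up such a classifier is sketched below, assuming the `BeitForImageClassification` class from `transformers`, which places a linear head on top of the encoder. The helper names `label_maps` and `build_classifier` and the example class names are hypothetical. The classification head is newly initialized, so the returned model still has to be fine-tuned on labeled data before its predictions mean anything.

```python
def label_maps(class_names):
    """Build the id2label / label2id dictionaries expected by the config."""
    id2label = dict(enumerate(class_names))
    label2id = {name: idx for idx, name in id2label.items()}
    return id2label, label2id


def build_classifier(class_names, checkpoint="microsoft/dit-large"):
    """Load the pre-trained encoder with an untrained linear head on top."""
    # Imported lazily so label_maps works without transformers installed.
    from transformers import BeitForImageClassification

    id2label, label2id = label_maps(class_names)
    return BeitForImageClassification.from_pretrained(
        checkpoint,
        num_labels=len(class_names),
        id2label=id2label,
        label2id=label2id,
    )


# Hypothetical label set, for illustration only.
print(label_maps(["letter", "memo", "invoice"]))
# model = build_classifier(["letter", "memo", "invoice"])  # downloads weights
```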

Intended uses & limitations

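You can use the raw model to encode document images into a vector space, but it is mostly meant to be fine-tuned on tasks such as document image classification, table detection, or document layout analysis. See the model hub to look for fine-tuned versions on a task that interests you.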
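For the raw-encoding use case, a minimal sketch might look like the following. It assumes the checkpoint can be loaded into `BeitModel` (the bare encoder without a task head) and pools by averaging the patch tokens; the pooling choice and the helper name `embed_document` are assumptions for illustration, not something this card prescribes.

```python
def embed_document(image_path, checkpoint="microsoft/dit-large"):
    """Return one feature vector for a document image (sketch)."""
    # Imported lazily so that defining the function has no heavy dependencies.
    import torch
    from PIL import Image
    from transformers import BeitImageProcessor, BeitModel

    processor = BeitImageProcessor.from_pretrained(checkpoint)
    model = BeitModel.from_pretrained(checkpoint)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # last_hidden_state has shape (1, 1 + num_patches, hidden_size);
    # position 0 is the [CLS] token, the rest are patch tokens.
    patch_states = outputs.last_hidden_state[:, 1:, :]
    # Assumed pooling: average over patch tokens.
    return patch_states.mean(dim=1).squeeze(0)


# vector = embed_document("path_to_your_document_image")  # downloads weights
```

Vectors produced this way can be compared with cosine similarity or fed to a lightweight downstream model, though fine-tuning is generally the intended route.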

How to use

Here is how to use this model in PyTorch:

from transformers import BeitImageProcessor, BeitForMaskedImageModeling
import torch
from PIL import Image

image = Image.open('path_to_your_document_image').convert('RGB')

processor = BeitImageProcessor.from_pretrained("microsoft/dit-large")
model = BeitForMaskedImageModeling.from_pretrained("microsoft/dit-large")

num_patches = (model.config.image_size // model.config.patch_size) ** 2
pixel_values = processor(images=image, return_tensors="pt").pixel_values
# create random boolean mask of shape (batch_size, num_patches)
bool_masked_pos = torch.randint(low=0, high=2, size=(1, num_patches)).bool()

outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
loss, logits = outputs.loss, outputs.logits

BibTeX entry and citation info

@article{Lewis2006BuildingAT,
  title={Building a test collection for complex document information processing},
  author={David D. Lewis and Gady Agam and Shlomo Engelson Argamon and Ophir Frieder and David A. Grossman and Jefferson Heard},
  journal={Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval},
  year={2006}
}

Author:

Microsoft

Dataset size:

1.16 GB