Vision-and-Language Transformer (ViLT), pre-trained only

Vision-and-Language Transformer (ViLT) model pre-trained on GCC+SBU+COCO+VG (200k steps). It was introduced in the paper ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision by Kim et al. and first released in this repository. Note: this model only includes the language modeling head.

Disclaimer: The team releasing ViLT did not write a model card for this model, so this model card has been written by the Hugging Face team.

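Intended uses & limitations

You can use the raw model for masked language modeling given an image and a piece of text.

How to use

Here is how to use this model in PyTorch: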

from transformers import ViltProcessor, ViltForMaskedLM
import requests
from PIL import Image
import re
import torch

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "a bunch of [MASK] laying on a [MASK]."

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-mlm")
model = ViltForMaskedLM.from_pretrained("dandelin/vilt-b32-mlm")

# prepare inputs; the image features are reused at every decoding step
encoding = processor(image, text, return_tensors="pt")
pixel_values = encoding.pixel_values

# id of the [MASK] token in the tokenizer vocabulary
mask_token_id = processor.tokenizer.mask_token_id

# number of [MASK] tokens to fill
tl = len(re.findall(r"\[MASK\]", text))
inferred_token = [text]

# gradually fill in the MASK tokens, one by one
with torch.no_grad():
    for i in range(tl):
        encoded = processor.tokenizer(inferred_token)
        input_ids = torch.tensor(encoded.input_ids)
        # drop the CLS and SEP tokens from the working copy of the ids
        encoded = encoded["input_ids"][0][1:-1]
        outputs = model(input_ids=input_ids, pixel_values=pixel_values)
        mlm_logits = outputs.logits[0]  # shape (seq_len, vocab_size)
        # only take into account text features (minus CLS and SEP token)
        mlm_logits = mlm_logits[1 : input_ids.shape[1] - 1, :]
        mlm_values, mlm_ids = mlm_logits.softmax(dim=-1).max(dim=-1)
        # zero out the confidence of positions that are not [MASK]
        mlm_values[torch.tensor(encoded) != mask_token_id] = 0
        # fill the masked position the model is most confident about
        select = mlm_values.argmax().item()
        encoded[select] = mlm_ids[select].item()
        inferred_token = [processor.decode(encoded)]

# decode the completed sentence
encoded = processor.tokenizer(inferred_token)
print(processor.decode(encoded.input_ids[0], skip_special_tokens=True))
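
The loop above fills one [MASK] per step, always choosing the masked position where the model is most confident. The sketch below isolates that decoding strategy in plain Python so it can be read and run without downloading the checkpoint. The names `fill_masks_greedily`, `predict_fn` and `toy_predict` are hypothetical and are not part of transformers; `predict_fn` stands in for the model forward pass and is assumed to return one row of vocabulary scores per position. Unlike the snippet above, this sketch never selects the mask id itself as a prediction, which guarantees that the loop terminates.

```python
import math


def softmax(row):
    """Turn a row of raw scores into probabilities."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def fill_masks_greedily(token_ids, mask_id, predict_fn):
    """Fill every mask_id in token_ids, most confident position first."""
    ids = list(token_ids)
    while mask_id in ids:
        logits = predict_fn(ids)  # one row of vocabulary scores per position
        best_pos, best_tok, best_prob = None, None, -1.0
        for pos, row in enumerate(logits):
            if ids[pos] != mask_id:
                continue  # only masked positions are candidates
            probs = softmax(row)
            # assumption of this sketch: the mask id is never a valid prediction
            candidates = [t for t in range(len(probs)) if t != mask_id]
            tok = max(candidates, key=probs.__getitem__)
            if probs[tok] > best_prob:
                best_pos, best_tok, best_prob = pos, tok, probs[tok]
        ids[best_pos] = best_tok  # commit one token, then re-run the model
    return ids


MASK = 0


def toy_predict(ids):
    # Hypothetical stand-in for the model over a 5-token vocabulary:
    # position 3 is always confidently token 4; position 1 prefers
    # token 2 once position 3 has been filled, otherwise weakly token 1.
    logits = [[0.0] * 5 for _ in ids]
    logits[3][4] = 5.0
    if ids[3] == 4:
        logits[1][2] = 3.0
    else:
        logits[1][1] = 0.5
    return logits


result = fill_masks_greedily([3, MASK, 3, MASK], MASK, toy_predict)
print(result)
```

With the toy scores, position 3 is filled first, and the prediction for position 1 changes once that context is available. This is the reason for re-running the model after every filled token instead of filling all masks from a single forward pass.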

Training data

(to do)

Training procedure

Preprocessing

(to do)

Pretraining

(to do)

Evaluation results

(to do)

BibTeX entry and citation info

@misc{kim2021vilt,
      title={ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision}, 
      author={Wonjae Kim and Bokyung Son and Ildoo Kim},
      year={2021},
      eprint={2102.03334},
      archivePrefix={arXiv},
      primaryClass={stat.ML}
}

Author:

Wonjae Kim

Dataset size:

518.24 MB