Model:
dandelin/vilt-b32-finetuned-coco
Vision-and-Language Transformer (ViLT) model fine-tuned on COCO. It was introduced by Kim et al. in the paper ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision and first released in this repository.
Disclaimer: The team releasing ViLT did not write a model card for this model, so this model card has been written by the Hugging Face team.
You can use this model for image and text retrieval.
Here is how to use this model in PyTorch:
from transformers import ViltProcessor, ViltForImageAndTextRetrieval
import requests
from PIL import Image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["An image of two cats chilling on a couch", "A football player scoring a goal"]
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-coco")
model = ViltForImageAndTextRetrieval.from_pretrained("dandelin/vilt-b32-finetuned-coco")
# forward pass: compute a matching score for each candidate text
scores = dict()
for text in texts:
    # prepare inputs for the current image-text pair
    encoding = processor(image, text, return_tensors="pt")
    outputs = model(**encoding)
    scores[text] = outputs.logits[0, :].item()
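The resulting scores dictionary maps each candidate caption to a matching logit; the text with the highest score is the best match for the image. As a minimal illustrative continuation of the snippet above (this ranking step is not part of the original card):

# rank the candidate texts by their matching score (illustrative continuation)
best_text = max(scores, key=scores.get)
print(f"Best matching caption: {best_text}")
for text, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{score:.2f}  {text}")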
(to be completed)

BibTeX entry and citation info:
@misc{kim2021vilt,
title={ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision},
author={Wonjae Kim and Bokyung Son and Ildoo Kim},
year={2021},
eprint={2102.03334},
archivePrefix={arXiv},
primaryClass={stat.ML}
}