Model:
M-CLIP/M-BERT-Base-ViT-B
To use this model together with the original CLIP vision encoder, you need to download the code and additional linear weights from the Multilingual-CLIP GitHub repository.
Once this is done, you can load and use the model with the following code:
```python
from src import multilingual_clip

model = multilingual_clip.load_model('M-BERT-Base-ViT')
embeddings = model(['Älgen är skogens konung!',
                    'Wie leben Eisbären in der Antarktis?',
                    'Вы знали, что все белые медведи левши?'])
print(embeddings.shape)
# Yields: torch.Size([3, 640])
```
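Since all sentences are projected into the same embedding space, they can be compared directly, for example with cosine similarity. The following is a minimal sketch that builds on the `embeddings` tensor returned above; the pairwise-similarity computation is purely illustrative and not part of the Multilingual-CLIP API:

```python
import torch.nn.functional as F

# `embeddings` is the [3, 640] tensor produced by the snippet above.
# L2-normalise so that dot products equal cosine similarities.
normed = F.normalize(embeddings, p=2, dim=-1)

# Pairwise cosine-similarity matrix between the three sentences.
similarity = normed @ normed.T
print(similarity)  # 3x3 matrix; diagonal entries are 1.0
```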
A multilingual BERT-base model tuned so that its embedding space, for 69 languages, matches the embedding space of the CLIP text encoder that accompanies the ViT-B/32 vision encoder. A full list of the 100 languages used during pre-training can be found here, and a list of the 69 languages used during fine-tuning can be found in SupportedLanguages.md.
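Matching the two embedding spaces is essentially a teacher-student setup: the multilingual text encoder is trained so that its output for a translated caption lands close to the embedding the original CLIP text encoder produces for the English caption. The sketch below illustrates such an objective under assumed components (OpenAI's `clip` package as the frozen teacher, a Hugging Face `bert-base-multilingual-cased` student with a linear projection head); it is not the authors' training code:

```python
import clip
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Teacher: frozen CLIP text encoder accompanying the ViT-B/32 vision encoder.
teacher, _ = clip.load('ViT-B/32', device=device)
teacher.eval()

# Student: multilingual BERT plus a linear layer projecting into CLIP's text space.
tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')
bert = AutoModel.from_pretrained('bert-base-multilingual-cased').to(device)
proj = nn.Linear(bert.config.hidden_size, teacher.text_projection.shape[1]).to(device)

def student_embed(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt').to(device)
    hidden = bert(**batch).last_hidden_state   # [B, T, 768]
    pooled = hidden.mean(dim=1)                # simple mean pooling
    return proj(pooled)                        # [B, embed_dim]

# One illustrative training step on an (English caption, translation) pair.
english = ['A moose standing in a snowy forest.']      # placeholder caption
translated = ['Älgen står i en snöig skog.']           # placeholder translation

with torch.no_grad():
    target = teacher.encode_text(clip.tokenize(english).to(device)).float()

loss = nn.functional.mse_loss(student_embed(translated), target)
loss.backward()  # then update `bert` and `proj` with an optimiser of your choice
```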
Training data pairs were generated by sampling 40k sentences per language from the combined descriptions of GCC + MSCOCO + VizWiz and translating them into the corresponding language. All translation was done with the AWS Translate service; the quality of these translations has not yet been analyzed, but it can be assumed to vary across the 69 languages.
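As an illustration of how such caption-translation pairs could be produced, here is a minimal sketch using boto3's Translate client; the captions, language codes, and region are placeholders, and the authors' actual pipeline is not published here:

```python
import boto3

translate = boto3.client('translate', region_name='us-east-1')

# Placeholder: a few English captions sampled from GCC + MSCOCO + VizWiz.
captions = ['A moose standing in a snowy forest.',
            'Two polar bears playing on the ice.']
target_languages = ['sv', 'de', 'ru']  # a subset of the 69 target languages

pairs = []
for lang in target_languages:
    for caption in captions:
        result = translate.translate_text(
            Text=caption,
            SourceLanguageCode='en',
            TargetLanguageCode=lang,
        )
        pairs.append((caption, result['TranslatedText'], lang))

print(pairs[:3])
```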