Model: timm/maxvit_large_tf_224.in1k

Model card for maxvit_large_tf_224.in1k

An official MaxViT image classification model. Trained in TensorFlow on ImageNet-1k by the paper authors, and ported from the official TensorFlow implementation (https://github.com/google-research/maxvit) to PyTorch by Ross Wightman.

Model Variants in maxxvit.py

MaxxViT covers a number of related model architectures that share a common structure, including:

  • CoAtNet - blends MBConv (depthwise-separable) convolution blocks in the early stages with self-attention transformer blocks in the later stages.
  • MaxViT - uses uniform blocks across all stages; each block contains an MBConv (depthwise-separable) convolution block followed by two self-attention blocks with different partitioning schemes (window attention followed by grid attention); see the sketch after this list.
  • CoAtNeXt - a timm-specific architecture that replaces the MBConv blocks in CoAtNet with ConvNeXt blocks. All normalization layers are LayerNorm (no BatchNorm).
  • MaxxViT - a timm-specific architecture that replaces the MBConv blocks in MaxViT with ConvNeXt blocks. All normalization layers are LayerNorm (no BatchNorm).
  • MaxxViT-V2 - a MaxxViT variant that removes the window-partition self-attention blocks, leaving only the ConvNeXt block and grid attention, with increased width to compensate.
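
To make the two partitioning schemes in the MaxViT block concrete, below is a minimal sketch of window vs. grid partitioning using plain tensor reshapes. It is an illustration only, not the timm implementation; the partition size of 7 and the dummy 14 x 14 x 64 input are assumptions.

import torch

def window_partition(x, p=7):
    # Window (block) attention: split the H x W map into non-overlapping p x p
    # windows; attention then runs within each window (local interaction).
    B, H, W, C = x.shape
    x = x.view(B, H // p, p, W // p, p, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, p * p, C)

def grid_partition(x, g=7):
    # Grid attention: form a sparse g x g grid by striding across the map;
    # attention then mixes tokens from across the whole image (global interaction).
    B, H, W, C = x.shape
    x = x.view(B, g, H // g, g, W // g, C)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, g * g, C)

x = torch.randn(1, 14, 14, 64)    # e.g. a feature map after the MBConv block
print(window_partition(x).shape)  # torch.Size([4, 49, 64])
print(grid_partition(x).shape)    # torch.Size([4, 49, 64])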

Aside from the major variants listed above, there are more subtle changes from model to model. Any model name containing the string rw is a timm-specific configuration, with modelling adjustments made to favour PyTorch eager use. These were created while training initial reproductions of the models, so there are some variations. All models containing the string tf exactly match the original TensorFlow-based models by the paper authors, with weights ported to PyTorch. This covers a number of MaxViT models. The official CoAtNet models were never released.
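
One way to see which of these variants ship with pretrained weights is timm.list_models with a wildcard filter (the filter patterns below are only examples):

import timm

# models whose names contain "tf" are ports of the official TensorFlow weights;
# names containing "rw" are timm-specific configurations
print(timm.list_models('maxvit*tf*', pretrained=True))
print(timm.list_models('maxvit*rw*', pretrained=True))
print(timm.list_models('coatnet*', pretrained=True))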

Model Details

  • Model Type: Image classification / feature backbone
  • Model Stats:
    • Params (M): 211.8
    • GMACs: 43.7
    • Activations (M): 127.3
    • Image size: 224 x 224
  • Papers:
    • MaxViT: Multi-Axis Vision Transformer: https://arxiv.org/abs/2204.01697
  • Dataset: ImageNet-1k

Model Usage

Image Classification

from urllib.request import urlopen
from PIL import Image
import torch  # needed for torch.topk below
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('maxvit_large_tf_224.in1k', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
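
To turn the top-5 class indices into human-readable labels (not covered by the original card), one option is the standard ImageNet-1k label list published with the PyTorch hub examples; continuing from the snippet above:

# map the top-5 class indices to ImageNet-1k label strings
# (the label file URL is an assumption; any ImageNet-1k label list will do)
labels = urlopen(
    'https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt'
).read().decode('utf-8').splitlines()

for prob, idx in zip(top5_probabilities[0], top5_class_indices[0]):
    print(f'{labels[idx.item()]}: {prob.item():.2f}%')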

Feature Map Extraction

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'maxvit_large_tf_224.in1k',
    pretrained=True,
    features_only=True,
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

for o in output:
    # print shape of each feature map in output
    # e.g.:
    #  torch.Size([1, 128, 112, 112])
    #  torch.Size([1, 128, 56, 56])
    #  torch.Size([1, 256, 28, 28])
    #  torch.Size([1, 512, 14, 14])
    #  torch.Size([1, 1024, 7, 7])

    print(o.shape)
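
The stride and channel count of each returned feature map can also be queried from the model's feature_info attribute (part of timm's features_only API); the values shown in the comments follow from the shapes printed above:

# inspect the reduction factor (stride) and channel count of each feature map
print(model.feature_info.reduction())  # [2, 4, 8, 16, 32]
print(model.feature_info.channels())   # [128, 128, 256, 512, 1024]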

Image Embeddings

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'maxvit_large_tf_224.in1k',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 1024, 7, 7) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
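
As a small usage sketch that is not part of the original card, the pooled embeddings can be compared with cosine similarity; a real use would load a second image (here the same img is reused for illustration):

import torch.nn.functional as F

emb1 = model(transforms(img).unsqueeze(0))  # (1, num_features) pooled embedding
emb2 = model(transforms(img).unsqueeze(0))  # substitute a second image here
print(F.cosine_similarity(emb1, emb2))      # tensor([1.]) when both inputs are identical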

Model Comparison

By Top-1

model top1 top5 samples / sec Params (M) GMAC Act (M)
1239321 88.53 98.64 21.76 475.77 534.14 1413.22
12310321 88.32 98.54 42.53 475.32 292.78 668.76
12311321 88.20 98.53 50.87 119.88 138.02 703.99
12312321 88.04 98.40 36.42 212.33 244.75 942.15
12313321 87.98 98.56 71.75 212.03 132.55 445.84
12314321 87.92 98.54 104.71 119.65 73.80 332.90
12315321 87.81 98.37 106.55 116.14 70.97 318.95
12316321 87.47 98.37 149.49 116.09 72.98 213.74
12317321 87.39 98.31 160.80 73.88 47.69 209.43
12318321 86.89 98.02 375.86 116.14 23.15 92.64
12319321 86.64 98.02 501.03 116.09 24.20 62.77
12320321 86.60 97.92 50.75 119.88 138.02 703.99
12321321 86.57 97.89 631.88 73.87 15.09 49.22
12322321 86.52 97.88 36.04 212.33 244.75 942.15
12323321 86.49 97.90 620.58 73.88 15.18 54.78
12324321 86.29 97.80 101.09 119.65 73.80 332.90
12325321 86.23 97.69 70.56 212.03 132.55 445.84
12326321 86.10 97.76 88.63 69.13 67.26 383.77
12327321 85.67 97.58 144.25 31.05 33.49 257.59
12328321 85.54 97.46 188.35 69.02 35.87 183.65
12329321 85.11 97.38 293.46 30.98 17.53 123.42
12330321 84.93 96.97 247.71 211.79 43.68 127.35
12331321 84.90 96.96 1025.45 41.72 8.11 40.13
12332321 84.85 96.99 358.25 119.47 24.04 95.01
12333321 84.63 97.06 575.53 66.01 14.67 58.38
12334321 84.61 96.74 625.81 73.88 15.18 54.78
12335321 84.49 96.76 693.82 64.90 10.75 49.30
12336321 84.43 96.83 647.96 68.93 11.66 53.17
12337321 84.23 96.78 807.21 29.15 6.77 46.92
12338321 83.62 96.38 989.59 41.72 8.04 34.60
12339321 83.50 96.50 1100.53 29.06 5.11 33.11
12340321 83.41 96.59 1004.94 30.92 5.60 35.78
12341321 83.36 96.45 1093.03 41.69 7.85 35.47
12342321 83.11 96.33 1276.88 23.70 6.26 23.05
12343321 83.03 96.34 1341.24 16.78 4.37 26.05
12344321 82.96 96.26 1283.24 15.50 4.47 31.92
12345321 82.93 96.23 1218.17 15.45 4.46 30.28
12346321 82.39 96.19 1600.14 27.44 4.67 22.04
12347321 82.39 95.84 1831.21 27.44 4.43 18.73
12348321 82.05 95.87 2109.09 15.15 2.62 20.34
12349321 81.95 95.92 2525.52 14.70 2.47 12.80
12350321 81.70 95.64 2344.52 15.14 2.41 15.41
12351321 80.53 95.21 1594.71 7.52 1.85 24.86

By Throughput (samples / sec)

model top1 top5 samples / sec Params (M) GMAC Act (M)
12349321 81.95 95.92 2525.52 14.70 2.47 12.80
12350321 81.70 95.64 2344.52 15.14 2.41 15.41
12348321 82.05 95.87 2109.09 15.15 2.62 20.34
12347321 82.39 95.84 1831.21 27.44 4.43 18.73
12346321 82.39 96.19 1600.14 27.44 4.67 22.04
12351321 80.53 95.21 1594.71 7.52 1.85 24.86
12343321 83.03 96.34 1341.24 16.78 4.37 26.05
12344321 82.96 96.26 1283.24 15.50 4.47 31.92
12342321 83.11 96.33 1276.88 23.70 6.26 23.05
12345321 82.93 96.23 1218.17 15.45 4.46 30.28
12339321 83.50 96.50 1100.53 29.06 5.11 33.11
12341321 83.36 96.45 1093.03 41.69 7.85 35.47
12331321 84.90 96.96 1025.45 41.72 8.11 40.13
12340321 83.41 96.59 1004.94 30.92 5.60 35.78
12338321 83.62 96.38 989.59 41.72 8.04 34.60
12337321 84.23 96.78 807.21 29.15 6.77 46.92
12335321 84.49 96.76 693.82 64.90 10.75 49.30
12336321 84.43 96.83 647.96 68.93 11.66 53.17
12321321 86.57 97.89 631.88 73.87 15.09 49.22
12334321 84.61 96.74 625.81 73.88 15.18 54.78
12323321 86.49 97.90 620.58 73.88 15.18 54.78
12333321 84.63 97.06 575.53 66.01 14.67 58.38
12319321 86.64 98.02 501.03 116.09 24.20 62.77
12318321 86.89 98.02 375.86 116.14 23.15 92.64
12332321 84.85 96.99 358.25 119.47 24.04 95.01
12329321 85.11 97.38 293.46 30.98 17.53 123.42
12330321 84.93 96.97 247.71 211.79 43.68 127.35
12328321 85.54 97.46 188.35 69.02 35.87 183.65
12317321 87.39 98.31 160.80 73.88 47.69 209.43
12316321 87.47 98.37 149.49 116.09 72.98 213.74
12327321 85.67 97.58 144.25 31.05 33.49 257.59
12315321 87.81 98.37 106.55 116.14 70.97 318.95
12314321 87.92 98.54 104.71 119.65 73.80 332.90
12324321 86.29 97.80 101.09 119.65 73.80 332.90
12326321 86.10 97.76 88.63 69.13 67.26 383.77
12313321 87.98 98.56 71.75 212.03 132.55 445.84
12325321 86.23 97.69 70.56 212.03 132.55 445.84
12311321 88.20 98.53 50.87 119.88 138.02 703.99
12320321 86.60 97.92 50.75 119.88 138.02 703.99
12310321 88.32 98.54 42.53 475.32 292.78 668.76
12312321 88.04 98.40 36.42 212.33 244.75 942.15
12322321 86.52 97.88 36.04 212.33 244.75 942.15
1239321 88.53 98.64 21.76 475.77 534.14 1413.22

Citation

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}
@article{tu2022maxvit,
  title={MaxViT: Multi-Axis Vision Transformer},
  author={Tu, Zhengzhong and Talebi, Hossein and Zhang, Han and Yang, Feng and Milanfar, Peyman and Bovik, Alan and Li, Yinxiao},
  journal={ECCV},
  year={2022},
}
@article{dai2021coatnet,
  title={CoAtNet: Marrying Convolution and Attention for All Data Sizes},
  author={Dai, Zihang and Liu, Hanxiao and Le, Quoc V and Tan, Mingxing},
  journal={arXiv preprint arXiv:2106.04803},
  year={2021}
}