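Model card for maxvit_rmlp_base_rw_224.sw_in12k_ft_in1k

A timm-specific MaxViT image classification model with an MLP Log-CPB (continuous log-coordinate relative position bias, inspired by Swin-V2). Pretrained on ImageNet-12k (an 11821-class subset of the full ImageNet-22k) and fine-tuned on ImageNet-1k by Ross Wightman.

ImageNet-12k pretraining and ImageNet-1k fine-tuning were performed on 8x GPU Lambda Labs cloud instances.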

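Model Variants in maxxvit.py

MaxxViT covers a number of related model architectures that share a common structure, including:

  • CoAtNet: combines MBConv (depthwise-separable) convolutional blocks in the early stages with self-attention transformer blocks in the later stages.
  • MaxViT: uses a uniform block structure across all stages, each containing an MBConv (depthwise-separable) convolutional block followed by two self-attention blocks with different partitioning schemes (window, then grid).
  • CoAtNeXt: a timm-specific architecture that uses ConvNeXt blocks in place of the MBConv blocks in CoAtNet. All normalization layers are LayerNorm (no BatchNorm).
  • MaxxViT: a timm-specific architecture that uses ConvNeXt blocks in place of the MBConv blocks in MaxViT. All normalization layers are LayerNorm (no BatchNorm).
  • MaxxViT-V2: a MaxxViT variation that removes the window block attention, leaving only ConvNeXt blocks and grid attention, with more width to compensate.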
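
The difference between the two MaxViT partitioning schemes (window and grid) can be illustrated in plain Python. This is a minimal sketch, not timm's implementation: the helper names `window_partition` and `grid_partition` and the 4 x 4 toy feature map are assumptions made for illustration. Each function returns groups of (row, col) positions that would attend to one another.

```python
def window_partition(height, width, window):
    """Group positions into non-overlapping window x window blocks (local attention)."""
    groups = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            groups.append([
                (top + dr, left + dc)
                for dr in range(window)
                for dc in range(window)
            ])
    return groups


def grid_partition(height, width, grid):
    """Group positions into grid x grid sets sampled with a fixed stride (sparse, map-wide attention)."""
    stride_r, stride_c = height // grid, width // grid
    groups = []
    for off_r in range(stride_r):
        for off_c in range(stride_c):
            groups.append([
                (off_r + gr * stride_r, off_c + gc * stride_c)
                for gr in range(grid)
                for gc in range(grid)
            ])
    return groups


# Toy 4 x 4 feature map, split into groups of 2 x 2 positions (hypothetical sizes).
windows = window_partition(4, 4, window=2)
grids = grid_partition(4, 4, grid=2)
print(windows[0])  # neighbouring positions in the top-left corner
print(grids[0])    # positions spread across the whole map
```

Both schemes produce groups of the same size, so the attention cost per group is identical; only the spatial reach of each group differs.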

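Aside from the major variants listed above, there are more subtle changes from model to model. Any model name containing the string rw is a timm-specific config with modelling adjustments made to favour PyTorch eager use. These were created while training initial reproductions of the models, so there are variations. All models with the string tf exactly match the Tensorflow-based models by the original paper authors, with weights ported to PyTorch. This covers a number of MaxViT models. The official CoAtNet models were never released.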
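
The naming convention can be checked programmatically. The sketch below is illustrative only: `describe_model_name` is a hypothetical helper, not part of timm, and it assumes that the `rw` / `tf` markers appear as underscore-separated tokens and that the pretrained tag follows the first dot.

```python
def describe_model_name(full_name):
    """Split a model name into architecture and pretrained tag, and flag rw / tf variants."""
    architecture, _, pretrained_tag = full_name.partition('.')
    tokens = architecture.split('_')  # assumption: markers are separate tokens
    return {
        'architecture': architecture,
        'pretrained_tag': pretrained_tag,
        'timm_specific_config': 'rw' in tokens,
        'ported_from_tensorflow': 'tf' in tokens,
    }


info = describe_model_name('maxvit_rmlp_base_rw_224.sw_in12k_ft_in1k')
for key, value in info.items():
    print(f"{key}: {value}")
```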

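Model Details

  • Model Type: Image classification / feature backbone
  • Model Stats:
    • Params (M): 116.1
    • GMACs: 23.1
    • Activations (M): 92.6
    • Image size: 224 x 224
  • Papers:
    • MaxViT: Multi-Axis Vision Transformer
    • CoAtNet: Marrying Convolution and Attention for All Data Sizes
  • Dataset: ImageNet-1k
  • Pretrain Dataset: ImageNet-12k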

Model Usage

Image Classification

from urllib.request import urlopen
from PIL import Image
import timm
import torch  # needed for torch.topk below

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('maxvit_rmlp_base_rw_224.sw_in12k_ft_in1k', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
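
To make the final line of the classification snippet concrete, here is a dependency-free sketch of what a softmax followed by a top-k selection computes. The logits are made-up illustrative values, not model output, and `softmax_percent` and `top_k` are hypothetical helpers that mirror `output.softmax(dim=1) * 100` and `torch.topk` for a single sample.

```python
import math


def softmax_percent(logits):
    """Convert raw logits to percentages that sum to 100."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [100.0 * e / total for e in exps]


def top_k(values, k):
    """Return (values, indices) of the k largest entries, largest first."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in order], order


logits = [0.5, 2.0, -1.0, 3.0, 0.0, 1.0]  # made-up scores for 6 classes
probabilities, class_indices = top_k(softmax_percent(logits), k=5)
for idx, prob in zip(class_indices, probabilities):
    print(f"class {idx}: {prob:.2f}%")
```

In the real snippet the indices refer to ImageNet-1k classes, so mapping them to label names requires a separate class-index list.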

Feature Map Extraction

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'maxvit_rmlp_base_rw_224.sw_in12k_ft_in1k',
    pretrained=True,
    features_only=True,
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

for o in output:
    # print shape of each feature map in output
    # e.g.:
    #  torch.Size([1, 64, 112, 112])
    #  torch.Size([1, 96, 56, 56])
    #  torch.Size([1, 192, 28, 28])
    #  torch.Size([1, 384, 14, 14])
    #  torch.Size([1, 768, 7, 7])

    print(o.shape)
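
The example shapes in the comments imply a fixed reduction (stride) for each feature level relative to the 224 x 224 input. A small arithmetic sketch, using only the numbers quoted in those comments:

```python
input_size = 224
# (channels, spatial size) pairs copied from the example shapes in the comments
feature_maps = [(64, 112), (96, 56), (192, 28), (384, 14), (768, 7)]

# Reduction factor of each level relative to the input resolution
reductions = [input_size // size for _, size in feature_maps]
for (channels, size), reduction in zip(feature_maps, reductions):
    print(f"{channels} channels at {size} x {size}: stride {reduction}")
```

Each level halves the spatial resolution of the previous one, which is the layout that feature-pyramid style detection and segmentation heads typically expect.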

Image Embeddings

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'maxvit_rmlp_base_rw_224.sw_in12k_ft_in1k',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 768, 7, 7) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor

Model Comparison

By Top-1

model top1 top5 samples / sec Params (M) GMAC Act (M)
12310321 88.53 98.64 21.76 475.77 534.14 1413.22
12311321 88.32 98.54 42.53 475.32 292.78 668.76
12312321 88.20 98.53 50.87 119.88 138.02 703.99
12313321 88.04 98.40 36.42 212.33 244.75 942.15
12314321 87.98 98.56 71.75 212.03 132.55 445.84
12315321 87.92 98.54 104.71 119.65 73.80 332.90
12316321 87.81 98.37 106.55 116.14 70.97 318.95
12317321 87.47 98.37 149.49 116.09 72.98 213.74
12318321 87.39 98.31 160.80 73.88 47.69 209.43
12319321 86.89 98.02 375.86 116.14 23.15 92.64
12320321 86.64 98.02 501.03 116.09 24.20 62.77
12321321 86.60 97.92 50.75 119.88 138.02 703.99
12322321 86.57 97.89 631.88 73.87 15.09 49.22
12323321 86.52 97.88 36.04 212.33 244.75 942.15
12324321 86.49 97.90 620.58 73.88 15.18 54.78
12325321 86.29 97.80 101.09 119.65 73.80 332.90
12326321 86.23 97.69 70.56 212.03 132.55 445.84
12327321 86.10 97.76 88.63 69.13 67.26 383.77
12328321 85.67 97.58 144.25 31.05 33.49 257.59
12329321 85.54 97.46 188.35 69.02 35.87 183.65
12330321 85.11 97.38 293.46 30.98 17.53 123.42
12331321 84.93 96.97 247.71 211.79 43.68 127.35
12332321 84.90 96.96 1025.45 41.72 8.11 40.13
12333321 84.85 96.99 358.25 119.47 24.04 95.01
12334321 84.63 97.06 575.53 66.01 14.67 58.38
12335321 84.61 96.74 625.81 73.88 15.18 54.78
12336321 84.49 96.76 693.82 64.90 10.75 49.30
12337321 84.43 96.83 647.96 68.93 11.66 53.17
12338321 84.23 96.78 807.21 29.15 6.77 46.92
12339321 83.62 96.38 989.59 41.72 8.04 34.60
12340321 83.50 96.50 1100.53 29.06 5.11 33.11
12341321 83.41 96.59 1004.94 30.92 5.60 35.78
12342321 83.36 96.45 1093.03 41.69 7.85 35.47
12343321 83.11 96.33 1276.88 23.70 6.26 23.05
12344321 83.03 96.34 1341.24 16.78 4.37 26.05
12345321 82.96 96.26 1283.24 15.50 4.47 31.92
12346321 82.93 96.23 1218.17 15.45 4.46 30.28
12347321 82.39 96.19 1600.14 27.44 4.67 22.04
12348321 82.39 95.84 1831.21 27.44 4.43 18.73
12349321 82.05 95.87 2109.09 15.15 2.62 20.34
12350321 81.95 95.92 2525.52 14.70 2.47 12.80
12351321 81.70 95.64 2344.52 15.14 2.41 15.41
12352321 80.53 95.21 1594.71 7.52 1.85 24.86

By Throughput (samples / sec)

model top1 top5 samples / sec Params (M) GMAC Act (M)
12350321 81.95 95.92 2525.52 14.70 2.47 12.80
12351321 81.70 95.64 2344.52 15.14 2.41 15.41
12349321 82.05 95.87 2109.09 15.15 2.62 20.34
12348321 82.39 95.84 1831.21 27.44 4.43 18.73
12347321 82.39 96.19 1600.14 27.44 4.67 22.04
12352321 80.53 95.21 1594.71 7.52 1.85 24.86
12344321 83.03 96.34 1341.24 16.78 4.37 26.05
12345321 82.96 96.26 1283.24 15.50 4.47 31.92
12343321 83.11 96.33 1276.88 23.70 6.26 23.05
12346321 82.93 96.23 1218.17 15.45 4.46 30.28
12340321 83.50 96.50 1100.53 29.06 5.11 33.11
12342321 83.36 96.45 1093.03 41.69 7.85 35.47
12332321 84.90 96.96 1025.45 41.72 8.11 40.13
12341321 83.41 96.59 1004.94 30.92 5.60 35.78
12339321 83.62 96.38 989.59 41.72 8.04 34.60
12338321 84.23 96.78 807.21 29.15 6.77 46.92
12336321 84.49 96.76 693.82 64.90 10.75 49.30
12337321 84.43 96.83 647.96 68.93 11.66 53.17
12322321 86.57 97.89 631.88 73.87 15.09 49.22
12335321 84.61 96.74 625.81 73.88 15.18 54.78
12324321 86.49 97.90 620.58 73.88 15.18 54.78
12334321 84.63 97.06 575.53 66.01 14.67 58.38
12320321 86.64 98.02 501.03 116.09 24.20 62.77
12319321 86.89 98.02 375.86 116.14 23.15 92.64
12333321 84.85 96.99 358.25 119.47 24.04 95.01
12330321 85.11 97.38 293.46 30.98 17.53 123.42
12331321 84.93 96.97 247.71 211.79 43.68 127.35
12329321 85.54 97.46 188.35 69.02 35.87 183.65
12318321 87.39 98.31 160.80 73.88 47.69 209.43
12317321 87.47 98.37 149.49 116.09 72.98 213.74
12328321 85.67 97.58 144.25 31.05 33.49 257.59
12316321 87.81 98.37 106.55 116.14 70.97 318.95
12315321 87.92 98.54 104.71 119.65 73.80 332.90
12325321 86.29 97.80 101.09 119.65 73.80 332.90
12327321 86.10 97.76 88.63 69.13 67.26 383.77
12314321 87.98 98.56 71.75 212.03 132.55 445.84
12326321 86.23 97.69 70.56 212.03 132.55 445.84
12312321 88.20 98.53 50.87 119.88 138.02 703.99
12321321 86.60 97.92 50.75 119.88 138.02 703.99
12311321 88.32 98.54 42.53 475.32 292.78 668.76
12313321 88.04 98.40 36.42 212.33 244.75 942.15
12323321 86.52 97.88 36.04 212.33 244.75 942.15
12310321 88.53 98.64 21.76 475.77 534.14 1413.22
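
The two orderings above describe an accuracy versus speed trade-off. One way to read them together is to keep only the entries that no other entry beats on both axes (a Pareto front). The sketch below is illustrative: it uses a handful of rows copied from the tables, with the model identifiers exactly as they appear there, and `pareto_front` is a hypothetical helper.

```python
# (model id as it appears in the tables, top1, samples / sec): a subset of rows
rows = [
    ("12319321", 86.89, 375.86),
    ("12320321", 86.64, 501.03),
    ("12322321", 86.57, 631.88),
    ("12324321", 86.49, 620.58),
    ("12333321", 84.85, 358.25),
    ("12332321", 84.90, 1025.45),
    ("12350321", 81.95, 2525.52),
]


def pareto_front(entries):
    """Keep entries that no other entry matches or beats on both top-1 and throughput."""
    front = []
    for name, top1, speed in entries:
        dominated = any(
            other_top1 >= top1 and other_speed >= speed
            and (other_top1 > top1 or other_speed > speed)
            for _, other_top1, other_speed in entries
        )
        if not dominated:
            front.append(name)
    return front


print(pareto_front(rows))
```

Within this subset, an entry is dropped only when some other row is at least as accurate and at least as fast, and strictly better on one of the two.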

Citation

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}
@article{tu2022maxvit,
  title={MaxViT: Multi-Axis Vision Transformer},
  author={Tu, Zhengzhong and Talebi, Hossein and Zhang, Han and Yang, Feng and Milanfar, Peyman and Bovik, Alan and Li, Yinxiao},
  journal={ECCV},
  year={2022},
}
@article{dai2021coatnet,
  title={CoAtNet: Marrying Convolution and Attention for All Data Sizes},
  author={Dai, Zihang and Liu, Hanxiao and Le, Quoc V and Tan, Mingxing},
  journal={arXiv preprint arXiv:2106.04803},
  year={2021}
}