
Model card for maxxvit_rmlp_nano_rw_256.sw_in1k

A timm-specific MaxxViT image classification model (with MLP Log-CPB, i.e. the Swin-V2 inspired continuous log-coordinate relative position bias). Trained on ImageNet-1k in timm by Ross Wightman.

ImageNet-1k training was done on TPUs thanks to the support of the TRC program.

Model Variants in maxxvit.py

MaxxViT covers a number of related model architectures that share a common structure, including:

  • CoAtNet - combines MBConv (depthwise-separable) convolution blocks in early stages with self-attention transformer blocks in later stages.
  • MaxViT - uniform blocks across all stages, each containing an MBConv (depthwise-separable) convolution block followed by two self-attention blocks with different partitioning schemes (window followed by grid).
  • CoAtNeXt - a timm-specific arch that uses ConvNeXt blocks in place of the MBConv blocks in CoAtNet. All normalization layers are LayerNorm (no BatchNorm).
  • MaxxViT - a timm-specific arch that uses ConvNeXt blocks in place of the MBConv blocks in MaxViT. All normalization layers are LayerNorm (no BatchNorm).
  • MaxxViT-V2 - a MaxxViT variation that removes the window block attention, leaving only ConvNeXt blocks and grid attention, with greater width to compensate.

Aside from the major variants listed above, there are more subtle differences from model to model. Any model name containing the string rw is a timm-specific config with modelling adjustments made to favour PyTorch eager use. These models were created while training initial reproductions of the models, so there are variations. All models containing the string tf exactly match Tensorflow-based models created by the original paper authors, with weights ported to PyTorch. This covers a number of MaxViT models. The official CoAtNet models were never released.
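
To check which of these variants are available with pretrained weights in a given timm install, timm.list_models accepts a wildcard name filter. A minimal sketch (results depend on your installed timm version):

import timm

# Enumerate MaxxViT-family configs; pretrained=True keeps only those
# with downloadable weights.
for pattern in ('coatnet*', 'coatnext*', 'maxvit*', 'maxxvit*'):
    print(pattern)
    print(timm.list_models(pattern, pretrained=True))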

Model Details

Model Usage

Image Classification

from urllib.request import urlopen
from PIL import Image
import timm
import torch  # needed below for softmax / torch.topk

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('maxxvit_rmlp_nano_rw_256.sw_in1k', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
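
The top-5 indices above are ImageNet-1k class ids. Printing human-readable names requires a label list; a minimal sketch, assuming the index-aligned plain-text label file from the pytorch/hub repository (the label URL is an assumption, not an asset of this model):

# Assumed label file: 1000 ImageNet-1k class names, one per line,
# index-aligned with the model's 1000 outputs.
labels_url = 'https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt'
labels = urlopen(labels_url).read().decode('utf-8').splitlines()

for prob, idx in zip(top5_probabilities[0], top5_class_indices[0]):
    print(f'{labels[idx]}: {prob.item():.2f}%')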

Feature Map Extraction

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'maxxvit_rmlp_nano_rw_256.sw_in1k',
    pretrained=True,
    features_only=True,
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

for o in output:
    # print shape of each feature map in output
    # e.g.:
    #  torch.Size([1, 64, 128, 128])
    #  torch.Size([1, 64, 64, 64])
    #  torch.Size([1, 128, 32, 32])
    #  torch.Size([1, 256, 16, 16])
    #  torch.Size([1, 512, 8, 8])

    print(o.shape)
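
If only a subset of the five stages is needed, features_only models also accept an out_indices argument, and the channel/stride metadata for the selected maps is exposed via model.feature_info. A brief sketch:

# Keep only the stride 16 and stride 32 feature maps.
model = timm.create_model(
    'maxxvit_rmlp_nano_rw_256.sw_in1k',
    pretrained=True,
    features_only=True,
    out_indices=(3, 4),
)
model = model.eval()

print(model.feature_info.channels())   # [256, 512]
print(model.feature_info.reduction())  # [16, 32]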

Image Embeddings

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'maxxvit_rmlp_nano_rw_256.sw_in1k',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 512, 8, 8) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
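
A common use for the pooled embedding is image-to-image similarity; a minimal sketch comparing two embeddings with cosine similarity (the same image is embedded twice here as a stand-in for a real pair):

import torch
import torch.nn.functional as F

def embed(im):
    # Pooled (1, num_features) embedding from the num_classes=0 model above.
    with torch.inference_mode():
        return model(transforms(im).unsqueeze(0))

emb1 = embed(img)
emb2 = embed(img)  # substitute a second PIL image for a real comparison
print(F.cosine_similarity(emb1, emb2).item())  # 1.0 for identical inputs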

Model Comparison

By Top-1

model top1 top5 samples / sec Params (M) GMAC Act (M)
maxvit_xlarge_tf_512.in21k_ft_in1k  88.53  98.64  21.76  475.77  534.14  1413.22
maxvit_xlarge_tf_384.in21k_ft_in1k  88.32  98.54  42.53  475.32  292.78  668.76
maxvit_base_tf_512.in21k_ft_in1k  88.20  98.53  50.87  119.88  138.02  703.99
maxvit_large_tf_512.in21k_ft_in1k  88.04  98.40  36.42  212.33  244.75  942.15
maxvit_large_tf_384.in21k_ft_in1k  87.98  98.56  71.75  212.03  132.55  445.84
maxvit_base_tf_384.in21k_ft_in1k  87.92  98.54  104.71  119.65  73.80  332.90
maxvit_rmlp_base_rw_384.sw_in12k_ft_in1k  87.81  98.37  106.55  116.14  70.97  318.95
maxxvitv2_rmlp_base_rw_384.sw_in12k_ft_in1k  87.47  98.37  149.49  116.09  72.98  213.74
coatnet_rmlp_2_rw_384.sw_in12k_ft_in1k  87.39  98.31  160.80  73.88  47.69  209.43
maxvit_rmlp_base_rw_224.sw_in12k_ft_in1k  86.89  98.02  375.86  116.14  23.15  92.64
maxxvitv2_rmlp_base_rw_224.sw_in12k_ft_in1k  86.64  98.02  501.03  116.09  24.20  62.77
maxvit_base_tf_512.in1k  86.60  97.92  50.75  119.88  138.02  703.99
coatnet_2_rw_224.sw_in12k_ft_in1k  86.57  97.89  631.88  73.87  15.09  49.22
maxvit_large_tf_512.in1k  86.52  97.88  36.04  212.33  244.75  942.15
coatnet_rmlp_2_rw_224.sw_in12k_ft_in1k  86.49  97.90  620.58  73.88  15.18  54.78
maxvit_base_tf_384.in1k  86.29  97.80  101.09  119.65  73.80  332.90
maxvit_large_tf_384.in1k  86.23  97.69  70.56  212.03  132.55  445.84
maxvit_small_tf_512.in1k  86.10  97.76  88.63  69.13  67.26  383.77
maxvit_tiny_tf_512.in1k  85.67  97.58  144.25  31.05  33.49  257.59
maxvit_small_tf_384.in1k  85.54  97.46  188.35  69.02  35.87  183.65
maxvit_tiny_tf_384.in1k  85.11  97.38  293.46  30.98  17.53  123.42
maxvit_large_tf_224.in1k  84.93  96.97  247.71  211.79  43.68  127.35
coatnet_rmlp_1_rw2_224.sw_in12k_ft_in1k  84.90  96.96  1025.45  41.72  8.11  40.13
maxvit_base_tf_224.in1k  84.85  96.99  358.25  119.47  24.04  95.01
maxxvit_rmlp_small_rw_256.sw_in1k  84.63  97.06  575.53  66.01  14.67  58.38
coatnet_rmlp_2_rw_224.sw_in1k  84.61  96.74  625.81  73.88  15.18  54.78
maxvit_rmlp_small_rw_224.sw_in1k  84.49  96.76  693.82  64.90  10.75  49.30
maxvit_small_tf_224.in1k  84.43  96.83  647.96  68.93  11.66  53.17
maxvit_rmlp_tiny_rw_256.sw_in1k  84.23  96.78  807.21  29.15  6.77  46.92
coatnet_1_rw_224.sw_in1k  83.62  96.38  989.59  41.72  8.04  34.60
maxvit_tiny_rw_224.sw_in1k  83.50  96.50  1100.53  29.06  5.11  33.11
maxvit_tiny_tf_224.in1k  83.41  96.59  1004.94  30.92  5.60  35.78
coatnet_rmlp_1_rw_224.sw_in1k  83.36  96.45  1093.03  41.69  7.85  35.47
maxxvitv2_nano_rw_256.sw_in1k  83.11  96.33  1276.88  23.70  6.26  23.05
maxvit_rmlp_nano_rw_256.sw_in1k  83.03  96.34  1341.24  16.78  4.37  26.05
maxvit_nano_rw_256.sw_in1k  82.96  96.26  1283.24  15.50  4.47  31.92
maxxvit_rmlp_nano_rw_256.sw_in1k  82.93  96.23  1218.17  15.45  4.46  30.28
coatnet_bn_0_rw_224.sw_in1k  82.39  96.19  1600.14  27.44  4.67  22.04
coatnet_0_rw_224.sw_in1k  82.39  95.84  1831.21  27.44  4.43  18.73
coatnext_nano_rw_224.sw_in1k  82.05  95.87  2109.09  15.15  2.62  20.34
coatnet_rmlp_nano_rw_224.sw_in1k  81.95  95.92  2525.52  14.70  2.47  12.80
coatnet_nano_rw_224.sw_in1k  81.70  95.64  2344.52  15.14  2.41  15.41
maxvit_rmlp_pico_rw_256.sw_in1k  80.53  95.21  1594.71  7.52  1.85  24.86

By Throughput (samples / sec)

model top1 top5 samples / sec Params (M) GMAC Act (M)
coatnet_rmlp_nano_rw_224.sw_in1k  81.95  95.92  2525.52  14.70  2.47  12.80
coatnet_nano_rw_224.sw_in1k  81.70  95.64  2344.52  15.14  2.41  15.41
coatnext_nano_rw_224.sw_in1k  82.05  95.87  2109.09  15.15  2.62  20.34
coatnet_0_rw_224.sw_in1k  82.39  95.84  1831.21  27.44  4.43  18.73
coatnet_bn_0_rw_224.sw_in1k  82.39  96.19  1600.14  27.44  4.67  22.04
maxvit_rmlp_pico_rw_256.sw_in1k  80.53  95.21  1594.71  7.52  1.85  24.86
maxvit_rmlp_nano_rw_256.sw_in1k  83.03  96.34  1341.24  16.78  4.37  26.05
maxvit_nano_rw_256.sw_in1k  82.96  96.26  1283.24  15.50  4.47  31.92
maxxvitv2_nano_rw_256.sw_in1k  83.11  96.33  1276.88  23.70  6.26  23.05
maxxvit_rmlp_nano_rw_256.sw_in1k  82.93  96.23  1218.17  15.45  4.46  30.28
maxvit_tiny_rw_224.sw_in1k  83.50  96.50  1100.53  29.06  5.11  33.11
coatnet_rmlp_1_rw_224.sw_in1k  83.36  96.45  1093.03  41.69  7.85  35.47
coatnet_rmlp_1_rw2_224.sw_in12k_ft_in1k  84.90  96.96  1025.45  41.72  8.11  40.13
maxvit_tiny_tf_224.in1k  83.41  96.59  1004.94  30.92  5.60  35.78
coatnet_1_rw_224.sw_in1k  83.62  96.38  989.59  41.72  8.04  34.60
maxvit_rmlp_tiny_rw_256.sw_in1k  84.23  96.78  807.21  29.15  6.77  46.92
maxvit_rmlp_small_rw_224.sw_in1k  84.49  96.76  693.82  64.90  10.75  49.30
maxvit_small_tf_224.in1k  84.43  96.83  647.96  68.93  11.66  53.17
coatnet_2_rw_224.sw_in12k_ft_in1k  86.57  97.89  631.88  73.87  15.09  49.22
coatnet_rmlp_2_rw_224.sw_in1k  84.61  96.74  625.81  73.88  15.18  54.78
coatnet_rmlp_2_rw_224.sw_in12k_ft_in1k  86.49  97.90  620.58  73.88  15.18  54.78
maxxvit_rmlp_small_rw_256.sw_in1k  84.63  97.06  575.53  66.01  14.67  58.38
maxxvitv2_rmlp_base_rw_224.sw_in12k_ft_in1k  86.64  98.02  501.03  116.09  24.20  62.77
maxvit_rmlp_base_rw_224.sw_in12k_ft_in1k  86.89  98.02  375.86  116.14  23.15  92.64
maxvit_base_tf_224.in1k  84.85  96.99  358.25  119.47  24.04  95.01
maxvit_tiny_tf_384.in1k  85.11  97.38  293.46  30.98  17.53  123.42
maxvit_large_tf_224.in1k  84.93  96.97  247.71  211.79  43.68  127.35
maxvit_small_tf_384.in1k  85.54  97.46  188.35  69.02  35.87  183.65
coatnet_rmlp_2_rw_384.sw_in12k_ft_in1k  87.39  98.31  160.80  73.88  47.69  209.43
maxxvitv2_rmlp_base_rw_384.sw_in12k_ft_in1k  87.47  98.37  149.49  116.09  72.98  213.74
maxvit_tiny_tf_512.in1k  85.67  97.58  144.25  31.05  33.49  257.59
maxvit_rmlp_base_rw_384.sw_in12k_ft_in1k  87.81  98.37  106.55  116.14  70.97  318.95
maxvit_base_tf_384.in21k_ft_in1k  87.92  98.54  104.71  119.65  73.80  332.90
maxvit_base_tf_384.in1k  86.29  97.80  101.09  119.65  73.80  332.90
maxvit_small_tf_512.in1k  86.10  97.76  88.63  69.13  67.26  383.77
maxvit_large_tf_384.in21k_ft_in1k  87.98  98.56  71.75  212.03  132.55  445.84
maxvit_large_tf_384.in1k  86.23  97.69  70.56  212.03  132.55  445.84
maxvit_base_tf_512.in21k_ft_in1k  88.20  98.53  50.87  119.88  138.02  703.99
maxvit_base_tf_512.in1k  86.60  97.92  50.75  119.88  138.02  703.99
maxvit_xlarge_tf_384.in21k_ft_in1k  88.32  98.54  42.53  475.32  292.78  668.76
maxvit_large_tf_512.in21k_ft_in1k  88.04  98.40  36.42  212.33  244.75  942.15
maxvit_large_tf_512.in1k  86.52  97.88  36.04  212.33  244.75  942.15
maxvit_xlarge_tf_512.in21k_ft_in1k  88.53  98.64  21.76  475.77  534.14  1413.22

Citation

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}
@article{tu2022maxvit,
  title={MaxViT: Multi-Axis Vision Transformer},
  author={Tu, Zhengzhong and Talebi, Hossein and Zhang, Han and Yang, Feng and Milanfar, Peyman and Bovik, Alan and Li, Yinxiao},
  journal={ECCV},
  year={2022},
}
@article{dai2021coatnet,
  title={CoAtNet: Marrying Convolution and Attention for All Data Sizes},
  author={Dai, Zihang and Liu, Hanxiao and Le, Quoc V and Tan, Mingxing},
  journal={arXiv preprint arXiv:2106.04803},
  year={2021}
}