Model: timm/maxvit_rmlp_nano_rw_256.sw_in1k

Model card for maxvit_rmlp_nano_rw_256.sw_in1k

A timm-specific MaxViT image classification model with an MLP Log-CPB (continuous log-coordinate relative position bias, motivated by Swin-V2) position embedding. Trained on ImageNet-1k in timm by Ross Wightman.
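For reference (a sketch of the published Swin-V2 formulation, not code from this repository): log-CPB produces the attention bias for a relative offset (Δx, Δy) between two tokens by feeding log-spaced coordinates through a small MLP,

$$\widehat{\Delta x} = \operatorname{sign}(\Delta x)\cdot\log(1+|\Delta x|),\qquad \widehat{\Delta y} = \operatorname{sign}(\Delta y)\cdot\log(1+|\Delta y|)$$

$$B(\Delta x,\Delta y) = \mathrm{MLP}\big(\widehat{\Delta x},\ \widehat{\Delta y}\big)$$

Because the bias is a continuous function of the coordinates rather than a fixed lookup table, it transfers more gracefully across window and image sizes.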

ImageNet-1k training was done on TPUs, thanks to the support of the TRC program.

Model Variants in maxxvit.py

MaxxViT covers a number of related model architectures that share a common structure, including:

  • CoAtNet - Combines MBConv (depthwise separable) convolution blocks in early stages with self-attention transformer blocks in later stages.
  • MaxViT - Uniform blocks across all stages, each containing an MBConv (depthwise separable) convolution block followed by two self-attention blocks with different partitioning schemes (window followed by grid).
  • CoAtNeXt - A timm-specific arch that uses ConvNeXt blocks in place of the MBConv blocks in CoAtNet. All normalization layers are LayerNorm (no BatchNorm).
  • MaxxViT - A timm-specific arch that uses ConvNeXt blocks in place of the MBConv blocks in MaxViT. All normalization layers are LayerNorm (no BatchNorm).
  • MaxxViT-V2 - A MaxxViT variation that removes the window block attention, leaving only ConvNeXt blocks and grid attention, with more width to compensate.

Aside from the major variants listed above, there are more subtle differences from model to model. Any model name that includes the string rw is a timm-specific config with modelling adjustments made to favour PyTorch eager-mode use. These were created while initially reproducing the models early in training efforts, so there are some variations. Any model name that includes the string tf is an exact match to the original paper authors' TensorFlow-based models, with weights ported to PyTorch. This covers a number of MaxViT models. The official CoAtNet models were never released.
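As an illustration (the wildcard patterns here are assumptions about the naming scheme, not part of the original card), the rw and tf variants can be enumerated from timm's model registry:

import timm

# timm-specific configs adjusted for PyTorch eager use carry 'rw' in the name
print([m for m in timm.list_models('maxvit*') if '_rw' in m])

# exact ports of the original authors' TensorFlow weights carry 'tf' in the name
print(timm.list_models('maxvit*_tf_*', pretrained=True))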

Model Details

  • Model Type: Image classification / feature backbone
  • Model Stats: Params (M): 15.5; GMACs: 4.5; Activations (M): 31.9; Image size: 256 x 256
  • Dataset: ImageNet-1k

Model Usage

Image Classification

from urllib.request import urlopen
from PIL import Image
import timm
import torch  # needed for torch.topk below

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('maxvit_rmlp_nano_rw_256.sw_in1k', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
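To turn the top-5 class indices into human-readable labels, one option (an assumed follow-up, not part of the original card) is the plain-text ImageNet-1k label list used by the PyTorch hub examples:

# fetch the 1000 ImageNet-1k class names (one per line)
classes = urlopen(
    'https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt'
).read().decode('utf-8').splitlines()

for prob, idx in zip(top5_probabilities[0], top5_class_indices[0]):
    print(f'{classes[idx]}: {prob.item():.2f}%')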

Feature Map Extraction

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'maxvit_rmlp_nano_rw_256.sw_in1k',
    pretrained=True,
    features_only=True,
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

for o in output:
    # print shape of each feature map in output
    # e.g.:
    #  torch.Size([1, 64, 128, 128])
    #  torch.Size([1, 64, 64, 64])
    #  torch.Size([1, 128, 32, 32])
    #  torch.Size([1, 256, 16, 16])
    #  torch.Size([1, 512, 8, 8])

    print(o.shape)
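The channel counts and strides of these feature maps can also be queried without running a forward pass, via the feature_info attribute that timm attaches to features_only models; a minimal sketch:

# channels and reduction factors of the five feature levels (match the shapes above)
print(model.feature_info.channels())   # [64, 64, 128, 256, 512]
print(model.feature_info.reduction())  # [2, 4, 8, 16, 32]

# out_indices restricts extraction to selected levels at creation time
model_deep = timm.create_model(
    'maxvit_rmlp_nano_rw_256.sw_in1k',
    pretrained=True,
    features_only=True,
    out_indices=(2, 3, 4),  # keep only the three deepest feature maps
)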

Image Embeddings

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'maxvit_rmlp_nano_rw_256.sw_in1k',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 512, 8, 8) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
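As a usage sketch (not part of the original card), the pooled embeddings can be compared with cosine similarity, e.g. for image retrieval or deduplication:

import torch.nn.functional as F

# reusing `model` (num_classes=0) and `transforms` from above; a second,
# different image would normally replace the duplicate here
emb_a = model(transforms(img).unsqueeze(0))  # (1, num_features)
emb_b = model(transforms(img).unsqueeze(0))

sim = F.cosine_similarity(emb_a, emb_b)  # shape (1,); 1.0 for identical inputs
print(sim.item())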

Model Comparison

By Top-1

|model |top1 |top5 |samples / sec |Params (M) |GMAC |Act (M)|
|---|---|---|---|---|---|---|
|maxvit_xlarge_tf_512.in21k_ft_in1k |88.53 |98.64 |21.76 |475.77 |534.14 |1413.22|
|maxvit_xlarge_tf_384.in21k_ft_in1k |88.32 |98.54 |42.53 |475.32 |292.78 |668.76|
|maxvit_base_tf_512.in21k_ft_in1k |88.20 |98.53 |50.87 |119.88 |138.02 |703.99|
|maxvit_large_tf_512.in21k_ft_in1k |88.04 |98.40 |36.42 |212.33 |244.75 |942.15|
|maxvit_large_tf_384.in21k_ft_in1k |87.98 |98.56 |71.75 |212.03 |132.55 |445.84|
|maxvit_base_tf_384.in21k_ft_in1k |87.92 |98.54 |104.71 |119.65 |73.80 |332.90|
|maxvit_rmlp_base_rw_384.sw_in12k_ft_in1k |87.81 |98.37 |106.55 |116.14 |70.97 |318.95|
|maxxvitv2_rmlp_base_rw_384.sw_in12k_ft_in1k |87.47 |98.37 |149.49 |116.09 |72.98 |213.74|
|coatnet_rmlp_2_rw_384.sw_in12k_ft_in1k |87.39 |98.31 |160.80 |73.88 |47.69 |209.43|
|maxvit_rmlp_base_rw_224.sw_in12k_ft_in1k |86.89 |98.02 |375.86 |116.14 |23.15 |92.64|
|maxxvitv2_rmlp_base_rw_224.sw_in12k_ft_in1k |86.64 |98.02 |501.03 |116.09 |24.20 |62.77|
|maxvit_base_tf_512.in1k |86.60 |97.92 |50.75 |119.88 |138.02 |703.99|
|coatnet_2_rw_224.sw_in12k_ft_in1k |86.57 |97.89 |631.88 |73.87 |15.09 |49.22|
|maxvit_large_tf_512.in1k |86.52 |97.88 |36.04 |212.33 |244.75 |942.15|
|coatnet_rmlp_2_rw_224.sw_in12k_ft_in1k |86.49 |97.90 |620.58 |73.88 |15.18 |54.78|
|maxvit_base_tf_384.in1k |86.29 |97.80 |101.09 |119.65 |73.80 |332.90|
|maxvit_large_tf_384.in1k |86.23 |97.69 |70.56 |212.03 |132.55 |445.84|
|maxvit_small_tf_512.in1k |86.10 |97.76 |88.63 |69.13 |67.26 |383.77|
|maxvit_tiny_tf_512.in1k |85.67 |97.58 |144.25 |31.05 |33.49 |257.59|
|maxvit_small_tf_384.in1k |85.54 |97.46 |188.35 |69.02 |35.87 |183.65|
|maxvit_tiny_tf_384.in1k |85.11 |97.38 |293.46 |30.98 |17.53 |123.42|
|maxvit_large_tf_224.in1k |84.93 |96.97 |247.71 |211.79 |43.68 |127.35|
|coatnet_rmlp_1_rw2_224.sw_in12k_ft_in1k |84.90 |96.96 |1025.45 |41.72 |8.11 |40.13|
|maxvit_base_tf_224.in1k |84.85 |96.99 |358.25 |119.47 |24.04 |95.01|
|maxxvit_rmlp_small_rw_256.sw_in1k |84.63 |97.06 |575.53 |66.01 |14.67 |58.38|
|coatnet_rmlp_2_rw_224.sw_in1k |84.61 |96.74 |625.81 |73.88 |15.18 |54.78|
|maxvit_rmlp_small_rw_224.sw_in1k |84.49 |96.76 |693.82 |64.90 |10.75 |49.30|
|maxvit_small_tf_224.in1k |84.43 |96.83 |647.96 |68.93 |11.66 |53.17|
|maxvit_rmlp_tiny_rw_256.sw_in1k |84.23 |96.78 |807.21 |29.15 |6.77 |46.92|
|coatnet_1_rw_224.sw_in1k |83.62 |96.38 |989.59 |41.72 |8.04 |34.60|
|maxvit_tiny_rw_224.sw_in1k |83.50 |96.50 |1100.53 |29.06 |5.11 |33.11|
|maxvit_tiny_tf_224.in1k |83.41 |96.59 |1004.94 |30.92 |5.60 |35.78|
|coatnet_rmlp_1_rw_224.sw_in1k |83.36 |96.45 |1093.03 |41.69 |7.85 |35.47|
|maxxvitv2_nano_rw_256.sw_in1k |83.11 |96.33 |1276.88 |23.70 |6.26 |23.05|
|maxxvit_rmlp_nano_rw_256.sw_in1k |83.03 |96.34 |1341.24 |16.78 |4.37 |26.05|
|maxvit_rmlp_nano_rw_256.sw_in1k |82.96 |96.26 |1283.24 |15.50 |4.47 |31.92|
|maxvit_nano_rw_256.sw_in1k |82.93 |96.23 |1218.17 |15.45 |4.46 |30.28|
|coatnet_bn_0_rw_224.sw_in1k |82.39 |96.19 |1600.14 |27.44 |4.67 |22.04|
|coatnet_0_rw_224.sw_in1k |82.39 |95.84 |1831.21 |27.44 |4.43 |18.73|
|coatnet_rmlp_nano_rw_224.sw_in1k |82.05 |95.87 |2109.09 |15.15 |2.62 |20.34|
|coatnext_nano_rw_224.sw_in1k |81.95 |95.92 |2525.52 |14.70 |2.47 |12.80|
|coatnet_nano_rw_224.sw_in1k |81.70 |95.64 |2344.52 |15.14 |2.41 |15.41|
|maxvit_rmlp_pico_rw_256.sw_in1k |80.53 |95.21 |1594.71 |7.52 |1.85 |24.86|

By Throughput (samples / sec)

|model |top1 |top5 |samples / sec |Params (M) |GMAC |Act (M)|
|---|---|---|---|---|---|---|
|coatnext_nano_rw_224.sw_in1k |81.95 |95.92 |2525.52 |14.70 |2.47 |12.80|
|coatnet_nano_rw_224.sw_in1k |81.70 |95.64 |2344.52 |15.14 |2.41 |15.41|
|coatnet_rmlp_nano_rw_224.sw_in1k |82.05 |95.87 |2109.09 |15.15 |2.62 |20.34|
|coatnet_0_rw_224.sw_in1k |82.39 |95.84 |1831.21 |27.44 |4.43 |18.73|
|coatnet_bn_0_rw_224.sw_in1k |82.39 |96.19 |1600.14 |27.44 |4.67 |22.04|
|maxvit_rmlp_pico_rw_256.sw_in1k |80.53 |95.21 |1594.71 |7.52 |1.85 |24.86|
|maxxvit_rmlp_nano_rw_256.sw_in1k |83.03 |96.34 |1341.24 |16.78 |4.37 |26.05|
|maxvit_rmlp_nano_rw_256.sw_in1k |82.96 |96.26 |1283.24 |15.50 |4.47 |31.92|
|maxxvitv2_nano_rw_256.sw_in1k |83.11 |96.33 |1276.88 |23.70 |6.26 |23.05|
|maxvit_nano_rw_256.sw_in1k |82.93 |96.23 |1218.17 |15.45 |4.46 |30.28|
|maxvit_tiny_rw_224.sw_in1k |83.50 |96.50 |1100.53 |29.06 |5.11 |33.11|
|coatnet_rmlp_1_rw_224.sw_in1k |83.36 |96.45 |1093.03 |41.69 |7.85 |35.47|
|coatnet_rmlp_1_rw2_224.sw_in12k_ft_in1k |84.90 |96.96 |1025.45 |41.72 |8.11 |40.13|
|maxvit_tiny_tf_224.in1k |83.41 |96.59 |1004.94 |30.92 |5.60 |35.78|
|coatnet_1_rw_224.sw_in1k |83.62 |96.38 |989.59 |41.72 |8.04 |34.60|
|maxvit_rmlp_tiny_rw_256.sw_in1k |84.23 |96.78 |807.21 |29.15 |6.77 |46.92|
|maxvit_rmlp_small_rw_224.sw_in1k |84.49 |96.76 |693.82 |64.90 |10.75 |49.30|
|maxvit_small_tf_224.in1k |84.43 |96.83 |647.96 |68.93 |11.66 |53.17|
|coatnet_2_rw_224.sw_in12k_ft_in1k |86.57 |97.89 |631.88 |73.87 |15.09 |49.22|
|coatnet_rmlp_2_rw_224.sw_in1k |84.61 |96.74 |625.81 |73.88 |15.18 |54.78|
|coatnet_rmlp_2_rw_224.sw_in12k_ft_in1k |86.49 |97.90 |620.58 |73.88 |15.18 |54.78|
|maxxvit_rmlp_small_rw_256.sw_in1k |84.63 |97.06 |575.53 |66.01 |14.67 |58.38|
|maxxvitv2_rmlp_base_rw_224.sw_in12k_ft_in1k |86.64 |98.02 |501.03 |116.09 |24.20 |62.77|
|maxvit_rmlp_base_rw_224.sw_in12k_ft_in1k |86.89 |98.02 |375.86 |116.14 |23.15 |92.64|
|maxvit_base_tf_224.in1k |84.85 |96.99 |358.25 |119.47 |24.04 |95.01|
|maxvit_tiny_tf_384.in1k |85.11 |97.38 |293.46 |30.98 |17.53 |123.42|
|maxvit_large_tf_224.in1k |84.93 |96.97 |247.71 |211.79 |43.68 |127.35|
|maxvit_small_tf_384.in1k |85.54 |97.46 |188.35 |69.02 |35.87 |183.65|
|coatnet_rmlp_2_rw_384.sw_in12k_ft_in1k |87.39 |98.31 |160.80 |73.88 |47.69 |209.43|
|maxxvitv2_rmlp_base_rw_384.sw_in12k_ft_in1k |87.47 |98.37 |149.49 |116.09 |72.98 |213.74|
|maxvit_tiny_tf_512.in1k |85.67 |97.58 |144.25 |31.05 |33.49 |257.59|
|maxvit_rmlp_base_rw_384.sw_in12k_ft_in1k |87.81 |98.37 |106.55 |116.14 |70.97 |318.95|
|maxvit_base_tf_384.in21k_ft_in1k |87.92 |98.54 |104.71 |119.65 |73.80 |332.90|
|maxvit_base_tf_384.in1k |86.29 |97.80 |101.09 |119.65 |73.80 |332.90|
|maxvit_small_tf_512.in1k |86.10 |97.76 |88.63 |69.13 |67.26 |383.77|
|maxvit_large_tf_384.in21k_ft_in1k |87.98 |98.56 |71.75 |212.03 |132.55 |445.84|
|maxvit_large_tf_384.in1k |86.23 |97.69 |70.56 |212.03 |132.55 |445.84|
|maxvit_base_tf_512.in21k_ft_in1k |88.20 |98.53 |50.87 |119.88 |138.02 |703.99|
|maxvit_base_tf_512.in1k |86.60 |97.92 |50.75 |119.88 |138.02 |703.99|
|maxvit_xlarge_tf_384.in21k_ft_in1k |88.32 |98.54 |42.53 |475.32 |292.78 |668.76|
|maxvit_large_tf_512.in21k_ft_in1k |88.04 |98.40 |36.42 |212.33 |244.75 |942.15|
|maxvit_large_tf_512.in1k |86.52 |97.88 |36.04 |212.33 |244.75 |942.15|
|maxvit_xlarge_tf_512.in21k_ft_in1k |88.53 |98.64 |21.76 |475.77 |534.14 |1413.22|
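As a quick sanity check on the Params (M) column (a sketch, not part of the original card), a model's parameter count can be computed directly:

import timm

m = timm.create_model('maxvit_rmlp_nano_rw_256.sw_in1k')
# ~15.5 million parameters, matching the 15.50 entry for this model above
print(sum(p.numel() for p in m.parameters()) / 1e6)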

Citation

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}
@article{tu2022maxvit,
  title={MaxViT: Multi-Axis Vision Transformer},
  author={Tu, Zhengzhong and Talebi, Hossein and Zhang, Han and Yang, Feng and Milanfar, Peyman and Bovik, Alan and Li, Yinxiao},
  journal={ECCV},
  year={2022},
}
@article{dai2021coatnet,
  title={CoAtNet: Marrying Convolution and Attention for All Data Sizes},
  author={Dai, Zihang and Liu, Hanxiao and Le, Quoc V and Tan, Mingxing},
  journal={arXiv preprint arXiv:2106.04803},
  year={2021}
}