
Model card for coatnet_rmlp_2_rw_224.sw_in12k_ft_in1k

A timm-specific CoAtNet image classification model (with a Swin-V2-inspired MLP Log-CPB, i.e. continuous log-coordinate relative position bias). Pretrained in timm on ImageNet-12k (a subset of the full ImageNet-22k with 11821 classes) and fine-tuned on ImageNet-1k by Ross Wightman.

The ImageNet-12k pretraining was done on TPUs, thanks to support of the TRC program.

The fine-tuning was done on 8x GPU Lambda Labs cloud instances.

Model variants in maxxvit.py

MaxxViT covers a number of related model architectures that share a common structure, including:

  • CoAtNet - combines MBConv (depthwise-separable) convolution blocks in the early stages with self-attention transformer blocks in the later stages.
  • MaxViT - uses uniform blocks across all stages, each block containing an MBConv (depthwise-separable) convolution block followed by two self-attention blocks with different partitioning schemes (window followed by grid).
  • CoAtNeXt - a timm-specific architecture that uses ConvNeXt blocks in place of the MBConv blocks in CoAtNet. All normalization layers are LayerNorm (no BatchNorm).
  • MaxxViT - a timm-specific architecture that uses ConvNeXt blocks in place of the MBConv blocks in MaxViT. All normalization layers are LayerNorm (no BatchNorm).
  • MaxxViT-V2 - a MaxxViT variant that removes the window-block attention, leaving only the ConvNeXt blocks and grid attention, with increased width to compensate.

Aside from the major variants listed above, there are more subtle changes between the models. Any model name containing the string rw is a timm-specific configuration, with modelling adjustments made to favour PyTorch eager-mode use. These were created while training initial reproductions of the models, so there are variations. Any model name containing the string tf is an exact match to the original paper authors' Tensorflow-based models, with the weights ported to PyTorch. This covers a number of MaxViT models. The official CoAtNet models were never released.
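
This naming scheme maps directly to model names registered in timm. As a minimal sketch (the wildcard patterns below are assumptions based on the scheme described above), the variants available in an installed timm version can be enumerated with timm.list_models:

import timm

# list MaxxViT-family variants registered in the installed timm version;
# the wildcard patterns are assumptions based on the naming scheme above
for pattern in ('coatnet*', 'coatnext*', 'maxvit*', 'maxxvit*'):
    print(pattern, timm.list_models(pattern))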

Model Details

  • Model Type: Image classification / feature backbone
  • Model Stats:
    • Params (M): 73.9
    • GMACs: 15.2
    • Activations (M): 54.8
    • Image size: 224 x 224
  • Papers:
    • CoAtNet: Marrying Convolution and Attention for All Data Sizes: https://arxiv.org/abs/2106.04803
  • Dataset: ImageNet-1k
  • Pretrain Dataset: ImageNet-12k
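
The parameter count above can be sanity-checked by instantiating the model and summing its parameters (a minimal sketch; pretrained=False is used here just to skip the weight download):

import timm

model = timm.create_model('coatnet_rmlp_2_rw_224.sw_in12k_ft_in1k', pretrained=False)
# total parameter count in millions; should be close to the 73.9 listed above
print(sum(p.numel() for p in model.parameters()) / 1e6)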

Model Usage

Image Classification

from urllib.request import urlopen
from PIL import Image
import timm
import torch  # needed for torch.topk below

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('coatnet_rmlp_2_rw_224.sw_in12k_ft_in1k', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
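
A small follow-up sketch for inspecting the result; the indices refer to ImageNet-1k classes, and a label mapping of your choice can be applied on top:

# print the top-5 class indices with their probabilities (in percent)
for prob, idx in zip(top5_probabilities[0].tolist(), top5_class_indices[0].tolist()):
    print(f'class {idx}: {prob:.2f}%')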

Feature Map Extraction

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'coatnet_rmlp_2_rw_224.sw_in12k_ft_in1k',
    pretrained=True,
    features_only=True,
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

for o in output:
    # print shape of each feature map in output
    # e.g.:
    #  torch.Size([1, 128, 112, 112])
    #  torch.Size([1, 128, 56, 56])
    #  torch.Size([1, 256, 28, 28])
    #  torch.Size([1, 512, 14, 14])
    #  torch.Size([1, 1024, 7, 7])

    print(o.shape)
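
If only some of the stages are needed, out_indices can be combined with features_only (a brief sketch; the chosen indices are illustrative):

# keep only the last two feature stages (the indices here are illustrative)
model = timm.create_model(
    'coatnet_rmlp_2_rw_224.sw_in12k_ft_in1k',
    pretrained=True,
    features_only=True,
    out_indices=(3, 4),
)
model = model.eval()
print(model.feature_info.channels())  # channels of each selected stage, e.g. [512, 1024]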

Image Embeddings

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'coatnet_rmlp_2_rw_224.sw_in12k_ft_in1k',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 1024, 7, 7) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
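
The pooled embedding can be used directly for retrieval or similarity search. A minimal sketch (it compares the image with itself, so the expected value is 1.0):

import torch.nn.functional as F

# cosine similarity between two pooled embeddings; both use the same image here
emb_a = model.forward_head(model.forward_features(transforms(img).unsqueeze(0)), pre_logits=True)
emb_b = model.forward_head(model.forward_features(transforms(img).unsqueeze(0)), pre_logits=True)
print(F.cosine_similarity(emb_a, emb_b).item())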

Model Comparison

By Top-1

model top1 top5 samples / sec Params (M) GMAC Act (M)
12311321 88.53 98.64 21.76 475.77 534.14 1413.22
12312321 88.32 98.54 42.53 475.32 292.78 668.76
12313321 88.20 98.53 50.87 119.88 138.02 703.99
12314321 88.04 98.40 36.42 212.33 244.75 942.15
12315321 87.98 98.56 71.75 212.03 132.55 445.84
12316321 87.92 98.54 104.71 119.65 73.80 332.90
12317321 87.81 98.37 106.55 116.14 70.97 318.95
12318321 87.47 98.37 149.49 116.09 72.98 213.74
12319321 87.39 98.31 160.80 73.88 47.69 209.43
12320321 86.89 98.02 375.86 116.14 23.15 92.64
12321321 86.64 98.02 501.03 116.09 24.20 62.77
12322321 86.60 97.92 50.75 119.88 138.02 703.99
12323321 86.57 97.89 631.88 73.87 15.09 49.22
12324321 86.52 97.88 36.04 212.33 244.75 942.15
12325321 86.49 97.90 620.58 73.88 15.18 54.78
12326321 86.29 97.80 101.09 119.65 73.80 332.90
12327321 86.23 97.69 70.56 212.03 132.55 445.84
12328321 86.10 97.76 88.63 69.13 67.26 383.77
12329321 85.67 97.58 144.25 31.05 33.49 257.59
12330321 85.54 97.46 188.35 69.02 35.87 183.65
12331321 85.11 97.38 293.46 30.98 17.53 123.42
12332321 84.93 96.97 247.71 211.79 43.68 127.35
12333321 84.90 96.96 1025.45 41.72 8.11 40.13
12334321 84.85 96.99 358.25 119.47 24.04 95.01
12335321 84.63 97.06 575.53 66.01 14.67 58.38
12336321 84.61 96.74 625.81 73.88 15.18 54.78
12337321 84.49 96.76 693.82 64.90 10.75 49.30
12338321 84.43 96.83 647.96 68.93 11.66 53.17
12339321 84.23 96.78 807.21 29.15 6.77 46.92
12340321 83.62 96.38 989.59 41.72 8.04 34.60
12341321 83.50 96.50 1100.53 29.06 5.11 33.11
12342321 83.41 96.59 1004.94 30.92 5.60 35.78
12343321 83.36 96.45 1093.03 41.69 7.85 35.47
12344321 83.11 96.33 1276.88 23.70 6.26 23.05
12345321 83.03 96.34 1341.24 16.78 4.37 26.05
12346321 82.96 96.26 1283.24 15.50 4.47 31.92
12347321 82.93 96.23 1218.17 15.45 4.46 30.28
12348321 82.39 96.19 1600.14 27.44 4.67 22.04
12349321 82.39 95.84 1831.21 27.44 4.43 18.73
12350321 82.05 95.87 2109.09 15.15 2.62 20.34
12351321 81.95 95.92 2525.52 14.70 2.47 12.80
12352321 81.70 95.64 2344.52 15.14 2.41 15.41
12353321 80.53 95.21 1594.71 7.52 1.85 24.86

By Throughput (samples / sec)

model top1 top5 samples / sec Params (M) GMAC Act (M)
12351321 81.95 95.92 2525.52 14.70 2.47 12.80
12352321 81.70 95.64 2344.52 15.14 2.41 15.41
12350321 82.05 95.87 2109.09 15.15 2.62 20.34
12349321 82.39 95.84 1831.21 27.44 4.43 18.73
12348321 82.39 96.19 1600.14 27.44 4.67 22.04
12353321 80.53 95.21 1594.71 7.52 1.85 24.86
12345321 83.03 96.34 1341.24 16.78 4.37 26.05
12346321 82.96 96.26 1283.24 15.50 4.47 31.92
12344321 83.11 96.33 1276.88 23.70 6.26 23.05
12347321 82.93 96.23 1218.17 15.45 4.46 30.28
12341321 83.50 96.50 1100.53 29.06 5.11 33.11
12343321 83.36 96.45 1093.03 41.69 7.85 35.47
12333321 84.90 96.96 1025.45 41.72 8.11 40.13
12342321 83.41 96.59 1004.94 30.92 5.60 35.78
12340321 83.62 96.38 989.59 41.72 8.04 34.60
12339321 84.23 96.78 807.21 29.15 6.77 46.92
12337321 84.49 96.76 693.82 64.90 10.75 49.30
12338321 84.43 96.83 647.96 68.93 11.66 53.17
12323321 86.57 97.89 631.88 73.87 15.09 49.22
12336321 84.61 96.74 625.81 73.88 15.18 54.78
12325321 86.49 97.90 620.58 73.88 15.18 54.78
12335321 84.63 97.06 575.53 66.01 14.67 58.38
12321321 86.64 98.02 501.03 116.09 24.20 62.77
12320321 86.89 98.02 375.86 116.14 23.15 92.64
12334321 84.85 96.99 358.25 119.47 24.04 95.01
12331321 85.11 97.38 293.46 30.98 17.53 123.42
12332321 84.93 96.97 247.71 211.79 43.68 127.35
12330321 85.54 97.46 188.35 69.02 35.87 183.65
12319321 87.39 98.31 160.80 73.88 47.69 209.43
12318321 87.47 98.37 149.49 116.09 72.98 213.74
12329321 85.67 97.58 144.25 31.05 33.49 257.59
12317321 87.81 98.37 106.55 116.14 70.97 318.95
12316321 87.92 98.54 104.71 119.65 73.80 332.90
12326321 86.29 97.80 101.09 119.65 73.80 332.90
12328321 86.10 97.76 88.63 69.13 67.26 383.77
12315321 87.98 98.56 71.75 212.03 132.55 445.84
12327321 86.23 97.69 70.56 212.03 132.55 445.84
12313321 88.20 98.53 50.87 119.88 138.02 703.99
12322321 86.60 97.92 50.75 119.88 138.02 703.99
12312321 88.32 98.54 42.53 475.32 292.78 668.76
12314321 88.04 98.40 36.42 212.33 244.75 942.15
12324321 86.52 97.88 36.04 212.33 244.75 942.15
12311321 88.53 98.64 21.76 475.77 534.14 1413.22

Citation

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}
@article{tu2022maxvit,
  title={MaxViT: Multi-Axis Vision Transformer},
  author={Tu, Zhengzhong and Talebi, Hossein and Zhang, Han and Yang, Feng and Milanfar, Peyman and Bovik, Alan and Li, Yinxiao},
  journal={ECCV},
  year={2022},
}
@article{dai2021coatnet,
  title={CoAtNet: Marrying Convolution and Attention for All Data Sizes},
  author={Dai, Zihang and Liu, Hanxiao and Le, Quoc V and Tan, Mingxing},
  journal={arXiv preprint arXiv:2106.04803},
  year={2021}
}