timm/coatnet_2_rw_224.sw_in12k

Model card for coatnet_2_rw_224.sw_in12k

A CoAtNet image classification model from timm. Trained on ImageNet-12k (an 11821-class subset of ImageNet-22k) by Ross Wightman.

Model variants in maxxvit.py

MaxxViT covers a number of related model architectures that share a common structure, including the following (a minimal instantiation sketch follows the list):

  • CoAtNet - MBConv (depthwise-separable) convolution blocks in the early stages, self-attention transformer blocks in the later stages.
  • MaxViT - Uniform blocks across all stages, each containing an MBConv (depthwise-separable) convolution block followed by two self-attention blocks with different partitioning schemes (window followed by grid).
  • CoAtNeXt - A timm-specific architecture that uses ConvNeXt blocks in place of the MBConv blocks in CoAtNet. All normalization layers are LayerNorm (no BatchNorm).
  • MaxxViT - A timm-specific architecture that uses ConvNeXt blocks in place of the MBConv blocks in MaxViT. All normalization layers are LayerNorm (no BatchNorm).
  • MaxxViT-V2 - A MaxxViT variation that removes the window block attention, leaving only ConvNeXt blocks and grid attention, with increased width to compensate.
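
Each of these variants is available as a named model in timm. A minimal sketch instantiating one representative of each family (names taken from the comparison tables below; pretrained=False builds the architecture without downloading weights):

import timm

# one representative per family described above
for name in [
    'coatnet_nano_rw_224',       # CoAtNet
    'maxvit_nano_rw_256',        # MaxViT
    'coatnext_nano_rw_224',      # CoAtNeXt (ConvNeXt blocks in CoAtNet)
    'maxxvit_rmlp_nano_rw_256',  # MaxxViT (ConvNeXt blocks in MaxViT)
    'maxxvitv2_nano_rw_256',     # MaxxViT-V2 (grid attention only)
]:
    model = timm.create_model(name, pretrained=False)
    print(name, model.num_features)  # final feature dimension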

Aside from the major variants above, there are more subtle differences between the models. All model names containing the string "rw" use timm-specific configs, with modelling adjustments made to favour PyTorch eager execution. These configs were created while training initial reproductions of the models, so there are some variations. All model names containing the string "tf" exactly match the original paper authors' TensorFlow-based model definitions, with weights ported to PyTorch. This covers a number of MaxViT models. The official CoAtNet models were never released.
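
The "rw" and "tf" naming conventions can be explored with timm.list_models, which accepts wildcard filters; a small sketch:

import timm

# timm-specific configs tuned for PyTorch eager execution
print(timm.list_models('coatnet*rw*'))

# ports of the paper authors' original TensorFlow MaxViT definitions
print(timm.list_models('maxvit*tf*'))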

Model Details

  • Model Type: Image classification / feature backbone
  • Model Stats:
    • Params (M): 85.0
    • GMACs: 15.1
    • Activations (M): 49.2
    • Image size: 224 x 224
  • Papers:
    • CoAtNet: Marrying Convolution and Attention for All Data Sizes: https://arxiv.org/abs/2106.04803
  • Dataset: ImageNet-12k
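
The stats above can be sanity-checked by building the architecture without weights; a minimal sketch (the classifier head covers the 11821 ImageNet-12k classes):

import timm

model = timm.create_model('coatnet_2_rw_224.sw_in12k', pretrained=False)
n_params = sum(p.numel() for p in model.parameters())
print(f'{n_params / 1e6:.1f}M params')  # expect roughly 85.0, per the stats above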

Model Usage

Image Classification

from urllib.request import urlopen
from PIL import Image
import timm
import torch  # needed for torch.topk below

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('coatnet_2_rw_224.sw_in12k', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
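
For pure inference, disabling gradient tracking avoids building the autograd graph and reduces memory use; a minimal sketch reusing model, transforms, and img from the snippet above:

import torch

with torch.no_grad():  # no gradients needed for inference
    output = model(transforms(img).unsqueeze(0))

top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)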

Feature Map Extraction

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'coatnet_2_rw_224.sw_in12k',
    pretrained=True,
    features_only=True,
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

for o in output:
    # print shape of each feature map in output
    # e.g.:
    #  torch.Size([1, 128, 112, 112])
    #  torch.Size([1, 128, 56, 56])
    #  torch.Size([1, 256, 28, 28])
    #  torch.Size([1, 512, 14, 14])
    #  torch.Size([1, 1024, 7, 7])

    print(o.shape)
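
The returned feature maps can be restricted and inspected through timm's feature_info API; a small sketch, assuming the same model as above (the printed values follow from the shapes listed in the comments):

import timm

# restrict extraction to the last three stages via out_indices
model = timm.create_model(
    'coatnet_2_rw_224.sw_in12k',
    pretrained=True,
    features_only=True,
    out_indices=(2, 3, 4),
)

# channel count and reduction factor of each returned feature map,
# available without running a forward pass
print(model.feature_info.channels())   # e.g. [256, 512, 1024]
print(model.feature_info.reduction())  # e.g. [8, 16, 32]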

Image Embeddings

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'coatnet_2_rw_224.sw_in12k',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 1024, 7, 7) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
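
The pooled embedding can be compared across images, e.g. for retrieval; a minimal sketch where img2 is a hypothetical second PIL image loaded the same way as img above:

import torch.nn.functional as F

emb1 = model(transforms(img).unsqueeze(0))   # (1, num_features)
emb2 = model(transforms(img2).unsqueeze(0))  # (1, num_features)

# cosine similarity between the two embeddings
print(F.cosine_similarity(emb1, emb2).item())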

Model Comparison

By Top-1

| model | top1 | top5 | samples / sec | Params (M) | GMAC | Act (M) |
|-------|------|------|---------------|------------|------|---------|
| maxvit_xlarge_tf_512.in21k_ft_in1k | 88.53 | 98.64 | 21.76 | 475.77 | 534.14 | 1413.22 |
| maxvit_xlarge_tf_384.in21k_ft_in1k | 88.32 | 98.54 | 42.53 | 475.32 | 292.78 | 668.76 |
| maxvit_base_tf_512.in21k_ft_in1k | 88.20 | 98.53 | 50.87 | 119.88 | 138.02 | 703.99 |
| maxvit_large_tf_512.in21k_ft_in1k | 88.04 | 98.40 | 36.42 | 212.33 | 244.75 | 942.15 |
| maxvit_large_tf_384.in21k_ft_in1k | 87.98 | 98.56 | 71.75 | 212.03 | 132.55 | 445.84 |
| maxvit_base_tf_384.in21k_ft_in1k | 87.92 | 98.54 | 104.71 | 119.65 | 73.80 | 332.90 |
| maxvit_rmlp_base_rw_384.sw_in12k_ft_in1k | 87.81 | 98.37 | 106.55 | 116.14 | 70.97 | 318.95 |
| maxxvitv2_rmlp_base_rw_384.sw_in12k_ft_in1k | 87.47 | 98.37 | 149.49 | 116.09 | 72.98 | 213.74 |
| coatnet_rmlp_2_rw_384.sw_in12k_ft_in1k | 87.39 | 98.31 | 160.80 | 73.88 | 47.69 | 209.43 |
| maxvit_rmlp_base_rw_224.sw_in12k_ft_in1k | 86.89 | 98.02 | 375.86 | 116.14 | 23.15 | 92.64 |
| maxxvitv2_rmlp_base_rw_224.sw_in12k_ft_in1k | 86.64 | 98.02 | 501.03 | 116.09 | 24.20 | 62.77 |
| maxvit_base_tf_512.in1k | 86.60 | 97.92 | 50.75 | 119.88 | 138.02 | 703.99 |
| coatnet_2_rw_224.sw_in12k_ft_in1k | 86.57 | 97.89 | 631.88 | 73.87 | 15.09 | 49.22 |
| maxvit_large_tf_512.in1k | 86.52 | 97.88 | 36.04 | 212.33 | 244.75 | 942.15 |
| coatnet_rmlp_2_rw_224.sw_in12k_ft_in1k | 86.49 | 97.90 | 620.58 | 73.88 | 15.18 | 54.78 |
| maxvit_base_tf_384.in1k | 86.29 | 97.80 | 101.09 | 119.65 | 73.80 | 332.90 |
| maxvit_large_tf_384.in1k | 86.23 | 97.69 | 70.56 | 212.03 | 132.55 | 445.84 |
| maxvit_small_tf_512.in1k | 86.10 | 97.76 | 88.63 | 69.13 | 67.26 | 383.77 |
| maxvit_tiny_tf_512.in1k | 85.67 | 97.58 | 144.25 | 31.05 | 33.49 | 257.59 |
| maxvit_small_tf_384.in1k | 85.54 | 97.46 | 188.35 | 69.02 | 35.87 | 183.65 |
| maxvit_tiny_tf_384.in1k | 85.11 | 97.38 | 293.46 | 30.98 | 17.53 | 123.42 |
| maxvit_large_tf_224.in1k | 84.93 | 96.97 | 247.71 | 211.79 | 43.68 | 127.35 |
| coatnet_rmlp_1_rw2_224.sw_in12k_ft_in1k | 84.90 | 96.96 | 1025.45 | 41.72 | 8.11 | 40.13 |
| maxvit_base_tf_224.in1k | 84.85 | 96.99 | 358.25 | 119.47 | 24.04 | 95.01 |
| maxxvit_rmlp_small_rw_256.sw_in1k | 84.63 | 97.06 | 575.53 | 66.01 | 14.67 | 58.38 |
| coatnet_rmlp_2_rw_224.sw_in1k | 84.61 | 96.74 | 625.81 | 73.88 | 15.18 | 54.78 |
| maxvit_rmlp_small_rw_224.sw_in1k | 84.49 | 96.76 | 693.82 | 64.90 | 10.75 | 49.30 |
| maxvit_small_tf_224.in1k | 84.43 | 96.83 | 647.96 | 68.93 | 11.66 | 53.17 |
| maxvit_rmlp_tiny_rw_256.sw_in1k | 84.23 | 96.78 | 807.21 | 29.15 | 6.77 | 46.92 |
| coatnet_1_rw_224.sw_in1k | 83.62 | 96.38 | 989.59 | 41.72 | 8.04 | 34.60 |
| maxvit_tiny_rw_224.sw_in1k | 83.50 | 96.50 | 1100.53 | 29.06 | 5.11 | 33.11 |
| maxvit_tiny_tf_224.in1k | 83.41 | 96.59 | 1004.94 | 30.92 | 5.60 | 35.78 |
| coatnet_rmlp_1_rw_224.sw_in1k | 83.36 | 96.45 | 1093.03 | 41.69 | 7.85 | 35.47 |
| maxxvitv2_nano_rw_256.sw_in1k | 83.11 | 96.33 | 1276.88 | 23.70 | 6.26 | 23.05 |
| maxxvit_rmlp_nano_rw_256.sw_in1k | 83.03 | 96.34 | 1341.24 | 16.78 | 4.37 | 26.05 |
| maxvit_rmlp_nano_rw_256.sw_in1k | 82.96 | 96.26 | 1283.24 | 15.50 | 4.47 | 31.92 |
| maxvit_nano_rw_256.sw_in1k | 82.93 | 96.23 | 1218.17 | 15.45 | 4.46 | 30.28 |
| coatnet_bn_0_rw_224.sw_in1k | 82.39 | 96.19 | 1600.14 | 27.44 | 4.67 | 22.04 |
| coatnet_0_rw_224.sw_in1k | 82.39 | 95.84 | 1831.21 | 27.44 | 4.43 | 18.73 |
| coatnext_nano_rw_224.sw_in1k | 82.05 | 95.87 | 2109.09 | 15.15 | 2.62 | 20.34 |
| coatnet_rmlp_nano_rw_224.sw_in1k | 81.95 | 95.92 | 2525.52 | 14.70 | 2.47 | 12.80 |
| coatnet_nano_rw_224.sw_in1k | 81.70 | 95.64 | 2344.52 | 15.14 | 2.41 | 15.41 |
| maxvit_rmlp_pico_rw_256.sw_in1k | 80.53 | 95.21 | 1594.71 | 7.52 | 1.85 | 24.86 |

By Throughput (samples / sec)

| model | top1 | top5 | samples / sec | Params (M) | GMAC | Act (M) |
|-------|------|------|---------------|------------|------|---------|
| coatnet_rmlp_nano_rw_224.sw_in1k | 81.95 | 95.92 | 2525.52 | 14.70 | 2.47 | 12.80 |
| coatnet_nano_rw_224.sw_in1k | 81.70 | 95.64 | 2344.52 | 15.14 | 2.41 | 15.41 |
| coatnext_nano_rw_224.sw_in1k | 82.05 | 95.87 | 2109.09 | 15.15 | 2.62 | 20.34 |
| coatnet_0_rw_224.sw_in1k | 82.39 | 95.84 | 1831.21 | 27.44 | 4.43 | 18.73 |
| coatnet_bn_0_rw_224.sw_in1k | 82.39 | 96.19 | 1600.14 | 27.44 | 4.67 | 22.04 |
| maxvit_rmlp_pico_rw_256.sw_in1k | 80.53 | 95.21 | 1594.71 | 7.52 | 1.85 | 24.86 |
| maxxvit_rmlp_nano_rw_256.sw_in1k | 83.03 | 96.34 | 1341.24 | 16.78 | 4.37 | 26.05 |
| maxvit_rmlp_nano_rw_256.sw_in1k | 82.96 | 96.26 | 1283.24 | 15.50 | 4.47 | 31.92 |
| maxxvitv2_nano_rw_256.sw_in1k | 83.11 | 96.33 | 1276.88 | 23.70 | 6.26 | 23.05 |
| maxvit_nano_rw_256.sw_in1k | 82.93 | 96.23 | 1218.17 | 15.45 | 4.46 | 30.28 |
| maxvit_tiny_rw_224.sw_in1k | 83.50 | 96.50 | 1100.53 | 29.06 | 5.11 | 33.11 |
| coatnet_rmlp_1_rw_224.sw_in1k | 83.36 | 96.45 | 1093.03 | 41.69 | 7.85 | 35.47 |
| coatnet_rmlp_1_rw2_224.sw_in12k_ft_in1k | 84.90 | 96.96 | 1025.45 | 41.72 | 8.11 | 40.13 |
| maxvit_tiny_tf_224.in1k | 83.41 | 96.59 | 1004.94 | 30.92 | 5.60 | 35.78 |
| coatnet_1_rw_224.sw_in1k | 83.62 | 96.38 | 989.59 | 41.72 | 8.04 | 34.60 |
| maxvit_rmlp_tiny_rw_256.sw_in1k | 84.23 | 96.78 | 807.21 | 29.15 | 6.77 | 46.92 |
| maxvit_rmlp_small_rw_224.sw_in1k | 84.49 | 96.76 | 693.82 | 64.90 | 10.75 | 49.30 |
| maxvit_small_tf_224.in1k | 84.43 | 96.83 | 647.96 | 68.93 | 11.66 | 53.17 |
| coatnet_2_rw_224.sw_in12k_ft_in1k | 86.57 | 97.89 | 631.88 | 73.87 | 15.09 | 49.22 |
| coatnet_rmlp_2_rw_224.sw_in1k | 84.61 | 96.74 | 625.81 | 73.88 | 15.18 | 54.78 |
| coatnet_rmlp_2_rw_224.sw_in12k_ft_in1k | 86.49 | 97.90 | 620.58 | 73.88 | 15.18 | 54.78 |
| maxxvit_rmlp_small_rw_256.sw_in1k | 84.63 | 97.06 | 575.53 | 66.01 | 14.67 | 58.38 |
| maxxvitv2_rmlp_base_rw_224.sw_in12k_ft_in1k | 86.64 | 98.02 | 501.03 | 116.09 | 24.20 | 62.77 |
| maxvit_rmlp_base_rw_224.sw_in12k_ft_in1k | 86.89 | 98.02 | 375.86 | 116.14 | 23.15 | 92.64 |
| maxvit_base_tf_224.in1k | 84.85 | 96.99 | 358.25 | 119.47 | 24.04 | 95.01 |
| maxvit_tiny_tf_384.in1k | 85.11 | 97.38 | 293.46 | 30.98 | 17.53 | 123.42 |
| maxvit_large_tf_224.in1k | 84.93 | 96.97 | 247.71 | 211.79 | 43.68 | 127.35 |
| maxvit_small_tf_384.in1k | 85.54 | 97.46 | 188.35 | 69.02 | 35.87 | 183.65 |
| coatnet_rmlp_2_rw_384.sw_in12k_ft_in1k | 87.39 | 98.31 | 160.80 | 73.88 | 47.69 | 209.43 |
| maxxvitv2_rmlp_base_rw_384.sw_in12k_ft_in1k | 87.47 | 98.37 | 149.49 | 116.09 | 72.98 | 213.74 |
| maxvit_tiny_tf_512.in1k | 85.67 | 97.58 | 144.25 | 31.05 | 33.49 | 257.59 |
| maxvit_rmlp_base_rw_384.sw_in12k_ft_in1k | 87.81 | 98.37 | 106.55 | 116.14 | 70.97 | 318.95 |
| maxvit_base_tf_384.in21k_ft_in1k | 87.92 | 98.54 | 104.71 | 119.65 | 73.80 | 332.90 |
| maxvit_base_tf_384.in1k | 86.29 | 97.80 | 101.09 | 119.65 | 73.80 | 332.90 |
| maxvit_small_tf_512.in1k | 86.10 | 97.76 | 88.63 | 69.13 | 67.26 | 383.77 |
| maxvit_large_tf_384.in21k_ft_in1k | 87.98 | 98.56 | 71.75 | 212.03 | 132.55 | 445.84 |
| maxvit_large_tf_384.in1k | 86.23 | 97.69 | 70.56 | 212.03 | 132.55 | 445.84 |
| maxvit_base_tf_512.in21k_ft_in1k | 88.20 | 98.53 | 50.87 | 119.88 | 138.02 | 703.99 |
| maxvit_base_tf_512.in1k | 86.60 | 97.92 | 50.75 | 119.88 | 138.02 | 703.99 |
| maxvit_xlarge_tf_384.in21k_ft_in1k | 88.32 | 98.54 | 42.53 | 475.32 | 292.78 | 668.76 |
| maxvit_large_tf_512.in21k_ft_in1k | 88.04 | 98.40 | 36.42 | 212.33 | 244.75 | 942.15 |
| maxvit_large_tf_512.in1k | 86.52 | 97.88 | 36.04 | 212.33 | 244.75 | 942.15 |
| maxvit_xlarge_tf_512.in21k_ft_in1k | 88.53 | 98.64 | 21.76 | 475.77 | 534.14 | 1413.22 |
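
The throughput figures above depend on the benchmark hardware and settings, so absolute numbers will differ elsewhere. A rough wall-clock timing sketch for relative comparisons on your own machine (assumes a CUDA device; batch size 32 is an arbitrary choice):

import time
import timm
import torch

model = timm.create_model('coatnet_2_rw_224.sw_in12k', pretrained=False).cuda().eval()
x = torch.randn(32, 3, 224, 224, device='cuda')

with torch.inference_mode():
    for _ in range(10):  # warmup iterations
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(50):  # timed iterations
        model(x)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f'{32 * 50 / elapsed:.2f} samples / sec')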

Citation

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}
@inproceedings{tu2022maxvit,
  title={MaxViT: Multi-Axis Vision Transformer},
  author={Tu, Zhengzhong and Talebi, Hossein and Zhang, Han and Yang, Feng and Milanfar, Peyman and Bovik, Alan and Li, Yinxiao},
  booktitle={ECCV},
  year={2022},
}
@article{dai2021coatnet,
  title={CoAtNet: Marrying Convolution and Attention for All Data Sizes},
  author={Dai, Zihang and Liu, Hanxiao and Le, Quoc V and Tan, Mingxing},
  journal={arXiv preprint arXiv:2106.04803},
  year={2021}
}