Model card for coatnet_3_rw_224.sw_in12k

A timm-specific CoAtNet image classification model. Trained in timm on ImageNet-12k (an 11821-class subset of the full ImageNet-22k) by Ross Wightman.

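Model Variants in maxxvit.py

MaxxViT covers a number of related model architectures that share a common structure, including:

CoAtNet - Combines MBConv (depthwise-separable) convolutional blocks in the early stages with self-attention transformer blocks in the later stages.
MaxViT - Uses uniform blocks across all stages, each containing an MBConv (depthwise-separable) convolution block followed by two self-attention blocks with different partitioning schemes (window followed by grid).
CoAtNeXt - A timm-specific architecture that uses ConvNeXt blocks in place of the MBConv blocks in CoAtNet. All normalization layers are LayerNorm (no BatchNorm).
MaxxViT - A timm-specific architecture that uses ConvNeXt blocks in place of the MBConv blocks in MaxViT. All normalization layers are LayerNorm (no BatchNorm).
MaxxViT-V2 - A MaxxViT variation that removes the window block attention, leaving only ConvNeXt blocks and grid attention, with more width to compensate.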

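Aside from the major variants listed above, there are more subtle changes from model to model. Any model name containing the string rw is a timm-specific config with modelling adjustments made to favour PyTorch eager use. These were created while training initial reproductions of the models, so there are variations. All models with the string tf match the TensorFlow-based models by the original paper authors exactly, with weights ported to PyTorch. This covers a number of MaxViT models. The official CoAtNet models were never released.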
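
The naming convention described above can be checked mechanically. The following is a minimal sketch, assuming names follow the `architecture.pretrained_tag` pattern seen in `coatnet_3_rw_224.sw_in12k`. The helper `describe_model_name` is hypothetical and is not part of timm.

```python
def describe_model_name(name: str) -> dict:
    """Split a timm-style model name into its parts (hypothetical helper, not a timm API)."""
    # Assumption: everything before the first '.' is the architecture,
    # everything after it is the pretrained tag.
    arch, _, tag = name.partition('.')
    tokens = arch.split('_')
    return {
        'architecture': arch,
        'pretrained_tag': tag,
        # 'rw' marks a timm-specific config, 'tf' marks weights ported from TensorFlow.
        'timm_specific': 'rw' in tokens,
        'tf_port': 'tf' in tokens,
    }


info = describe_model_name('coatnet_3_rw_224.sw_in12k')
print(info)
```

For this model the sketch reports a timm-specific (rw) config rather than a TensorFlow port, which matches the note that official CoAtNet weights were never released.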

Model Details

Model Type: Image classification / feature backbone
Model Stats:
- Params (M): 181.8
- GMACs: 33.4
- Activations (M): 73.8
- Image size: 224 x 224
Papers:
- CoAtNet: Marrying Convolution and Attention for All Data Sizes: https://arxiv.org/abs/2106.04803
Dataset: ImageNet-12k
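
As a rough illustration of what these statistics imply, the sketch below converts them into a weight-memory estimate and an approximate FLOP count. It assumes 4 bytes per parameter for fp32, 2 bytes for fp16/bf16, and the common convention of 2 FLOPs per multiply-accumulate. These conventions are assumptions for illustration, not figures from the model card.

```python
params_m = 181.8   # parameters, in millions (from the stats above)
gmacs = 33.4       # multiply-accumulates per 224 x 224 image, in billions

# Assumption: weights only, ignoring optimizer state and activations.
fp32_mb = params_m * 1e6 * 4 / 1e6   # 4 bytes per fp32 parameter
fp16_mb = params_m * 1e6 * 2 / 1e6   # 2 bytes per fp16/bf16 parameter

# Assumption: 1 MAC is counted as 2 FLOPs (one multiply, one add).
gflops = gmacs * 2

print(f'fp32 weights: ~{fp32_mb:.1f} MB')
print(f'fp16 weights: ~{fp16_mb:.1f} MB')
print(f'approx. GFLOPs per image: {gflops:.1f}')
```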

Model Usage

Image Classification

from urllib.request import urlopen
from PIL import Image
import timm
import torch  # needed for torch.topk below

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('coatnet_3_rw_224.sw_in12k', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
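
To make the final line of the classification example concrete, here is a small framework-free sketch of what `softmax(dim=1) * 100` followed by `topk(k=5)` computes for a single row of logits. The logit values are made up for illustration, and the helper `softmax_percent` is hypothetical.

```python
import math


def softmax_percent(logits):
    """Convert raw logits into percentages that sum to 100 (hypothetical helper)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [100.0 * e / total for e in exps]


# Made-up logits for a 7-class toy problem (the real model outputs 11821 values).
logits = [0.5, 2.0, -1.0, 3.5, 0.0, 1.2, -0.3]
probs = softmax_percent(logits)

# Indices of the five largest probabilities, highest first, like torch.topk(..., k=5).
top5_indices = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:5]
top5_probs = [probs[i] for i in top5_indices]

for idx, p in zip(top5_indices, top5_probs):
    print(f'class {idx}: {p:.2f}%')
```

The returned indices refer to positions in the model's class list, so mapping them to readable labels requires the ImageNet-12k class names, which are not shown here.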

Feature Map Extraction

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'coatnet_3_rw_224.sw_in12k',
    pretrained=True,
    features_only=True,
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

for o in output:
    # print shape of each feature map in output
    # e.g.:
    #  torch.Size([1, 192, 112, 112])
    #  torch.Size([1, 192, 56, 56])
    #  torch.Size([1, 384, 28, 28])
    #  torch.Size([1, 768, 14, 14])
    #  torch.Size([1, 1536, 7, 7])

    print(o.shape)
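
The example shapes above imply a fixed reduction factor per stage. The following sketch uses only the shapes listed in the comments and the 224 x 224 input size to derive each feature map's stride and channel count.

```python
input_size = 224

# Shapes copied from the example output above: (batch, channels, height, width).
feature_shapes = [
    (1, 192, 112, 112),
    (1, 192, 56, 56),
    (1, 384, 28, 28),
    (1, 768, 14, 14),
    (1, 1536, 7, 7),
]

# Stride (reduction) of each map relative to the input image.
reductions = [input_size // h for _, _, h, _ in feature_shapes]
channels = [c for _, c, _, _ in feature_shapes]

for c, r in zip(channels, reductions):
    print(f'channels={c:4d}  reduction={r:2d}')
```

Each stage halves the spatial resolution of the previous one, which is the layout most detection and segmentation heads expect from a backbone. In timm, models created with `features_only=True` should also report these values through `model.feature_info.channels()` and `model.feature_info.reduction()`.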

Image Embeddings

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'coatnet_3_rw_224.sw_in12k',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 1536, 7, 7) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
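
A common use of such embeddings is comparing images by cosine similarity. The sketch below is framework-free and uses short made-up vectors in place of real embedding rows. `cosine_similarity` is a hypothetical helper, not a timm function.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError('vectors must have the same length')
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up stand-ins for rows of the (batch_size, num_features) output tensor.
emb_a = [0.2, -0.1, 0.7, 0.4]
emb_b = [0.4, -0.2, 1.4, 0.8]   # same direction as emb_a, twice the magnitude
emb_c = [-0.7, 0.4, 0.2, 0.1]   # orthogonal to emb_a

sim_ab = cosine_similarity(emb_a, emb_b)
sim_ac = cosine_similarity(emb_a, emb_c)
print(f'a vs b: {sim_ab:.3f}')
print(f'a vs c: {sim_ac:.3f}')
```

Because cosine similarity ignores vector magnitude, `emb_a` and `emb_b` score as near-identical, while the orthogonal pair scores near zero.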

Model Comparison

By Top-1

model	top1	top5	samples / sec	Params (M)	GMAC	Act (M)
1238321	88.53	98.64	21.76	475.77	534.14	1413.22
1239321	88.32	98.54	42.53	475.32	292.78	668.76
12310321	88.20	98.53	50.87	119.88	138.02	703.99
12311321	88.04	98.40	36.42	212.33	244.75	942.15
12312321	87.98	98.56	71.75	212.03	132.55	445.84
12313321	87.92	98.54	104.71	119.65	73.80	332.90
12314321	87.81	98.37	106.55	116.14	70.97	318.95
12315321	87.47	98.37	149.49	116.09	72.98	213.74
12316321	87.39	98.31	160.80	73.88	47.69	209.43
12317321	86.89	98.02	375.86	116.14	23.15	92.64
12318321	86.64	98.02	501.03	116.09	24.20	62.77
12319321	86.60	97.92	50.75	119.88	138.02	703.99
12320321	86.57	97.89	631.88	73.87	15.09	49.22
12321321	86.52	97.88	36.04	212.33	244.75	942.15
12322321	86.49	97.90	620.58	73.88	15.18	54.78
12323321	86.29	97.80	101.09	119.65	73.80	332.90
12324321	86.23	97.69	70.56	212.03	132.55	445.84
12325321	86.10	97.76	88.63	69.13	67.26	383.77
12326321	85.67	97.58	144.25	31.05	33.49	257.59
12327321	85.54	97.46	188.35	69.02	35.87	183.65
12328321	85.11	97.38	293.46	30.98	17.53	123.42
12329321	84.93	96.97	247.71	211.79	43.68	127.35
12330321	84.90	96.96	1025.45	41.72	8.11	40.13
12331321	84.85	96.99	358.25	119.47	24.04	95.01
12332321	84.63	97.06	575.53	66.01	14.67	58.38
12333321	84.61	96.74	625.81	73.88	15.18	54.78
12334321	84.49	96.76	693.82	64.90	10.75	49.30
12335321	84.43	96.83	647.96	68.93	11.66	53.17
12336321	84.23	96.78	807.21	29.15	6.77	46.92
12337321	83.62	96.38	989.59	41.72	8.04	34.60
12338321	83.50	96.50	1100.53	29.06	5.11	33.11
12339321	83.41	96.59	1004.94	30.92	5.60	35.78
12340321	83.36	96.45	1093.03	41.69	7.85	35.47
12341321	83.11	96.33	1276.88	23.70	6.26	23.05
12342321	83.03	96.34	1341.24	16.78	4.37	26.05
12343321	82.96	96.26	1283.24	15.50	4.47	31.92
12344321	82.93	96.23	1218.17	15.45	4.46	30.28
12345321	82.39	96.19	1600.14	27.44	4.67	22.04
12346321	82.39	95.84	1831.21	27.44	4.43	18.73
12347321	82.05	95.87	2109.09	15.15	2.62	20.34
12348321	81.95	95.92	2525.52	14.70	2.47	12.80
12349321	81.70	95.64	2344.52	15.14	2.41	15.41
12350321	80.53	95.21	1594.71	7.52	1.85	24.86

By Throughput (samples / sec)

model	top1	top5	samples / sec	Params (M)	GMAC	Act (M)
12348321	81.95	95.92	2525.52	14.70	2.47	12.80
12349321	81.70	95.64	2344.52	15.14	2.41	15.41
12347321	82.05	95.87	2109.09	15.15	2.62	20.34
12346321	82.39	95.84	1831.21	27.44	4.43	18.73
12345321	82.39	96.19	1600.14	27.44	4.67	22.04
12350321	80.53	95.21	1594.71	7.52	1.85	24.86
12342321	83.03	96.34	1341.24	16.78	4.37	26.05
12343321	82.96	96.26	1283.24	15.50	4.47	31.92
12341321	83.11	96.33	1276.88	23.70	6.26	23.05
12344321	82.93	96.23	1218.17	15.45	4.46	30.28
12338321	83.50	96.50	1100.53	29.06	5.11	33.11
12340321	83.36	96.45	1093.03	41.69	7.85	35.47
12330321	84.90	96.96	1025.45	41.72	8.11	40.13
12339321	83.41	96.59	1004.94	30.92	5.60	35.78
12337321	83.62	96.38	989.59	41.72	8.04	34.60
12336321	84.23	96.78	807.21	29.15	6.77	46.92
12334321	84.49	96.76	693.82	64.90	10.75	49.30
12335321	84.43	96.83	647.96	68.93	11.66	53.17
12320321	86.57	97.89	631.88	73.87	15.09	49.22
12333321	84.61	96.74	625.81	73.88	15.18	54.78
12322321	86.49	97.90	620.58	73.88	15.18	54.78
12332321	84.63	97.06	575.53	66.01	14.67	58.38
12318321	86.64	98.02	501.03	116.09	24.20	62.77
12317321	86.89	98.02	375.86	116.14	23.15	92.64
12331321	84.85	96.99	358.25	119.47	24.04	95.01
12328321	85.11	97.38	293.46	30.98	17.53	123.42
12329321	84.93	96.97	247.71	211.79	43.68	127.35
12327321	85.54	97.46	188.35	69.02	35.87	183.65
12316321	87.39	98.31	160.80	73.88	47.69	209.43
12315321	87.47	98.37	149.49	116.09	72.98	213.74
12326321	85.67	97.58	144.25	31.05	33.49	257.59
12314321	87.81	98.37	106.55	116.14	70.97	318.95
12313321	87.92	98.54	104.71	119.65	73.80	332.90
12323321	86.29	97.80	101.09	119.65	73.80	332.90
12325321	86.10	97.76	88.63	69.13	67.26	383.77
12312321	87.98	98.56	71.75	212.03	132.55	445.84
12324321	86.23	97.69	70.56	212.03	132.55	445.84
12310321	88.20	98.53	50.87	119.88	138.02	703.99
12319321	86.60	97.92	50.75	119.88	138.02	703.99
1239321	88.32	98.54	42.53	475.32	292.78	668.76
12311321	88.04	98.40	36.42	212.33	244.75	942.15
12321321	86.52	97.88	36.04	212.33	244.75	942.15
1238321	88.53	98.64	21.76	475.77	534.14	1413.22
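
Both tables list the same rows in different orders. A minimal sketch of that re-sorting, using three rows copied from the tables above (the first column is kept exactly as it appears there):

```python
# (model, top1, top5, samples_per_sec, params_m, gmac, act_m), copied from the tables above.
rows = [
    ('1238321', 88.53, 98.64, 21.76, 475.77, 534.14, 1413.22),
    ('12330321', 84.90, 96.96, 1025.45, 41.72, 8.11, 40.13),
    ('12348321', 81.95, 95.92, 2525.52, 14.70, 2.47, 12.80),
]

# Highest accuracy first, then highest throughput first.
by_top1 = sorted(rows, key=lambda r: r[1], reverse=True)
by_throughput = sorted(rows, key=lambda r: r[3], reverse=True)

print([r[0] for r in by_top1])
print([r[0] for r in by_throughput])
```

For these three rows the two orderings are exact opposites, which reflects the accuracy versus throughput trade-off visible across the full tables.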

Citation

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}

@article{tu2022maxvit,
  title={MaxViT: Multi-Axis Vision Transformer},
  author={Tu, Zhengzhong and Talebi, Hossein and Zhang, Han and Yang, Feng and Milanfar, Peyman and Bovik, Alan and Li, Yinxiao},
  journal={ECCV},
  year={2022},
}

@article{dai2021coatnet,
  title={CoAtNet: Marrying Convolution and Attention for All Data Sizes},
  author={Dai, Zihang and Liu, Hanxiao and Le, Quoc V and Tan, Mingxing},
  journal={arXiv preprint arXiv:2106.04803},
  year={2021}
}

Author:

PyTorch Image Models

Dataset size:

1.36 GB