Model card for coatnet_rmlp_2_rw_224.sw_in1k

A timm-specific CoAtNet image classification model with an MLP Log-CPB (continuous log-coordinate relative position bias, motivated by Swin-V2). Trained in timm on ImageNet-1k by Ross Wightman.

ImageNet-1k training was done on TPUs, thanks to support from the TRC program.

Model variants in maxxvit.py

MaxxViT covers a number of related model architectures that share a common structure, including:

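CoAtNet - Combines MBConv (depthwise-separable) convolutional blocks in the early stages with self-attention transformer blocks in the later stages.
MaxViT - Uses uniform blocks across all stages, each containing an MBConv (depthwise-separable) convolution block followed by two self-attention blocks with different partitioning schemes (window, then grid).
CoAtNeXt - A timm-specific architecture that uses ConvNeXt blocks in place of the MBConv blocks in CoAtNet. All normalization layers are LayerNorm (no BatchNorm).
MaxxViT - A timm-specific architecture that uses ConvNeXt blocks in place of the MBConv blocks in MaxViT. All normalization layers are LayerNorm (no BatchNorm).
MaxxViT-V2 - A MaxxViT variant that removes the window block self-attention, leaving only ConvNeXt blocks and grid attention, with more width to compensate.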
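
The relationships between these variants can be summarized in a small lookup table. The sketch below is only an illustration of the descriptions above: `VariantSummary`, `VARIANTS`, and `block_ops` are hypothetical names, not timm identifiers, and the layout strings are informal labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantSummary:
    conv_block: str   # convolutional block type
    attention: tuple  # attention types used, in order
    layout: str       # informal label for how conv and attention are arranged


STAGED = "conv stages first, attention stages later"
UNIFORM = "same block in every stage"

# Illustrative summary of the list above; not a timm data structure.
VARIANTS = {
    "CoAtNet": VariantSummary("MBConv", ("self-attention",), STAGED),
    "MaxViT": VariantSummary("MBConv", ("window", "grid"), UNIFORM),
    "CoAtNeXt": VariantSummary("ConvNeXt", ("self-attention",), STAGED),
    "MaxxViT": VariantSummary("ConvNeXt", ("window", "grid"), UNIFORM),
    "MaxxViT-V2": VariantSummary("ConvNeXt", ("grid",), UNIFORM),
}


def block_ops(name: str) -> list:
    """Ordered operations inside one block of a MaxViT-style (uniform) variant."""
    variant = VARIANTS[name]
    if variant.layout != UNIFORM:
        raise ValueError(f"{name} does not use one uniform block across stages")
    return [variant.conv_block] + [f"{a} attention" for a in variant.attention]


print(block_ops("MaxViT"))
print(block_ops("MaxxViT-V2"))
```

Read this way, CoAtNeXt and MaxxViT differ from CoAtNet and MaxViT only in the convolutional block, and MaxxViT-V2 differs from MaxxViT only in dropping window attention.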

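Aside from the major variants listed above, there are more subtle differences from model to model. Any model name containing the string rw is a timm-specific config, with modelling adjustments made to favour PyTorch eager use. These configs were created while training the initial reproductions of the models, so they vary somewhat.

All models with the string tf exactly match the TensorFlow-based models from the original paper authors, with weights ported to PyTorch. This covers a number of MaxViT models. The official CoAtNet models were never released.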
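
As a rough illustration of this naming convention, the sketch below splits a model name into its architecture and pretrained tag and checks for the rw and tf markers. It assumes the markers appear as underscore-delimited tokens and that the text after the first `.` is the pretrained tag, as in this card's model name; `describe_model_name` is a hypothetical helper, not part of timm.

```python
def describe_model_name(name: str) -> dict:
    """Split a timm-style model name into architecture, pretrained tag, and origin hint."""
    # Assumption: everything after the first '.' is the pretrained tag.
    arch, _, tag = name.partition(".")
    # Assumption: 'rw' / 'tf' appear as whole underscore-delimited tokens.
    tokens = arch.split("_")
    if "rw" in tokens:
        origin = "timm-specific config"
    elif "tf" in tokens:
        origin = "ported from the original TensorFlow weights"
    else:
        origin = "unspecified"
    return {"architecture": arch, "pretrained_tag": tag, "origin": origin}


info = describe_model_name("coatnet_rmlp_2_rw_224.sw_in1k")
print(info)
```

For this card's model, the helper reports a timm-specific config with the pretrained tag `sw_in1k`.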

Model details

Model type: Image classification / feature backbone
Model stats:
- Params (M): 73.9
- GMACs: 15.2
- Activations (M): 54.8
- Image size: 224 x 224
Papers:
- CoAtNet: Marrying Convolution and Attention for All Data Sizes: https://arxiv.org/abs/2106.04803
- Swin Transformer V2: Scaling Up Capacity and Resolution: https://arxiv.org/abs/2111.09883
Dataset: ImageNet-1k
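
The stats above translate into rough resource estimates with a little arithmetic. The sketch below assumes the standard sizes of 4 bytes per float32 parameter and 2 bytes per float16 parameter, and the common convention that one multiply-accumulate counts as two floating-point operations. The results are approximations that ignore file-format overhead and any non-MAC operations.

```python
PARAMS_M = 73.9  # parameters, in millions (from the model stats)
GMACS = 15.2     # multiply-accumulates per forward pass, in billions

BYTES_PER_PARAM = {"float32": 4, "float16": 2}  # standard dtype sizes


def weight_size_mb(params_m: float, dtype: str) -> float:
    """Approximate size of the raw weights in MB (10**6 bytes)."""
    return params_m * 1e6 * BYTES_PER_PARAM[dtype] / 1e6


def approx_gflops(gmacs: float) -> float:
    """Convert GMACs to GFLOPs, assuming 1 MAC = 2 FLOPs."""
    return 2 * gmacs


print(f"float32 weights: ~{weight_size_mb(PARAMS_M, 'float32'):.1f} MB")
print(f"float16 weights: ~{weight_size_mb(PARAMS_M, 'float16'):.1f} MB")
print(f"forward pass:    ~{approx_gflops(GMACS):.1f} GFLOPs per 224 x 224 image")
```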

Model usage

Image classification

from urllib.request import urlopen
from PIL import Image
import timm
import torch  # needed for torch.topk below

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('coatnet_rmlp_2_rw_224.sw_in1k', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
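
The final line applies a softmax over the class logits, scales the result to percentages, and keeps the five largest entries. A dependency-free sketch of the same computation follows; the six-element `logits` list is a hypothetical stand-in for real model output, and `softmax_percent` and `top_k` are illustrative helpers rather than library functions.

```python
import math


def softmax_percent(logits):
    """Softmax over a list of logits, scaled to percentages."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [100.0 * e / total for e in exps]


def top_k(values, k):
    """Return (values, indices) of the k largest entries, largest first."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in order], order


logits = [0.5, 2.0, -1.0, 3.0, 0.0, 1.0]  # hypothetical scores for 6 classes
probs, indices = top_k(softmax_percent(logits), k=5)
for p, i in zip(probs, indices):
    print(f"class {i}: {p:.2f}%")
```

With the real model, the returned indices refer to positions in the ImageNet-1k label list.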

Feature map extraction

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'coatnet_rmlp_2_rw_224.sw_in1k',
    pretrained=True,
    features_only=True,
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

for o in output:
    # print shape of each feature map in output
    # e.g.:
    #  torch.Size([1, 128, 112, 112])
    #  torch.Size([1, 128, 56, 56])
    #  torch.Size([1, 256, 28, 28])
    #  torch.Size([1, 512, 14, 14])
    #  torch.Size([1, 1024, 7, 7])

    print(o.shape)
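
The example shapes listed in the comments imply the downsampling factor of each feature map relative to the 224 x 224 input. The sketch below derives those factors from the shapes alone; `reduction` is an illustrative helper, not a timm function.

```python
INPUT_SIZE = 224  # from the model stats: 224 x 224 input

# (batch, channels, height, width), copied from the example output above
feature_shapes = [
    (1, 128, 112, 112),
    (1, 128, 56, 56),
    (1, 256, 28, 28),
    (1, 512, 14, 14),
    (1, 1024, 7, 7),
]


def reduction(shape, input_size=INPUT_SIZE):
    """Spatial downsampling factor of a square feature map relative to the input."""
    _, _, height, width = shape
    if height != width or input_size % height != 0:
        raise ValueError(f"unexpected feature map size: {height} x {width}")
    return input_size // height


for shape in feature_shapes:
    print(f"channels={shape[1]:4d}  size={shape[2]:3d}  reduction={reduction(shape)}")
```

Each successive feature map halves the spatial resolution, which is the layout that detection and segmentation heads commonly expect from a backbone.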

Image embeddings

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'coatnet_rmlp_2_rw_224.sw_in1k',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 1024, 7, 7) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
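
One common use of such embeddings is comparing images by cosine similarity. The sketch below shows the computation on plain Python lists; the three 4-dimensional vectors are hypothetical stand-ins for real `(1, num_features)` embeddings, and `cosine_similarity` is an illustrative helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings; real ones would come from the model output.
emb_query = [0.2, 0.9, 0.1, 0.4]
emb_near = [0.25, 0.8, 0.0, 0.5]
emb_far = [0.9, 0.1, 0.7, 0.0]

print(cosine_similarity(emb_query, emb_near))  # closer to 1: more similar
print(cosine_similarity(emb_query, emb_far))   # closer to 0: less similar
```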

Model comparison

By Top-1

model	top1	top5	samples / sec	Params (M)	GMAC	Act (M)
12310321	88.53	98.64	21.76	475.77	534.14	1413.22
12311321	88.32	98.54	42.53	475.32	292.78	668.76
12312321	88.20	98.53	50.87	119.88	138.02	703.99
12313321	88.04	98.40	36.42	212.33	244.75	942.15
12314321	87.98	98.56	71.75	212.03	132.55	445.84
12315321	87.92	98.54	104.71	119.65	73.80	332.90
12316321	87.81	98.37	106.55	116.14	70.97	318.95
12317321	87.47	98.37	149.49	116.09	72.98	213.74
12318321	87.39	98.31	160.80	73.88	47.69	209.43
12319321	86.89	98.02	375.86	116.14	23.15	92.64
12320321	86.64	98.02	501.03	116.09	24.20	62.77
12321321	86.60	97.92	50.75	119.88	138.02	703.99
12322321	86.57	97.89	631.88	73.87	15.09	49.22
12323321	86.52	97.88	36.04	212.33	244.75	942.15
12324321	86.49	97.90	620.58	73.88	15.18	54.78
12325321	86.29	97.80	101.09	119.65	73.80	332.90
12326321	86.23	97.69	70.56	212.03	132.55	445.84
12327321	86.10	97.76	88.63	69.13	67.26	383.77
12328321	85.67	97.58	144.25	31.05	33.49	257.59
12329321	85.54	97.46	188.35	69.02	35.87	183.65
12330321	85.11	97.38	293.46	30.98	17.53	123.42
12331321	84.93	96.97	247.71	211.79	43.68	127.35
12332321	84.90	96.96	1025.45	41.72	8.11	40.13
12333321	84.85	96.99	358.25	119.47	24.04	95.01
12334321	84.63	97.06	575.53	66.01	14.67	58.38
12335321	84.61	96.74	625.81	73.88	15.18	54.78
12336321	84.49	96.76	693.82	64.90	10.75	49.30
12337321	84.43	96.83	647.96	68.93	11.66	53.17
12338321	84.23	96.78	807.21	29.15	6.77	46.92
12339321	83.62	96.38	989.59	41.72	8.04	34.60
12340321	83.50	96.50	1100.53	29.06	5.11	33.11
12341321	83.41	96.59	1004.94	30.92	5.60	35.78
12342321	83.36	96.45	1093.03	41.69	7.85	35.47
12343321	83.11	96.33	1276.88	23.70	6.26	23.05
12344321	83.03	96.34	1341.24	16.78	4.37	26.05
12345321	82.96	96.26	1283.24	15.50	4.47	31.92
12346321	82.93	96.23	1218.17	15.45	4.46	30.28
12347321	82.39	96.19	1600.14	27.44	4.67	22.04
12348321	82.39	95.84	1831.21	27.44	4.43	18.73
12349321	82.05	95.87	2109.09	15.15	2.62	20.34
12350321	81.95	95.92	2525.52	14.70	2.47	12.80
12351321	81.70	95.64	2344.52	15.14	2.41	15.41
12352321	80.53	95.21	1594.71	7.52	1.85	24.86

By throughput (samples / sec)

model	top1	top5	samples / sec	Params (M)	GMAC	Act (M)
12350321	81.95	95.92	2525.52	14.70	2.47	12.80
12351321	81.70	95.64	2344.52	15.14	2.41	15.41
12349321	82.05	95.87	2109.09	15.15	2.62	20.34
12348321	82.39	95.84	1831.21	27.44	4.43	18.73
12347321	82.39	96.19	1600.14	27.44	4.67	22.04
12352321	80.53	95.21	1594.71	7.52	1.85	24.86
12344321	83.03	96.34	1341.24	16.78	4.37	26.05
12345321	82.96	96.26	1283.24	15.50	4.47	31.92
12343321	83.11	96.33	1276.88	23.70	6.26	23.05
12346321	82.93	96.23	1218.17	15.45	4.46	30.28
12340321	83.50	96.50	1100.53	29.06	5.11	33.11
12342321	83.36	96.45	1093.03	41.69	7.85	35.47
12332321	84.90	96.96	1025.45	41.72	8.11	40.13
12341321	83.41	96.59	1004.94	30.92	5.60	35.78
12339321	83.62	96.38	989.59	41.72	8.04	34.60
12338321	84.23	96.78	807.21	29.15	6.77	46.92
12336321	84.49	96.76	693.82	64.90	10.75	49.30
12337321	84.43	96.83	647.96	68.93	11.66	53.17
12322321	86.57	97.89	631.88	73.87	15.09	49.22
12335321	84.61	96.74	625.81	73.88	15.18	54.78
12324321	86.49	97.90	620.58	73.88	15.18	54.78
12334321	84.63	97.06	575.53	66.01	14.67	58.38
12320321	86.64	98.02	501.03	116.09	24.20	62.77
12319321	86.89	98.02	375.86	116.14	23.15	92.64
12333321	84.85	96.99	358.25	119.47	24.04	95.01
12330321	85.11	97.38	293.46	30.98	17.53	123.42
12331321	84.93	96.97	247.71	211.79	43.68	127.35
12329321	85.54	97.46	188.35	69.02	35.87	183.65
12318321	87.39	98.31	160.80	73.88	47.69	209.43
12317321	87.47	98.37	149.49	116.09	72.98	213.74
12328321	85.67	97.58	144.25	31.05	33.49	257.59
12316321	87.81	98.37	106.55	116.14	70.97	318.95
12315321	87.92	98.54	104.71	119.65	73.80	332.90
12325321	86.29	97.80	101.09	119.65	73.80	332.90
12327321	86.10	97.76	88.63	69.13	67.26	383.77
12314321	87.98	98.56	71.75	212.03	132.55	445.84
12326321	86.23	97.69	70.56	212.03	132.55	445.84
12312321	88.20	98.53	50.87	119.88	138.02	703.99
12321321	86.60	97.92	50.75	119.88	138.02	703.99
12311321	88.32	98.54	42.53	475.32	292.78	668.76
12313321	88.04	98.40	36.42	212.33	244.75	942.15
12323321	86.52	97.88	36.04	212.33	244.75	942.15
12310321	88.53	98.64	21.76	475.77	534.14	1413.22
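
The two tables contain the same rows under different sort keys. As a sketch of how they might be used together, the snippet below parses a handful of rows copied from the tables, reproduces both orderings, and picks the most accurate entry that meets a throughput floor. `best_under_budget` and the floor values are illustrative assumptions.

```python
import csv
import io

# A few rows copied verbatim from the tables above (tab-separated).
TABLE = (
    "model\ttop1\ttop5\tsamples / sec\tParams (M)\tGMAC\tAct (M)\n"
    "12324321\t86.49\t97.90\t620.58\t73.88\t15.18\t54.78\n"
    "12335321\t84.61\t96.74\t625.81\t73.88\t15.18\t54.78\n"
    "12332321\t84.90\t96.96\t1025.45\t41.72\t8.11\t40.13\n"
    "12350321\t81.95\t95.92\t2525.52\t14.70\t2.47\t12.80\n"
    "12310321\t88.53\t98.64\t21.76\t475.77\t534.14\t1413.22\n"
)

rows = list(csv.DictReader(io.StringIO(TABLE), delimiter="\t"))


def best_under_budget(rows, min_samples_per_sec):
    """Highest top-1 row whose throughput meets the given floor, or None."""
    fast_enough = [r for r in rows if float(r["samples / sec"]) >= min_samples_per_sec]
    return max(fast_enough, key=lambda r: float(r["top1"]), default=None)


by_top1 = sorted(rows, key=lambda r: float(r["top1"]), reverse=True)
by_speed = sorted(rows, key=lambda r: float(r["samples / sec"]), reverse=True)

print([r["model"] for r in by_top1])
print([r["model"] for r in by_speed])
print(best_under_budget(rows, min_samples_per_sec=600)["model"])
```

Throughput figures depend on hardware and batch size, so a floor chosen this way is only meaningful relative to the setup that produced the table.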

Citation

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}

@article{tu2022maxvit,
  title={MaxViT: Multi-Axis Vision Transformer},
  author={Tu, Zhengzhong and Talebi, Hossein and Zhang, Han and Yang, Feng and Milanfar, Peyman and Bovik, Alan and Li, Yinxiao},
  journal={ECCV},
  year={2022},
}

@article{dai2021coatnet,
  title={CoAtNet: Marrying Convolution and Attention for All Data Sizes},
  author={Dai, Zihang and Liu, Hanxiao and Le, Quoc V and Tan, Mingxing},
  journal={arXiv preprint arXiv:2106.04803},
  year={2021}
}

Author:

PyTorch Image Models

Dataset size:

564.14 MB