Model card for maxxvitv2_rmlp_base_rw_384.sw_in12k_ft_in1k

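A timm-specific MaxxViT-V2 image classification model with an MLP Log-CPB (continuous log-coordinate relative position bias, motivated by Swin-V2). Pretrained in timm on ImageNet-12k (an 11821-class subset of the full ImageNet-22k) and fine-tuned on ImageNet-1k by Ross Wightman.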

ImageNet-12k pretraining and ImageNet-1k fine-tuning were performed on 8x GPU Lambda Labs cloud instances.

Model variants in maxxvit.py

MaxxViT covers a number of related model architectures that share a common structure, including:

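CoAtNet - Combines MBConv (depthwise-separable) convolutional blocks in the early stages with self-attention transformer blocks in the later stages.
MaxViT - Uses uniform blocks across all stages, each containing an MBConv (depthwise-separable) convolution block followed by two self-attention blocks with different partitioning schemes (window and grid).
CoAtNeXt - A timm-specific architecture that uses ConvNeXt blocks in place of the MBConv blocks in CoAtNet. All normalization layers are LayerNorm (no BatchNorm).
MaxxViT - A timm-specific architecture that uses ConvNeXt blocks in place of the MBConv blocks in MaxViT. All normalization layers are LayerNorm (no BatchNorm).
MaxxViT-V2 - A MaxxViT variant that removes the window block attention, leaving only ConvNeXt blocks and grid attention, with more width to compensate.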
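
The two partitioning schemes named in the MaxViT entry can be illustrated without any deep learning library. The sketch below is not timm's implementation; it only groups the (row, column) positions of a small feature map the way window attention (contiguous tiles) and grid attention (strided positions) are described in the MaxViT paper. The function names, the 4 x 4 map, and the partition size of 2 are assumptions chosen for illustration.

```python
from collections import defaultdict


def window_groups(height, width, window):
    """Group positions into contiguous window x window tiles (local attention)."""
    groups = defaultdict(list)
    for row in range(height):
        for col in range(width):
            groups[(row // window, col // window)].append((row, col))
    return dict(groups)


def grid_groups(height, width, grid):
    """Group positions into grid x grid sets strided across the map (sparse global attention)."""
    stride_h, stride_w = height // grid, width // grid
    groups = defaultdict(list)
    for row in range(height):
        for col in range(width):
            groups[(row % stride_h, col % stride_w)].append((row, col))
    return dict(groups)


windows = window_groups(4, 4, window=2)
grids = grid_groups(4, 4, grid=2)

print(windows[(0, 0)])  # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(grids[(0, 0)])    # [(0, 0), (0, 2), (2, 0), (2, 2)]
```

Attention is computed within each group, so window groups mix neighbouring positions while grid groups mix positions spread over the whole map. Per the list above, MaxxViT-V2 keeps only the grid style.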

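Aside from the major variants listed above, there are more subtle changes from model to model. Any model name containing the string rw is a timm-specific config with modelling adjustments made to favour PyTorch eager use. These were created while training initial reproductions of the models, so there are variations. All models with the string tf exactly match the Tensorflow-based models by the original paper authors, with weights ported to PyTorch. This covers a number of MaxViT models. The official CoAtNet models were never released.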
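
The naming convention described above can be checked mechanically. The following is a minimal sketch using a hypothetical helper, `describe_origin`, that is not part of timm; it assumes the architecture name precedes the first `.` and that `rw` or `tf` appears as an underscore-separated token in it.

```python
def describe_origin(model_name):
    """Classify a model name by the rw / tf marker in its architecture part."""
    # Assumption: everything before the first '.' is the architecture name,
    # and everything after it is the pretrained-weights tag.
    arch, _, tag = model_name.partition(".")
    tokens = arch.split("_")
    if "rw" in tokens:
        origin = "timm-specific config (rw)"
    elif "tf" in tokens:
        origin = "ported from the original Tensorflow weights (tf)"
    else:
        origin = "no rw/tf marker"
    return {"arch": arch, "tag": tag, "origin": origin}


info = describe_origin("maxxvitv2_rmlp_base_rw_384.sw_in12k_ft_in1k")
print(info["arch"])    # maxxvitv2_rmlp_base_rw_384
print(info["tag"])     # sw_in12k_ft_in1k
print(info["origin"])  # timm-specific config (rw)
```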

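Model details

Model type: Image classification / feature backbone
Model stats:
- Params (M): 116.1
- GMACs: 73.0
- Activations (M): 213.7
- Image size: 384 x 384
Papers:
- MaxViT: Multi-Axis Vision Transformer: https://arxiv.org/abs/2204.01697
- A ConvNet for the 2020s: https://arxiv.org/abs/2201.03545
- Swin Transformer V2: Scaling Up Capacity and Resolution: https://arxiv.org/abs/2111.09883
Dataset: ImageNet-1k
Pretrain dataset: ImageNet-12k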

Model usage

Image classification

from urllib.request import urlopen
from PIL import Image
import timm
import torch  # needed for torch.topk below

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('maxxvitv2_rmlp_base_rw_384.sw_in12k_ft_in1k', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
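
The last line of the snippet above turns logits into percentages and keeps the five largest. A minimal sketch of that step on hand-made logits (no model download needed) is shown below; the logit values and class indices are made up for illustration.

```python
import torch

# Made-up logits for a batch of 1 image over 1000 classes (illustration only).
logits = torch.zeros(1, 1000)
logits[0, [3, 7, 42, 100, 999]] = torch.tensor([5.0, 4.0, 3.0, 2.0, 1.0])

# Same expression as in the snippet above: softmax over classes, scaled to percent.
top5_probabilities, top5_class_indices = torch.topk(logits.softmax(dim=1) * 100, k=5)

for prob, idx in zip(top5_probabilities[0].tolist(), top5_class_indices[0].tolist()):
    print(f"class {idx}: {prob:.2f}%")
```

Both returned tensors have shape (1, 5); row 0 holds the result for the single image in the batch, sorted from most to least likely.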

Feature map extraction

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'maxxvitv2_rmlp_base_rw_384.sw_in12k_ft_in1k',
    pretrained=True,
    features_only=True,
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

for o in output:
    # print shape of each feature map in output
    # e.g.:
    #  torch.Size([1, 128, 192, 192])
    #  torch.Size([1, 128, 96, 96])
    #  torch.Size([1, 256, 48, 48])
    #  torch.Size([1, 512, 24, 24])
    #  torch.Size([1, 1024, 12, 12])

    print(o.shape)
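
The example shapes in the comments above are consistent with feature maps taken at reductions of 2, 4, 8, 16 and 32 relative to the 384 x 384 input. The arithmetic can be checked with plain Python; the reduction factors and channel counts below are read off those comments rather than queried from the model.

```python
input_size = 384
# (channels, reduction) pairs inferred from the example shapes in the comments above.
stages = [(128, 2), (128, 4), (256, 8), (512, 16), (1024, 32)]

shapes = [
    (1, channels, input_size // reduction, input_size // reduction)
    for channels, reduction in stages
]

for shape in shapes:
    print(shape)
```

On a model created with `features_only=True`, `model.feature_info.channels()` and `model.feature_info.reduction()` should report these values directly.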

Image embeddings

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'maxxvitv2_rmlp_base_rw_384.sw_in12k_ft_in1k',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 1024, 12, 12) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
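
A common use of such embeddings is comparing images. The sketch below uses random tensors as stand-ins for two `(1, num_features)` outputs, with `num_features = 1024` assumed from the unpooled shape noted above; real embeddings from the model would be used the same way.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for model outputs; replace with real (1, num_features) embeddings.
num_features = 1024  # assumption, taken from the (1, 1024, 12, 12) unpooled shape
embedding_a = torch.randn(1, num_features)
embedding_b = torch.randn(1, num_features)

# Cosine similarity along the feature dimension, one value per batch item.
similarity = F.cosine_similarity(embedding_a, embedding_b, dim=1)
self_similarity = F.cosine_similarity(embedding_a, embedding_a, dim=1)

print(similarity.shape)        # torch.Size([1])
print(self_similarity.item())  # approximately 1.0
```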

Model comparison

By top-1

model	top1	top5	samples / sec	Params (M)	GMAC	Act (M)
12311321	88.53	98.64	21.76	475.77	534.14	1413.22
12312321	88.32	98.54	42.53	475.32	292.78	668.76
12313321	88.20	98.53	50.87	119.88	138.02	703.99
12314321	88.04	98.40	36.42	212.33	244.75	942.15
12315321	87.98	98.56	71.75	212.03	132.55	445.84
12316321	87.92	98.54	104.71	119.65	73.80	332.90
12317321	87.81	98.37	106.55	116.14	70.97	318.95
12318321	87.47	98.37	149.49	116.09	72.98	213.74
12319321	87.39	98.31	160.80	73.88	47.69	209.43
12320321	86.89	98.02	375.86	116.14	23.15	92.64
12321321	86.64	98.02	501.03	116.09	24.20	62.77
12322321	86.60	97.92	50.75	119.88	138.02	703.99
12323321	86.57	97.89	631.88	73.87	15.09	49.22
12324321	86.52	97.88	36.04	212.33	244.75	942.15
12325321	86.49	97.90	620.58	73.88	15.18	54.78
12326321	86.29	97.80	101.09	119.65	73.80	332.90
12327321	86.23	97.69	70.56	212.03	132.55	445.84
12328321	86.10	97.76	88.63	69.13	67.26	383.77
12329321	85.67	97.58	144.25	31.05	33.49	257.59
12330321	85.54	97.46	188.35	69.02	35.87	183.65
12331321	85.11	97.38	293.46	30.98	17.53	123.42
12332321	84.93	96.97	247.71	211.79	43.68	127.35
12333321	84.90	96.96	1025.45	41.72	8.11	40.13
12334321	84.85	96.99	358.25	119.47	24.04	95.01
12335321	84.63	97.06	575.53	66.01	14.67	58.38
12336321	84.61	96.74	625.81	73.88	15.18	54.78
12337321	84.49	96.76	693.82	64.90	10.75	49.30
12338321	84.43	96.83	647.96	68.93	11.66	53.17
12339321	84.23	96.78	807.21	29.15	6.77	46.92
12340321	83.62	96.38	989.59	41.72	8.04	34.60
12341321	83.50	96.50	1100.53	29.06	5.11	33.11
12342321	83.41	96.59	1004.94	30.92	5.60	35.78
12343321	83.36	96.45	1093.03	41.69	7.85	35.47
12344321	83.11	96.33	1276.88	23.70	6.26	23.05
12345321	83.03	96.34	1341.24	16.78	4.37	26.05
12346321	82.96	96.26	1283.24	15.50	4.47	31.92
12347321	82.93	96.23	1218.17	15.45	4.46	30.28
12348321	82.39	96.19	1600.14	27.44	4.67	22.04
12349321	82.39	95.84	1831.21	27.44	4.43	18.73
12350321	82.05	95.87	2109.09	15.15	2.62	20.34
12351321	81.95	95.92	2525.52	14.70	2.47	12.80
12352321	81.70	95.64	2344.52	15.14	2.41	15.41
12353321	80.53	95.21	1594.71	7.52	1.85	24.86

By throughput (samples / sec)

model	top1	top5	samples / sec	Params (M)	GMAC	Act (M)
12351321	81.95	95.92	2525.52	14.70	2.47	12.80
12352321	81.70	95.64	2344.52	15.14	2.41	15.41
12350321	82.05	95.87	2109.09	15.15	2.62	20.34
12349321	82.39	95.84	1831.21	27.44	4.43	18.73
12348321	82.39	96.19	1600.14	27.44	4.67	22.04
12353321	80.53	95.21	1594.71	7.52	1.85	24.86
12345321	83.03	96.34	1341.24	16.78	4.37	26.05
12346321	82.96	96.26	1283.24	15.50	4.47	31.92
12344321	83.11	96.33	1276.88	23.70	6.26	23.05
12347321	82.93	96.23	1218.17	15.45	4.46	30.28
12341321	83.50	96.50	1100.53	29.06	5.11	33.11
12343321	83.36	96.45	1093.03	41.69	7.85	35.47
12333321	84.90	96.96	1025.45	41.72	8.11	40.13
12342321	83.41	96.59	1004.94	30.92	5.60	35.78
12340321	83.62	96.38	989.59	41.72	8.04	34.60
12339321	84.23	96.78	807.21	29.15	6.77	46.92
12337321	84.49	96.76	693.82	64.90	10.75	49.30
12338321	84.43	96.83	647.96	68.93	11.66	53.17
12323321	86.57	97.89	631.88	73.87	15.09	49.22
12336321	84.61	96.74	625.81	73.88	15.18	54.78
12325321	86.49	97.90	620.58	73.88	15.18	54.78
12335321	84.63	97.06	575.53	66.01	14.67	58.38
12321321	86.64	98.02	501.03	116.09	24.20	62.77
12320321	86.89	98.02	375.86	116.14	23.15	92.64
12334321	84.85	96.99	358.25	119.47	24.04	95.01
12331321	85.11	97.38	293.46	30.98	17.53	123.42
12332321	84.93	96.97	247.71	211.79	43.68	127.35
12330321	85.54	97.46	188.35	69.02	35.87	183.65
12319321	87.39	98.31	160.80	73.88	47.69	209.43
12318321	87.47	98.37	149.49	116.09	72.98	213.74
12329321	85.67	97.58	144.25	31.05	33.49	257.59
12317321	87.81	98.37	106.55	116.14	70.97	318.95
12316321	87.92	98.54	104.71	119.65	73.80	332.90
12326321	86.29	97.80	101.09	119.65	73.80	332.90
12328321	86.10	97.76	88.63	69.13	67.26	383.77
12315321	87.98	98.56	71.75	212.03	132.55	445.84
12327321	86.23	97.69	70.56	212.03	132.55	445.84
12313321	88.20	98.53	50.87	119.88	138.02	703.99
12322321	86.60	97.92	50.75	119.88	138.02	703.99
12312321	88.32	98.54	42.53	475.32	292.78	668.76
12314321	88.04	98.40	36.42	212.33	244.75	942.15
12324321	86.52	97.88	36.04	212.33	244.75	942.15
12311321	88.53	98.64	21.76	475.77	534.14	1413.22
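
The two tables above contain the same rows in different orders. As a small illustration, the sketch below re-sorts a handful of rows copied from them (model identifiers kept exactly as they appear) by either column; the `Row` tuple and the choice of rows are only for this example.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    top1: float
    top5: float
    samples_per_sec: float
    params_m: float
    gmacs: float
    act_m: float


# A few rows copied verbatim from the tables above.
rows = [
    Row("12311321", 88.53, 98.64, 21.76, 475.77, 534.14, 1413.22),
    Row("12318321", 87.47, 98.37, 149.49, 116.09, 72.98, 213.74),
    Row("12333321", 84.90, 96.96, 1025.45, 41.72, 8.11, 40.13),
    Row("12351321", 81.95, 95.92, 2525.52, 14.70, 2.47, 12.80),
]

by_top1 = sorted(rows, key=lambda r: r.top1, reverse=True)
by_throughput = sorted(rows, key=lambda r: r.samples_per_sec, reverse=True)

print([r.model for r in by_top1])
print([r.model for r in by_throughput])
```

For these four rows the two orderings are exact reverses of each other, which broadly reflects the accuracy versus speed trade-off visible across the full tables.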

Citation

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}

@article{tu2022maxvit,
  title={MaxViT: Multi-Axis Vision Transformer},
  author={Tu, Zhengzhong and Talebi, Hossein and Zhang, Han and Yang, Feng and Milanfar, Peyman and Bovik, Alan and Li, Yinxiao},
  journal={ECCV},
  year={2022},
}

@article{dai2021coatnet,
  title={CoAtNet: Marrying Convolution and Attention for All Data Sizes},
  author={Dai, Zihang and Liu, Hanxiao and Le, Quoc V and Tan, Mingxing},
  journal={arXiv preprint arXiv:2106.04803},
  year={2021}
}

Author:

PyTorch Image Models

Dataset size:

886.04 MB