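Model card for maxvit_xlarge_tf_512.in21k_ft_in1k

An official MaxViT image classification model. Pretrained in Tensorflow on ImageNet-21k (the 21843-class, Google-specific instance of ImageNet-22k) and fine-tuned on ImageNet-1k by the paper authors.

Ported from the official Tensorflow implementation ( https://github.com/google-research/maxvit ) to PyTorch by Ross Wightman.

Model Variants in maxxvit.py

MaxxViT covers a number of related model architectures that share a common structure, including: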

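- CoAtNet - Combines MBConv (depthwise-separable) convolutional blocks in the early stages with self-attention transformer blocks in the later stages.
- MaxViT - Uses uniform blocks across all stages, each containing an MBConv (depthwise-separable) convolution block followed by two self-attention blocks with different partitioning schemes (window followed by grid).
- CoAtNeXt - A timm-specific architecture that uses ConvNeXt blocks in place of the MBConv blocks in CoAtNet. All normalization layers are LayerNorm (no BatchNorm).
- MaxxViT - A timm-specific architecture that uses ConvNeXt blocks in place of the MBConv blocks in MaxViT. All normalization layers are LayerNorm (no BatchNorm).
- MaxxViT-V2 - A MaxxViT variation that removes the window block attention, leaving only ConvNeXt blocks and grid attention, with more width to compensate.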
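The two partitioning schemes named for MaxViT can be illustrated on a tiny token grid. The following is a minimal sketch in plain Python, based on the description in the MaxViT paper rather than on the code in maxxvit.py; the helper names `window_groups` and `grid_groups` are hypothetical, and the grid dimensions are assumed to be divisible by the window and grid sizes.

```python
from collections import defaultdict


def window_groups(height, width, window):
    """Group token coordinates into non-overlapping window x window blocks."""
    groups = defaultdict(list)
    for i in range(height):
        for j in range(width):
            groups[(i // window, j // window)].append((i, j))
    return dict(groups)


def grid_groups(height, width, grid):
    """Group token coordinates into sets of grid x grid evenly strided tokens."""
    stride_h, stride_w = height // grid, width // grid
    groups = defaultdict(list)
    for i in range(height):
        for j in range(width):
            groups[(i % stride_h, j % stride_w)].append((i, j))
    return dict(groups)


# A 4 x 4 token map with window size 2 and grid size 2 (illustrative values).
windows = window_groups(4, 4, window=2)
grids = grid_groups(4, 4, grid=2)
print(windows[(0, 0)])  # adjacent tokens attend to each other (local)
print(grids[(0, 0)])    # strided tokens attend to each other (sparse, global)
```

In this sketch each window group holds neighbouring tokens, while each grid group holds tokens spread across the whole map, which is why applying the two in sequence mixes local and global information.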

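Beyond the major variants listed above, there are subtler differences from model to model. Model names containing the string rw are timm-specific configs with modelling adjustments made to favour PyTorch eager-mode use. They were created while training the initial reproductions of the models, so they vary from one another. All models with the string tf exactly match the Tensorflow-based models by the original paper authors, with weights ported to PyTorch. This covers a number of MaxViT models. The official CoAtNet models were never released.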
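The naming convention above can be checked programmatically. This is a small sketch, assuming the name is laid out as `architecture.pretrained_tag` and that `rw` or `tf` appears as an underscore-separated token; `describe_variant` is a hypothetical helper and not part of timm.

```python
def describe_variant(model_name):
    """Split a timm-style model name and report whether it is an rw or tf variant."""
    architecture, _, pretrained_tag = model_name.partition(".")
    tokens = architecture.split("_")
    if "tf" in tokens:
        origin = "ported from the original Tensorflow weights"
    elif "rw" in tokens:
        origin = "timm-specific config"
    else:
        origin = "unknown"
    return {
        "architecture": architecture,
        "pretrained_tag": pretrained_tag,
        "origin": origin,
    }


info = describe_variant("maxvit_xlarge_tf_512.in21k_ft_in1k")
print(info)

# "example_model_rw_224" is a made-up name used only to exercise the rw branch.
rw_info = describe_variant("example_model_rw_224")
print(rw_info["origin"])
```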

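Model Details

Model Type: Image classification / feature backbone
Model Stats:
- Params (M): 475.8
- GMACs: 534.1
- Activations (M): 1413.2
- Image size: 512 x 512
Papers:
- MaxViT: Multi-Axis Vision Transformer: https://arxiv.org/abs/2204.01697
Dataset: ImageNet-1k
Pretrain Dataset: ImageNet-21k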
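As a rough guide to what the parameter count means in practice, the weight storage can be estimated from the figure above. This is a back-of-the-envelope sketch that assumes every parameter is stored at a single precision and ignores buffers, activations, and file-format overhead; `weight_size_gib` is a hypothetical helper.

```python
PARAMS_MILLIONS = 475.8  # parameter count taken from the model stats above


def weight_size_gib(params_millions, bytes_per_param):
    """Estimate raw weight storage in GiB for a given precision."""
    return params_millions * 1e6 * bytes_per_param / 2**30


fp32_gib = weight_size_gib(PARAMS_MILLIONS, 4)  # 4 bytes per float32 value
fp16_gib = weight_size_gib(PARAMS_MILLIONS, 2)  # 2 bytes per float16 value
print(f"fp32: {fp32_gib:.2f} GiB, fp16: {fp16_gib:.2f} GiB")
```

Inference at 512 x 512 needs considerably more memory than this estimate, because the activations listed above must also be held during the forward pass.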

Model Usage

Image Classification

from urllib.request import urlopen
from PIL import Image
import timm
import torch  # needed for torch.topk below

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('maxvit_xlarge_tf_512.in21k_ft_in1k', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
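The last line above converts the logits to percentages with a softmax and keeps the largest entries. The same computation is sketched below in plain Python on a made-up four-element logit vector, to show what the returned values mean; `softmax_percent` and `top_k` are hypothetical helpers, not timm or PyTorch functions.

```python
import math


def softmax_percent(logits):
    """Softmax scaled to percentages, shifted by the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [100.0 * e / total for e in exps]


def top_k(values, k):
    """Return the k largest values and their indices, largest first."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in order], order


logits = [2.0, 0.5, 3.0, -1.0]  # illustrative logits, not real model output
probabilities = softmax_percent(logits)
top_probabilities, top_indices = top_k(probabilities, k=2)
print(top_indices, [round(p, 1) for p in top_probabilities])
```

With the real model, each returned index is a position in the 1000-class ImageNet-1k label list.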

Feature Map Extraction

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'maxvit_xlarge_tf_512.in21k_ft_in1k',
    pretrained=True,
    features_only=True,
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

for o in output:
    # print shape of each feature map in output
    # e.g.:
    #  torch.Size([1, 192, 256, 256])
    #  torch.Size([1, 192, 128, 128])
    #  torch.Size([1, 384, 64, 64])
    #  torch.Size([1, 768, 32, 32])
    #  torch.Size([1, 1536, 16, 16])

    print(o.shape)
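The shapes listed in the comment above imply the spatial reduction of each feature level relative to the 512 x 512 input. A short sketch that derives them, assuming exactly those shapes and a square input:

```python
INPUT_SIZE = 512  # input resolution from the model stats

# Feature map shapes copied from the example output above: (batch, channels, height, width).
feature_shapes = [
    (1, 192, 256, 256),
    (1, 192, 128, 128),
    (1, 384, 64, 64),
    (1, 768, 32, 32),
    (1, 1536, 16, 16),
]

# Reduction factor of each level: input height divided by feature map height.
reductions = [INPUT_SIZE // height for (_, _, height, _) in feature_shapes]
channels = [c for (_, c, _, _) in feature_shapes]

for level, (reduction, channel) in enumerate(zip(reductions, channels)):
    print(f"level {level}: stride {reduction}, {channel} channels")
```

These per-level strides and channel counts are what a downstream detection or segmentation head typically needs to know when it consumes the feature maps.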

Image Embeddings

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'maxvit_xlarge_tf_512.in21k_ft_in1k',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 1536, 16, 16) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor

Model Comparison

By Top-1

model	top1	top5	samples / sec	Params (M)	GMAC	Act (M)
1239321	88.53	98.64	21.76	475.77	534.14	1413.22
12310321	88.32	98.54	42.53	475.32	292.78	668.76
12311321	88.20	98.53	50.87	119.88	138.02	703.99
12312321	88.04	98.40	36.42	212.33	244.75	942.15
12313321	87.98	98.56	71.75	212.03	132.55	445.84
12314321	87.92	98.54	104.71	119.65	73.80	332.90
12315321	87.81	98.37	106.55	116.14	70.97	318.95
12316321	87.47	98.37	149.49	116.09	72.98	213.74
12317321	87.39	98.31	160.80	73.88	47.69	209.43
12318321	86.89	98.02	375.86	116.14	23.15	92.64
12319321	86.64	98.02	501.03	116.09	24.20	62.77
12320321	86.60	97.92	50.75	119.88	138.02	703.99
12321321	86.57	97.89	631.88	73.87	15.09	49.22
12322321	86.52	97.88	36.04	212.33	244.75	942.15
12323321	86.49	97.90	620.58	73.88	15.18	54.78
12324321	86.29	97.80	101.09	119.65	73.80	332.90
12325321	86.23	97.69	70.56	212.03	132.55	445.84
12326321	86.10	97.76	88.63	69.13	67.26	383.77
12327321	85.67	97.58	144.25	31.05	33.49	257.59
12328321	85.54	97.46	188.35	69.02	35.87	183.65
12329321	85.11	97.38	293.46	30.98	17.53	123.42
12330321	84.93	96.97	247.71	211.79	43.68	127.35
12331321	84.90	96.96	1025.45	41.72	8.11	40.13
12332321	84.85	96.99	358.25	119.47	24.04	95.01
12333321	84.63	97.06	575.53	66.01	14.67	58.38
12334321	84.61	96.74	625.81	73.88	15.18	54.78
12335321	84.49	96.76	693.82	64.90	10.75	49.30
12336321	84.43	96.83	647.96	68.93	11.66	53.17
12337321	84.23	96.78	807.21	29.15	6.77	46.92
12338321	83.62	96.38	989.59	41.72	8.04	34.60
12339321	83.50	96.50	1100.53	29.06	5.11	33.11
12340321	83.41	96.59	1004.94	30.92	5.60	35.78
12341321	83.36	96.45	1093.03	41.69	7.85	35.47
12342321	83.11	96.33	1276.88	23.70	6.26	23.05
12343321	83.03	96.34	1341.24	16.78	4.37	26.05
12344321	82.96	96.26	1283.24	15.50	4.47	31.92
12345321	82.93	96.23	1218.17	15.45	4.46	30.28
12346321	82.39	96.19	1600.14	27.44	4.67	22.04
12347321	82.39	95.84	1831.21	27.44	4.43	18.73
12348321	82.05	95.87	2109.09	15.15	2.62	20.34
12349321	81.95	95.92	2525.52	14.70	2.47	12.80
12350321	81.70	95.64	2344.52	15.14	2.41	15.41
12351321	80.53	95.21	1594.71	7.52	1.85	24.86

By Throughput (samples / sec)

model	top1	top5	samples / sec	Params (M)	GMAC	Act (M)
12349321	81.95	95.92	2525.52	14.70	2.47	12.80
12350321	81.70	95.64	2344.52	15.14	2.41	15.41
12348321	82.05	95.87	2109.09	15.15	2.62	20.34
12347321	82.39	95.84	1831.21	27.44	4.43	18.73
12346321	82.39	96.19	1600.14	27.44	4.67	22.04
12351321	80.53	95.21	1594.71	7.52	1.85	24.86
12343321	83.03	96.34	1341.24	16.78	4.37	26.05
12344321	82.96	96.26	1283.24	15.50	4.47	31.92
12342321	83.11	96.33	1276.88	23.70	6.26	23.05
12345321	82.93	96.23	1218.17	15.45	4.46	30.28
12339321	83.50	96.50	1100.53	29.06	5.11	33.11
12341321	83.36	96.45	1093.03	41.69	7.85	35.47
12331321	84.90	96.96	1025.45	41.72	8.11	40.13
12340321	83.41	96.59	1004.94	30.92	5.60	35.78
12338321	83.62	96.38	989.59	41.72	8.04	34.60
12337321	84.23	96.78	807.21	29.15	6.77	46.92
12335321	84.49	96.76	693.82	64.90	10.75	49.30
12336321	84.43	96.83	647.96	68.93	11.66	53.17
12321321	86.57	97.89	631.88	73.87	15.09	49.22
12334321	84.61	96.74	625.81	73.88	15.18	54.78
12323321	86.49	97.90	620.58	73.88	15.18	54.78
12333321	84.63	97.06	575.53	66.01	14.67	58.38
12319321	86.64	98.02	501.03	116.09	24.20	62.77
12318321	86.89	98.02	375.86	116.14	23.15	92.64
12332321	84.85	96.99	358.25	119.47	24.04	95.01
12329321	85.11	97.38	293.46	30.98	17.53	123.42
12330321	84.93	96.97	247.71	211.79	43.68	127.35
12328321	85.54	97.46	188.35	69.02	35.87	183.65
12317321	87.39	98.31	160.80	73.88	47.69	209.43
12316321	87.47	98.37	149.49	116.09	72.98	213.74
12327321	85.67	97.58	144.25	31.05	33.49	257.59
12315321	87.81	98.37	106.55	116.14	70.97	318.95
12314321	87.92	98.54	104.71	119.65	73.80	332.90
12324321	86.29	97.80	101.09	119.65	73.80	332.90
12326321	86.10	97.76	88.63	69.13	67.26	383.77
12313321	87.98	98.56	71.75	212.03	132.55	445.84
12325321	86.23	97.69	70.56	212.03	132.55	445.84
12311321	88.20	98.53	50.87	119.88	138.02	703.99
12320321	86.60	97.92	50.75	119.88	138.02	703.99
12310321	88.32	98.54	42.53	475.32	292.78	668.76
12312321	88.04	98.40	36.42	212.33	244.75	942.15
12322321	86.52	97.88	36.04	212.33	244.75	942.15
1239321	88.53	98.64	21.76	475.77	534.14	1413.22

Citation

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}

@article{tu2022maxvit,
  title={MaxViT: Multi-Axis Vision Transformer},
  author={Tu, Zhengzhong and Talebi, Hossein and Zhang, Han and Yang, Feng and Milanfar, Peyman and Bovik, Alan and Li, Yinxiao},
  journal={ECCV},
  year={2022},
}

@article{dai2021coatnet,
  title={CoAtNet: Marrying Convolution and Attention for All Data Sizes},
  author={Dai, Zihang and Liu, Hanxiao and Le, Quoc V and Tan, Mingxing},
  journal={arXiv preprint arXiv:2106.04803},
  year={2021}
}

Author:

PyTorch Image Models

Dataset size:

3.55 GB