
MetaFormer Baselines for Vision

Abstract

MetaFormer, the abstracted architecture of the Transformer, has been shown to play a significant role in achieving competitive performance. In this paper, we further explore the capacity of MetaFormer without focusing on token mixer design: we build several baseline models on MetaFormer using the most basic or common mixers, and summarize the following observations. (1) MetaFormer ensures a solid lower bound of performance. With only identity mapping as the token mixer, the resulting model, named IdentityFormer, achieves over 80% top-1 accuracy on ImageNet-1K. (2) MetaFormer works well with arbitrary token mixers. Even when the token mixer is set to a random matrix, the resulting model, RandFormer, still attains over 81% accuracy, outperforming IdentityFormer. This suggests that MetaFormer will keep delivering reliable results whatever new token mixers are adopted in the future. (3) MetaFormer effortlessly reaches state-of-the-art performance. With just conventional token mixers dating back five years, models built on MetaFormer already surpass the current state of the art. (a) ConvFormer outperforms ConvNeXt. Using common depthwise separable convolutions as the token mixer, the resulting model, ConvFormer, can be regarded as a pure CNN, yet it clearly outperforms the strong baseline ConvNeXt. (b) CAFormer sets a new record on ImageNet-1K. By applying depthwise separable convolutions as the token mixer in the bottom stages and vanilla self-attention in the top stages, the resulting CAFormer reaches 85.5% top-1 accuracy at 224×224 resolution under normal supervised training, without external data or knowledge distillation, setting a new record on ImageNet-1K. In our probing of MetaFormer, we also discover a new activation function, StarReLU, which reduces the FLOPs of the activation by 71% compared with GELU while achieving better performance. We expect StarReLU to show great potential in MetaFormer-like models and other neural network architectures.
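The MetaFormer abstraction above can be sketched in a few lines: the token mixer is an arbitrary, swappable sub-module, while the normalization, residual connections, and channel MLP are fixed. The toy layer norm, randomly initialized MLP weights, and mixer functions below are illustrative simplifications, not the paper's implementation.

```python
import numpy as np

def metaformer_block(x, token_mixer, mlp_ratio=4):
    """One MetaFormer block (sketch). x has shape (num_tokens, dim);
    token_mixer is any function mapping that shape to itself."""
    def layer_norm(z):
        return (z - z.mean(-1, keepdims=True)) / (z.std(-1, keepdims=True) + 1e-6)

    # Token-mixing sub-block: residual around the pluggable mixer.
    x = x + token_mixer(layer_norm(x))

    # Channel-MLP sub-block with toy random weights (illustration only).
    n, d = x.shape
    rng = np.random.default_rng(0)
    w1 = rng.standard_normal((d, d * mlp_ratio)) / np.sqrt(d)
    w2 = rng.standard_normal((d * mlp_ratio, d)) / np.sqrt(d * mlp_ratio)
    x = x + np.maximum(layer_norm(x) @ w1, 0.0) @ w2
    return x

# IdentityFormer's mixer: identity mapping, i.e. no token mixing at all.
def identity_mixer(z):
    return z

# RandFormer-style mixer: a frozen random matrix mixing along the token axis.
def random_mixer(z, rng=np.random.default_rng(1)):
    m = rng.standard_normal((z.shape[0], z.shape[0])) / np.sqrt(z.shape[0])
    return m @ z
```

Swapping `identity_mixer` for `random_mixer` (or attention, pooling, convolution) changes nothing else in the block, which is exactly the point of the IdentityFormer/RandFormer experiments.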
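The StarReLU activation mentioned at the end of the abstract takes the form s · ReLU(x)² + b, with a learnable scale s and bias b; squaring a ReLU is far cheaper than the erf-based GELU, which is where the FLOPs saving comes from. A minimal sketch, using default constants derived under the assumption of standard-normal input (so the output is roughly zero-mean, unit-variance):

```python
import numpy as np

def star_relu(x, scale=0.8944, bias=-0.4472):
    """StarReLU sketch: scale * relu(x)**2 + bias.
    In the paper, scale and bias are learnable scalars; the defaults here
    are the zero-mean, unit-variance constants for standard-normal input."""
    return scale * np.square(np.maximum(x, 0.0)) + bias
```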

Benchmarks

Domain Generalization on ImageNet-A

| Method | Params | Top-1 accuracy (%) |
| --- | --- | --- |
| ConvFormer-B36 | 100M | 40.1 |
| ConvFormer-B36 (IN-21K) | 100M | 63.3 |
| ConvFormer-B36 (384) | 100M | 55.3 |
| ConvFormer-B36 (IN-21K, 384) | 100M | 73.5 |
| CAFormer-B36 | 99M | 48.5 |
| CAFormer-B36 (IN-21K) | 99M | 69.4 |
| CAFormer-B36 (384) | 99M | 61.9 |
| CAFormer-B36 (IN-21K, 384) | 99M | 79.5 |
Domain Generalization on ImageNet-C

| Method | Params | mean Corruption Error (mCE, ↓) |
| --- | --- | --- |
| ConvFormer-B36 | 100M | 46.3 |
| ConvFormer-B36 (IN21K) | 100M | 35.0 |
| CAFormer-B36 | 99M | 42.6 |
| CAFormer-B36 (IN21K) | 99M | 31.8 |
| CAFormer-B36 (IN21K, 384) | 99M | 30.8 |
Domain Generalization on ImageNet-R

| Method | Top-1 Error Rate (%, ↓) |
| --- | --- |
| ConvFormer-B36 | 48.9 |
| ConvFormer-B36 (IN21K) | 34.7 |
| ConvFormer-B36 (384) | 47.8 |
| ConvFormer-B36 (IN21K, 384) | 33.5 |
| CAFormer-B36 | 46.1 |
| CAFormer-B36 (IN21K) | 31.7 |
| CAFormer-B36 (384) | 45.0 |
| CAFormer-B36 (IN21K, 384) | 29.6 |
Domain Generalization on ImageNet-Sketch

| Method | Top-1 accuracy (%) |
| --- | --- |
| ConvFormer-B36 | 39.5 |
| ConvFormer-B36 (IN21K) | 52.7 |
| ConvFormer-B36 (IN21K, 384) | 52.9 |
| CAFormer-B36 | 42.5 |
| CAFormer-B36 (IN21K) | 52.8 |
| CAFormer-B36 (IN21K, 384) | 54.5 |
Image Classification on ImageNet

| Method | GFLOPs | Params | Top-1 Accuracy |
| --- | --- | --- | --- |
| ConvFormer-S18 (224 res) | 3.9 | 27M | 83.0% |
| ConvFormer-S18 (224 res, 21K) | 3.9 | 27M | 83.7% |
| ConvFormer-S18 (384 res) | 11.6 | 27M | 84.4% |
| ConvFormer-S18 (384 res, 21K) | 11.6 | 27M | 85.0% |
| ConvFormer-S36 (224 res) | 7.6 | 40M | 84.1% |
| ConvFormer-S36 (224 res, 21K) | 7.6 | 40M | 85.4% |
| ConvFormer-S36 (384 res) | 22.4 | 40M | 85.4% |
| ConvFormer-S36 (384 res, 21K) | 22.4 | 40M | 86.4% |
| ConvFormer-M36 (224 res) | 12.8 | 57M | 84.5% |
| ConvFormer-M36 (224 res, 21K) | 12.8 | 57M | 86.1% |
| ConvFormer-M36 (384 res) | 37.7 | 57M | 85.6% |
| ConvFormer-M36 (384 res, 21K) | 37.7 | 57M | 86.9% |
| ConvFormer-B36 (224 res) | 22.6 | 100M | 84.8% |
| ConvFormer-B36 (224 res, 21K) | 22.6 | 100M | 87.0% |
| ConvFormer-B36 (384 res) | 66.5 | 100M | 85.7% |
| ConvFormer-B36 (384 res, 21K) | 66.5 | 100M | 87.6% |
| CAFormer-S18 (224 res) | 4.1 | 26M | 83.6% |
| CAFormer-S18 (224 res, 21K) | 4.1 | 26M | 84.1% |
| CAFormer-S18 (384 res) | 13.4 | 26M | 85.0% |
| CAFormer-S18 (384 res, 21K) | 13.4 | 26M | 85.4% |
| CAFormer-S36 (224 res) | 8.0 | 39M | 84.5% |
| CAFormer-S36 (224 res, 21K) | 8.0 | 39M | 85.8% |
| CAFormer-S36 (384 res) | 26.0 | 39M | 85.7% |
| CAFormer-S36 (384 res, 21K) | 26.0 | 39M | 86.9% |
| CAFormer-M36 (224 res) | 13.2 | 56M | 85.2% |
| CAFormer-M36 (224 res, 21K) | 13.2 | 56M | 86.6% |
| CAFormer-M36 (384 res) | 42.0 | 56M | 86.2% |
| CAFormer-M36 (384 res, 21K) | 42.0 | 56M | 87.5% |
| CAFormer-B36 (224 res) | 23.2 | 99M | 85.5% |
| CAFormer-B36 (224 res, 21K) | 23.2 | 99M | 87.4% |
| CAFormer-B36 (384 res) | 72.2 | 99M | 86.4% |
| CAFormer-B36 (384 res, 21K) | 72.2 | 99M | 88.1% |
