
Abstract
MetaFormer, the abstracted architecture of the Transformer, has been shown to play a significant role in achieving competitive performance. In this work we further probe the capacity of MetaFormer, again without focusing on token-mixer design: we build several baseline models on top of MetaFormer using the most basic or common mixers and summarize the following observations. (1) MetaFormer ensures a solid lower bound of performance. With only identity mapping as the token mixer, the resulting model, IdentityFormer, already exceeds 80% top-1 accuracy on ImageNet-1K. (2) MetaFormer works well with arbitrary token mixers. Even when the token mixer is set to a random matrix, the resulting model, RandFormer, still reaches over 81% accuracy, outperforming IdentityFormer. This suggests that MetaFormer can deliver reliable results no matter what new token mixers are adopted in the future. (3) MetaFormer effortlessly offers state-of-the-art performance. With merely conventional token mixers dating back five years, models built on MetaFormer already surpass the existing state of the art. (a) ConvFormer outperforms ConvNeXt. Using common depthwise separable convolutions as the token mixer, the resulting model ConvFormer, which can be regarded as a pure CNN, still clearly outperforms the strong baseline ConvNeXt. (b) CAFormer sets a new record on ImageNet-1K. Using depthwise separable convolutions as the token mixer in the bottom stages and vanilla self-attention in the top stages, the resulting CAFormer reaches 85.5% top-1 accuracy at 224×224 resolution under normal supervised training, without external data or knowledge distillation, setting a new record on ImageNet-1K. While probing MetaFormer, we also find a new activation function, StarReLU, which reduces activation FLOPs by 71% compared with GELU while achieving better performance. We expect StarReLU to find broad applicability in MetaFormer-like models and other neural networks.
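To make the abstraction concrete, below is a minimal PyTorch sketch of a MetaFormer block with a pluggable token mixer and the StarReLU activation. It is an illustrative reimplementation, not the official code: the class names, the softmax-normalized random mixing matrix, and the StarReLU scale/bias initialization here are simplifying assumptions (the paper derives specific initial values from a unit-variance argument).

```python
import torch
import torch.nn as nn


class StarReLU(nn.Module):
    """StarReLU(x) = s * relu(x)^2 + b with learnable scalars s and b.
    Squaring a ReLU is far cheaper than GELU, which is where the reported
    ~71% activation-FLOPs saving comes from."""
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1))   # the paper derives specific init
        self.bias = nn.Parameter(torch.zeros(1))   # values; 1/0 keeps this sketch simple

    def forward(self, x):
        return self.scale * torch.relu(x) ** 2 + self.bias


class RandomMixing(nn.Module):
    """RandFormer-style token mixer: a frozen random matrix mixes the
    N tokens of a (B, N, C) sequence."""
    def __init__(self, num_tokens):
        super().__init__()
        matrix = torch.softmax(torch.rand(num_tokens, num_tokens), dim=-1)
        self.register_buffer("matrix", matrix)      # fixed, never trained

    def forward(self, x):                            # x: (B, N, C)
        return torch.einsum("mn,bnc->bmc", self.matrix, x)


class MetaFormerBlock(nn.Module):
    """One MetaFormer block; the token mixer is a pluggable module.
    nn.Identity() gives an IdentityFormer block, RandomMixing a RandFormer
    block; swapping in depthwise separable convolution or self-attention
    yields ConvFormer / CAFormer-style blocks."""
    def __init__(self, dim, token_mixer):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mixer = token_mixer
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), StarReLU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):                            # x: (B, N, C)
        x = x + self.token_mixer(self.norm1(x))      # token mixing sub-block
        x = x + self.mlp(self.norm2(x))              # channel MLP sub-block
        return x


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 64)                 # 14x14 grid of 64-dim tokens
    identity_block = MetaFormerBlock(64, nn.Identity())
    rand_block = MetaFormerBlock(64, RandomMixing(num_tokens=196))
    print(identity_block(tokens).shape, rand_block(tokens).shape)
```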
Code Repositories
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| domain-generalization-on-imagenet-a | CAFormer-B36 (IN-21K) | Params: 99M, Top-1 accuracy: 69.4% |
| domain-generalization-on-imagenet-a | CAFormer-B36 | Params: 99M, Top-1 accuracy: 48.5% |
| domain-generalization-on-imagenet-a | CAFormer-B36 (IN-21K, 384) | Params: 99M, Top-1 accuracy: 79.5% |
| domain-generalization-on-imagenet-a | ConvFormer-B36 (384) | Params: 100M, Top-1 accuracy: 55.3% |
| domain-generalization-on-imagenet-a | ConvFormer-B36 (IN-21K) | Params: 100M, Top-1 accuracy: 63.3% |
| domain-generalization-on-imagenet-a | CAFormer-B36 (384) | Params: 99M, Top-1 accuracy: 61.9% |
| domain-generalization-on-imagenet-a | ConvFormer-B36 (IN-21K, 384) | Params: 100M, Top-1 accuracy: 73.5% |
| domain-generalization-on-imagenet-a | ConvFormer-B36 | Params: 100M, Top-1 accuracy: 40.1% |
| domain-generalization-on-imagenet-c | CAFormer-B36 (IN-21K, 384) | Params: 99M, mean Corruption Error (mCE): 30.8 |
| domain-generalization-on-imagenet-c | CAFormer-B36 (IN-21K) | mean Corruption Error (mCE): 31.8 |
| domain-generalization-on-imagenet-c | ConvFormer-B36 | mean Corruption Error (mCE): 46.3 |
| domain-generalization-on-imagenet-c | CAFormer-B36 | mean Corruption Error (mCE): 42.6 |
| domain-generalization-on-imagenet-c | ConvFormer-B36 (IN-21K) | mean Corruption Error (mCE): 35.0 |
| domain-generalization-on-imagenet-r | ConvFormer-B36 | Top-1 error rate: 48.9 |
| domain-generalization-on-imagenet-r | ConvFormer-B36 (384) | Top-1 error rate: 47.8 |
| domain-generalization-on-imagenet-r | ConvFormer-B36 (IN-21K, 384) | Top-1 error rate: 33.5 |
| domain-generalization-on-imagenet-r | CAFormer-B36 | Top-1 error rate: 46.1 |
| domain-generalization-on-imagenet-r | CAFormer-B36 (384) | Top-1 error rate: 45.0 |
| domain-generalization-on-imagenet-r | CAFormer-B36 (IN-21K) | Top-1 error rate: 31.7 |
| domain-generalization-on-imagenet-r | CAFormer-B36 (IN-21K, 384) | Top-1 error rate: 29.6 |
| domain-generalization-on-imagenet-r | ConvFormer-B36 (IN-21K) | Top-1 error rate: 34.7 |
| domain-generalization-on-imagenet-sketch | CAFormer-B36 (IN-21K, 384) | Top-1 accuracy: 54.5 |
| domain-generalization-on-imagenet-sketch | ConvFormer-B36 (IN-21K, 384) | Top-1 accuracy: 52.9 |
| domain-generalization-on-imagenet-sketch | CAFormer-B36 | Top-1 accuracy: 42.5 |
| domain-generalization-on-imagenet-sketch | ConvFormer-B36 | Top-1 accuracy: 39.5 |
| domain-generalization-on-imagenet-sketch | CAFormer-B36 (IN-21K) | Top-1 accuracy: 52.8 |
| domain-generalization-on-imagenet-sketch | ConvFormer-B36 (IN-21K) | Top-1 accuracy: 52.7 |
| image-classification-on-imagenet | ConvFormer-S36 (224 res, 21K) | GFLOPs: 7.6, Params: 40M, Top-1 accuracy: 85.4% |
| image-classification-on-imagenet | CAFormer-M36 (224 res) | GFLOPs: 13.2, Params: 56M, Top-1 accuracy: 85.2% |
| image-classification-on-imagenet | ConvFormer-S18 (384 res, 21K) | GFLOPs: 11.6, Params: 27M, Top-1 accuracy: 85.0% |
| image-classification-on-imagenet | ConvFormer-S36 (384 res, 21K) | GFLOPs: 22.4, Params: 40M, Top-1 accuracy: 86.4% |
| image-classification-on-imagenet | CAFormer-S36 (224 res) | GFLOPs: 8.0, Params: 39M, Top-1 accuracy: 84.5% |
| image-classification-on-imagenet | CAFormer-S36 (224 res, 21K) | GFLOPs: 8.0, Params: 39M, Top-1 accuracy: 85.8% |
| image-classification-on-imagenet | CAFormer-S18 (224 res) | GFLOPs: 4.1, Params: 26M, Top-1 accuracy: 83.6% |
| image-classification-on-imagenet | ConvFormer-B36 (384 res) | GFLOPs: 66.5, Params: 100M, Top-1 accuracy: 85.7% |
| image-classification-on-imagenet | ConvFormer-M36 (224 res) | GFLOPs: 12.8, Params: 57M, Top-1 accuracy: 84.5% |
| image-classification-on-imagenet | ConvFormer-S36 (224 res) | GFLOPs: 7.6, Params: 40M, Top-1 accuracy: 84.1% |
| image-classification-on-imagenet | ConvFormer-S18 (224 res) | GFLOPs: 3.9, Params: 27M, Top-1 accuracy: 83.0% |
| image-classification-on-imagenet | ConvFormer-B36 (384 res, 21K) | GFLOPs: 66.5, Params: 100M, Top-1 accuracy: 87.6% |
| image-classification-on-imagenet | ConvFormer-S18 (224 res, 21K) | GFLOPs: 3.9, Params: 27M, Top-1 accuracy: 83.7% |
| image-classification-on-imagenet | CAFormer-S18 (384 res) | GFLOPs: 13.4, Params: 26M, Top-1 accuracy: 85.0% |
| image-classification-on-imagenet | ConvFormer-M36 (224 res, 21K) | GFLOPs: 12.8, Params: 57M, Top-1 accuracy: 86.1% |
| image-classification-on-imagenet | CAFormer-S18 (384 res, 21K) | GFLOPs: 13.4, Params: 26M, Top-1 accuracy: 85.4% |
| image-classification-on-imagenet | CAFormer-B36 (384 res) | GFLOPs: 72.2, Params: 99M, Top-1 accuracy: 86.4% |
| image-classification-on-imagenet | CAFormer-M36 (224 res, 21K) | GFLOPs: 13.2, Params: 56M, Top-1 accuracy: 86.6% |
| image-classification-on-imagenet | ConvFormer-S36 (384 res) | GFLOPs: 22.4, Params: 40M, Top-1 accuracy: 85.4% |
| image-classification-on-imagenet | CAFormer-S36 (384 res, 21K) | GFLOPs: 26.0, Params: 39M, Top-1 accuracy: 86.9% |
| image-classification-on-imagenet | CAFormer-S18 (224 res, 21K) | GFLOPs: 4.1, Params: 26M, Top-1 accuracy: 84.1% |
| image-classification-on-imagenet | CAFormer-M36 (384 res, 21K) | GFLOPs: 42.0, Params: 56M, Top-1 accuracy: 87.5% |
| image-classification-on-imagenet | CAFormer-M36 (384 res) | GFLOPs: 42.0, Params: 56M, Top-1 accuracy: 86.2% |
| image-classification-on-imagenet | ConvFormer-B36 (224 res, 21K) | GFLOPs: 22.6, Params: 100M, Top-1 accuracy: 87.0% |
| image-classification-on-imagenet | ConvFormer-B36 (224 res) | GFLOPs: 22.6, Params: 100M, Top-1 accuracy: 84.8% |
| image-classification-on-imagenet | ConvFormer-M36 (384 res, 21K) | GFLOPs: 37.7, Params: 57M, Top-1 accuracy: 86.9% |
| image-classification-on-imagenet | ConvFormer-S18 (384 res) | GFLOPs: 11.6, Params: 27M, Top-1 accuracy: 84.4% |
| image-classification-on-imagenet | CAFormer-B36 (224 res) | GFLOPs: 23.2, Params: 99M, Top-1 accuracy: 85.5% |
| image-classification-on-imagenet | CAFormer-B36 (384 res, 21K) | GFLOPs: 72.2, Params: 99M, Top-1 accuracy: 88.1% |
| image-classification-on-imagenet | CAFormer-B36 (224 res, 21K) | GFLOPs: 23.2, Params: 99M, Top-1 accuracy: 87.4% |
| image-classification-on-imagenet | ConvFormer-M36 (384 res) | GFLOPs: 37.7, Params: 57M, Top-1 accuracy: 85.6% |
| image-classification-on-imagenet | CAFormer-S36 (384 res) | GFLOPs: 26.0, Params: 39M, Top-1 accuracy: 85.7% |
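The entries above correspond to the released ImageNet checkpoints. As a quick way to run one of them, the sketch below loads a model through timm; it assumes a recent timm release that registers the MetaFormer checkpoints under identifiers such as `caformer_s18`, which should be verified against the code repository.

```python
# Hedged sketch: assumes a recent timm (>= 0.9) registers the MetaFormer
# checkpoints under names like "caformer_s18"; check the exact identifiers
# available in your installation before relying on this.
import timm
import torch

model = timm.create_model("caformer_s18", pretrained=True)  # assumed model id
model.eval()

# Build the preprocessing pipeline the checkpoint was trained with.
cfg = timm.data.resolve_data_config({}, model=model)
transform = timm.data.create_transform(**cfg)

# A real use would apply `transform` to a PIL image; a random tensor of the
# expected shape stands in here so the snippet runs without a dataset.
x = torch.randn(1, *cfg["input_size"])
with torch.no_grad():
    probs = model(x).softmax(dim=-1)
print(probs.argmax(dim=-1))
```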