
Abstract
Masked Image Modeling (MIM), an emerging self-supervised pre-training method, has shown impressive performance with Vision Transformers on numerous downstream vision tasks. Its underlying idea is simple: a portion of the input image is masked and then reconstructed through a pre-text task. However, the working principle behind MIM is not well explained, and previous studies have generally held that MIM primarily works for Transformer architectures and is hard to reconcile with convolutional neural networks (CNNs). In this work, we observe that MIM essentially teaches the model to learn better middle-order interactions among image patches, yielding more generalized feature extraction. Building on this, we propose an Architecture-Agnostic Masked Image Modeling framework (A²MIM) that is compatible with both Transformers and CNNs in a unified way. Extensive experiments on popular benchmarks show that A²MIM learns better representations without explicit design and endows the backbone network with a stronger capability to transfer to various downstream tasks.
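The masking step the abstract describes can be sketched in a few lines: split the image into non-overlapping patches, randomly select a fraction of them, and zero them out so the model must reconstruct the hidden content. This is a minimal illustrative sketch with NumPy; the patch size, mask ratio, and zero fill are assumptions for illustration, not the exact settings used by A²MIM.

```python
import numpy as np

def random_patch_mask(image, patch_size=16, mask_ratio=0.6, seed=0):
    """Randomly mask a fraction of non-overlapping patches of an image.

    A minimal sketch of the masking step in Masked Image Modeling (MIM).
    `patch_size` and `mask_ratio` are illustrative defaults, not the
    paper's exact hyperparameters.
    """
    h, w, _ = image.shape
    gh, gw = h // patch_size, w // patch_size      # patch grid dimensions
    n_patches = gh * gw
    n_masked = int(n_patches * mask_ratio)

    # Pick which patches to mask, uniformly at random.
    rng = np.random.default_rng(seed)
    masked_idx = rng.permutation(n_patches)[:n_masked]

    mask = np.zeros(n_patches, dtype=bool)
    mask[masked_idx] = True
    mask = mask.reshape(gh, gw)

    # Zero out the selected patches; the original image is the
    # reconstruction target for the pre-text task.
    masked = image.copy()
    for i in range(gh):
        for j in range(gw):
            if mask[i, j]:
                masked[i * patch_size:(i + 1) * patch_size,
                       j * patch_size:(j + 1) * patch_size] = 0.0
    return masked, mask

img = np.random.rand(224, 224, 3).astype(np.float32)
masked_img, mask = random_patch_mask(img)
print(mask.shape, int(mask.sum()))  # 14x14 patch grid, 117 patches masked
```

During pre-training, the model receives `masked_img` and is trained to reconstruct the original pixels (or features) inside the masked patches.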
Code Repositories
- Westlake-AI/openmixup (official, PyTorch; mentioned in GitHub)
- open-mmlab/mmpretrain (PyTorch)
- Westlake-AI/A2MIM (official, PyTorch; mentioned in GitHub)
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| instance-segmentation-on-coco | A2MIM (ViT-B) | mask AP: 43.5 |
| instance-segmentation-on-coco | A2MIM (ResNet-50 2x) | mask AP: 34.9 |
| object-detection-on-coco | A2MIM (ViT-B) | box mAP: 49.4 |
| object-detection-on-coco | A2MIM (ResNet-50 2x) | box mAP: 39.8 |
| self-supervised-image-classification-on-1 | A2MIM (ResNet-50 RSB-A2) | Top 1 Accuracy: 80.4% |
| self-supervised-image-classification-on-1 | A2MIM+ (ViT-B) | Top 1 Accuracy: 84.5% |
| self-supervised-image-classification-on-1 | A2MIM+ (ViT-S) | Top 1 Accuracy: 82.4% |
| self-supervised-image-classification-on-1 | A2MIM (ViT-B) | Top 1 Accuracy: 84.2% |
| self-supervised-image-classification-on-1 | A2MIM+ (ResNet-50 RSB-A3) | Top 1 Accuracy: 78.9% |
| self-supervised-image-classification-on-1 | A2MIM (ResNet-50 RSB-A3) | Top 1 Accuracy: 78.8% |
| self-supervised-image-classification-on-1 | A2MIM+ (ResNet-50 RSB-A2) | Top 1 Accuracy: 80.5% |
| self-supervised-image-classification-on-1 | A2MIM (ViT-S) | Top 1 Accuracy: 82.2% |
| semantic-segmentation-on-ade20k | A2MIM (ResNet-50) | Validation mIoU: 38.3 |
| semantic-segmentation-on-ade20k | A2MIM (ViT-B) | Validation mIoU: 49.0 |