Zhenda Xie; Zigang Geng; Jingcheng Hu; Zheng Zhang; Han Hu; Yue Cao

Abstract
Masked image modeling (MIM) as pre-training has been shown to be effective for numerous downstream vision tasks, but how and where MIM works remains unclear. In this paper, we compare MIM with the long-dominant supervised pre-trained models from two perspectives, visualizations and experiments, to uncover their key representational differences. From the visualizations, we find that MIM brings a locality inductive bias to all layers of the trained models, whereas supervised models tend to focus locally at lower layers but more globally at higher layers. This may be why MIM helps Vision Transformers, which have a very large receptive field, to optimize. Using MIM, the model maintains a large diversity across attention heads in all layers; for supervised models, this diversity almost disappears in the last three layers, and lower diversity harms fine-tuning performance. From the experiments, we find that MIM models can perform significantly better than their supervised counterparts on geometric and motion tasks with weak semantics and on fine-grained classification tasks. Without bells and whistles, a standard MIM pre-trained SwinV2-L achieves state-of-the-art performance on pose estimation (78.9 AP on COCO test-dev and 78.0 AP on CrowdPose), depth estimation (0.287 RMSE on NYUv2 and 1.966 RMSE on KITTI), and video object tracking (70.7 SUC on LaSOT). On semantic understanding datasets whose categories are sufficiently covered by supervised pre-training, MIM models can still achieve highly competitive transfer performance. By deepening the understanding of MIM, we hope our work can inspire new and solid research in this direction.
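The locality finding above concerns how far each attention head attends from its query position. As a rough, hedged illustration of how such locality could be quantified (this is not the paper's actual analysis code; the function name, tensor layout, and 14x14 patch grid are assumptions), the sketch below computes a per-head average attention distance from a Transformer attention map:

```python
# Minimal sketch, assuming attention probabilities over a square patch grid.
import torch

def average_attention_distance(attn, grid_size, patch_size=16):
    """attn: [num_heads, num_patches, num_patches] attention probabilities
    (CLS token removed). Returns one average distance per head, in pixels."""
    h = w = grid_size
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # [N, 2]
    # Pairwise Euclidean distances between patch centers, in pixel units.
    dist = torch.cdist(coords, coords) * patch_size                     # [N, N]
    # Expected attended distance per query, averaged over all query positions.
    return (attn * dist.unsqueeze(0)).sum(-1).mean(-1)                  # [num_heads]

# Example: random attention for a 14x14 patch grid (224x224 image, 16x16 patches).
attn = torch.rand(12, 196, 196).softmax(dim=-1)
print(average_attention_distance(attn, grid_size=14))
```

Under this kind of metric, heads with small average distances attend locally and heads with large distances attend globally; comparing the spread of these values across layers is one way to visualize the locality and head-diversity differences the abstract describes.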
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| depth-estimation-on-nyu-depth-v2 | SwinV2-B 1K-MIM | RMS: 0.304 |
| depth-estimation-on-nyu-depth-v2 | SwinV2-L 1K-MIM | RMS: 0.287 |
| monocular-depth-estimation-on-kitti-eigen | SwinV2-B 1K-MIM | Delta < 1.25: 0.976, Delta < 1.25^2: 0.998, Delta < 1.25^3: 0.999, RMSE: 2.050, RMSE log: 0.078, Sq Rel: 0.148, absolute relative error: 0.052 |
| monocular-depth-estimation-on-kitti-eigen | SwinV2-L 1K-MIM | Delta < 1.25: 0.977, Delta < 1.25^2: 0.998, Delta < 1.25^3: 1.000, RMSE: 1.966, RMSE log: 0.075, Sq Rel: 0.139, absolute relative error: 0.050 |
| monocular-depth-estimation-on-nyu-depth-v2 | SwinV2-L 1K-MIM | Delta < 1.25: 0.949, Delta < 1.25^2: 0.994, Delta < 1.25^3: 0.999, RMSE: 0.287, absolute relative error: 0.083, log 10: 0.035 |
| pose-estimation-on-coco-test-dev | SwinV2-L 1K-MIM | AP: 77.2 |
| pose-estimation-on-coco-test-dev | SwinV2-B 1K-MIM | AP: 76.7 |
| pose-estimation-on-crowdpose | SwinV2-L 1K-MIM | AP: 75.5 |
| pose-estimation-on-crowdpose | SwinV2-B 1K-MIM | AP: 74.9 |
| visual-object-tracking-on-got-10k | SwinV2-B 1K-MIM | Average Overlap: 70.8 |
| visual-object-tracking-on-got-10k | SwinV2-L 1K-MIM | Average Overlap: 72.9 |
| visual-object-tracking-on-lasot | SwinV2-B 1K-MIM | AUC: 70 |
| visual-object-tracking-on-lasot | SwinV2-L 1K-MIM | AUC: 70.7 |