VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Zhan Tong, Yibing Song, Jue Wang, Limin Wang

Abstract
Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). Inspired by the recent ImageMAE, we propose customized video tube masking with an extremely high ratio. This simple design makes video reconstruction a more challenging self-supervision task, thus encouraging the extraction of more effective video representations during pre-training. We obtain three important findings on SSVP: (1) An extremely high masking ratio (i.e., 90% to 95%) still yields favorable performance for VideoMAE; the temporal redundancy of video content enables a higher masking ratio than for images. (2) VideoMAE achieves impressive results on very small datasets (i.e., around 3k-4k videos) without using any extra data. (3) VideoMAE shows that data quality is more important than data quantity for SSVP, and domain shift between pre-training and target datasets is an important issue. Notably, our VideoMAE with the vanilla ViT achieves 87.4% on Kinetics-400, 75.4% on Something-Something V2, 91.3% on UCF101, and 62.6% on HMDB51, without using any extra data. Code is available at https://github.com/MCG-NJU/VideoMAE.
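As a rough illustration of the tube masking described in the abstract, the sketch below (PyTorch, with assumed token shapes and a hypothetical `tube_mask` helper, not the authors' implementation) samples one spatial mask and repeats it along the temporal axis, so roughly 90% of the space-time cubes are hidden in every temporal slice and masked content cannot be trivially recovered from neighboring frames.

```python
import torch

def tube_mask(num_temporal_tokens, patches_per_frame, mask_ratio=0.9):
    """Sample a tube mask: the same spatial positions are masked in every
    temporal slice. Returns a boolean mask of shape
    (num_temporal_tokens, patches_per_frame), where True means masked."""
    num_masked = int(mask_ratio * patches_per_frame)
    # Random spatial positions to mask, shared across all temporal slices.
    noise = torch.rand(patches_per_frame)
    masked_positions = noise.argsort()[:num_masked]
    frame_mask = torch.zeros(patches_per_frame, dtype=torch.bool)
    frame_mask[masked_positions] = True
    # Repeat the same spatial mask along the temporal axis ("tube").
    return frame_mask.unsqueeze(0).expand(num_temporal_tokens, -1)

# Example (assumed shapes): 16 input frames with temporal patch size 2
# give 8 temporal slices; 14x14 = 196 spatial patches per slice; 90% masking.
mask = tube_mask(num_temporal_tokens=8, patches_per_frame=196, mask_ratio=0.9)
print(mask.shape, mask.float().mean())  # torch.Size([8, 196]), ~0.90
```

In an MAE-style pipeline, only the visible (unmasked) tokens would be fed to the encoder, and a lightweight decoder would reconstruct the masked cubes; the sketch above only covers the mask generation step.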
Code Repositories
MCG-NJU/VideoMAE (https://github.com/MCG-NJU/VideoMAE)
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| Action Classification on Kinetics-400 | VideoMAE (no extra data, ViT-H, 32x320x320) | Acc@1: 87.4, Acc@5: 97.6 |
| Action Classification on Kinetics-400 | VideoMAE (no extra data, ViT-H) | Acc@1: 86.6, Acc@5: 97.1 |
| Action Classification on Kinetics-400 | VideoMAE (no extra data, ViT-B, 16x4) | Acc@1: 81.5, Acc@5: 95.1 |
| Action Classification on Kinetics-400 | VideoMAE (no extra data, ViT-L, 32x320x320) | Acc@1: 86.1, Acc@5: 97.3 |
| Action Classification on Kinetics-400 | VideoMAE (no extra data, ViT-L, 16x4) | Acc@1: 85.2, Acc@5: 96.8 |
| Action Recognition in Videos on Something-Something V2 | VideoMAE (no extra data, ViT-B, 16 frames) | GFLOPs: 180x6, Params (M): 87, Top-1: 70.8, Top-5: 92.4 |
| Action Recognition in Videos on Something-Something V2 | VideoMAE (no extra data, ViT-L, 32x2) | GFLOPs: 1436x3, Params (M): 305, Top-1: 75.4, Top-5: 95.2 |
| Action Recognition in Videos on Something-Something V2 | VideoMAE (no extra data, ViT-L, 16 frames) | GFLOPs: 597x6, Params (M): 305, Top-1: 74.3, Top-5: 94.6 |
| Action Recognition on AVA v2.2 | VideoMAE (K700 pretrain, ViT-L, 16x4) | mAP: 36.1 |
| Action Recognition on AVA v2.2 | VideoMAE (K400 pretrain, ViT-B, 16x4) | mAP: 26.7 |
| Action Recognition on AVA v2.2 | VideoMAE (K400 pretrain+finetune, ViT-H, 16x4) | mAP: 39.5 |
| Action Recognition on AVA v2.2 | VideoMAE (K400 pretrain, ViT-L, 16x4) | mAP: 34.3 |
| Action Recognition on AVA v2.2 | VideoMAE (K700 pretrain+finetune, ViT-L, 16x4) | mAP: 39.3 |
| Action Recognition on AVA v2.2 | VideoMAE (K400 pretrain+finetune, ViT-L, 16x4) | mAP: 37.8 |
| Action Recognition on AVA v2.2 | VideoMAE (K400 pretrain+finetune, ViT-B, 16x4) | mAP: 31.8 |
| Action Recognition on AVA v2.2 | VideoMAE (K400 pretrain, ViT-H, 16x4) | mAP: 36.5 |
| Self-Supervised Action Recognition on HMDB51 | VideoMAE (Kinetics-400 pre-training) | Frozen: false, Top-1: 73.3 |
| Self-Supervised Action Recognition on HMDB51 | VideoMAE (no extra data) | Frozen: false, Top-1: 62.6 |
| Self-Supervised Action Recognition on UCF101 | VideoMAE (no extra data) | Frozen: false, 3-fold Accuracy: 91.3 |
| Self-Supervised Action Recognition on UCF101 | VideoMAE (Kinetics-400 pre-training) | Frozen: false, 3-fold Accuracy: 96.1 |