Wang Limin, Huang Bingkun, Zhao Zhiyu, Tong Zhan, He Yinan, Wang Yi, Wang Yali, Qiao Yu

Abstract
Scale is the primary factor for building a powerful foundation model that can generalize well to a variety of downstream tasks. However, it is still challenging to train video foundation models with billions of parameters. This paper shows that the video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models. We scale VideoMAE in both model and data with a core design. Specifically, we present a dual masking strategy for efficient pre-training, with an encoder operating on a subset of video tokens and a decoder processing another subset of video tokens. Although VideoMAE is already very efficient due to the high masking ratio in the encoder, masking the decoder further reduces the overall computational cost. This enables the efficient pre-training of billion-parameter models on video. We also use a progressive training paradigm that involves an initial pre-training on a diverse, multi-sourced unlabeled dataset, followed by a post-pre-training on a mixed labeled dataset. Finally, we successfully train a video ViT model with a billion parameters, which achieves new state-of-the-art performance on Kinetics (90.0% on K400 and 89.9% on K600) and Something-Something (68.7% on V1 and 77.0% on V2). In addition, we extensively verify the pre-trained video ViT models on a variety of downstream tasks, demonstrating their effectiveness as general video representation learners. The code and models are available at https://github.com/OpenGVLab/VideoMAEv2.
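The dual masking strategy is the core design mentioned in the abstract: the encoder runs only on the small visible subset of tokens, and the decoder reconstructs only a fraction of the masked tokens instead of all of them. The snippet below is a minimal, hypothetical PyTorch sketch of that idea; the toy encoder/decoder modules, the masking ratios, and the token-space MSE target are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

# Illustrative sketch of dual masking: the encoder sees only the visible tokens
# (high masking ratio), and the decoder reconstructs only a subset of the masked
# tokens rather than all of them. All module names, ratios, and the simple
# token-space MSE target are assumptions for illustration only.

class ToyEncoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)

    def forward(self, x):
        return self.layer(x)

class ToyDecoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.head = nn.Linear(dim, dim)

    def forward(self, latent, num_to_decode):
        # Append mask tokens only for the positions the decoder will reconstruct
        # (a real decoder would also add positional embeddings for those positions).
        masks = self.mask_token.expand(latent.size(0), num_to_decode, -1)
        x = self.layer(torch.cat([latent, masks], dim=1))
        return self.head(x[:, -num_to_decode:])

def dual_masking_loss(tokens, encoder, decoder,
                      encoder_mask_ratio=0.9, decoder_mask_ratio=0.5):
    """tokens: (B, N, C) video patch tokens; returns a reconstruction loss."""
    B, N, C = tokens.shape
    num_visible = int(N * (1.0 - encoder_mask_ratio))
    perm = torch.rand(B, N, device=tokens.device).argsort(dim=1)
    visible_idx, masked_idx = perm[:, :num_visible], perm[:, num_visible:]

    visible = torch.gather(tokens, 1, visible_idx.unsqueeze(-1).expand(-1, -1, C))
    latent = encoder(visible)                        # encoder sees ~10% of tokens

    # Decoder masking: predict only a fraction of the masked tokens.
    num_to_decode = int(masked_idx.size(1) * (1.0 - decoder_mask_ratio))
    decode_idx = masked_idx[:, :num_to_decode]
    target = torch.gather(tokens, 1, decode_idx.unsqueeze(-1).expand(-1, -1, C))

    pred = decoder(latent, num_to_decode)
    return (pred - target).pow(2).mean()

# Usage sketch: 1568 = 8x14x14 space-time tube tokens (illustrative shape).
tokens = torch.randn(2, 1568, 64)
loss = dual_masking_loss(tokens, ToyEncoder(), ToyDecoder())
```

Because the decoder in this scheme attends over far fewer tokens, its cost shrinks roughly in proportion to the decoder masking ratio, which is what makes pre-training billion-parameter video models tractable on top of the already aggressive encoder masking.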
Code Repositories
https://github.com/OpenGVLab/VideoMAEv2
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| action-classification-on-kinetics-400 | VideoMAE V2-g | Acc@1: 88.5 Acc@5: 98.1 |
| action-classification-on-kinetics-400 | VideoMAE V2-g (64x266x266) | Acc@1: 90.0 Acc@5: 98.4 |
| action-classification-on-kinetics-600 | VideoMAE V2-g | Top-1 Accuracy: 88.8 Top-5 Accuracy: 98.2 |
| action-classification-on-kinetics-600 | VideoMAE V2-g (64x266x266) | Top-1 Accuracy: 89.9 Top-5 Accuracy: 98.5 |
| action-recognition-in-videos-on-ava-v2-2 | VideoMAE V2 | mAP (Val): 18.24 |
| action-recognition-in-videos-on-hmdb-51 | VideoMAE V2-g | Average accuracy of 3 splits: 88.1 |
| action-recognition-in-videos-on-something | VideoMAE V2-g | GFLOPs: 2544x6 Parameters: 1013M Top-1 Accuracy: 77.0 Top-5 Accuracy: 95.9 |
| action-recognition-in-videos-on-something-1 | VideoMAE V2-g | Top 1 Accuracy: 68.7 Top 5 Accuracy: 91.9 |
| action-recognition-in-videos-on-ucf101 | VideoMAE V2-g | 3-fold Accuracy: 99.6 |
| action-recognition-on-ava-v2-2 | VideoMAE V2-g | mAP: 42.6 |
| self-supervised-action-recognition-on-ucf101 | VideoMAE V2-g | 3-fold Accuracy: 99.6 |
| spatio-temporal-action-localization-on-ava | VideoMAE V2-g | val mAP: 42.6 |
| temporal-action-localization-on-fineaction | VideoMAE V2-g | mAP: 18.24 mAP IOU@0.5: 29.07 mAP IOU@0.75: 17.66 mAP IOU@0.95: 5.07 |
| temporal-action-localization-on-thumos14 | ActionFormer (VideoMAE V2-g features) | Avg mAP (0.3:0.7): 69.6 mAP IOU@0.3: 84.0 mAP IOU@0.4: 79.6 mAP IOU@0.5: 73.0 mAP IOU@0.6: 63.5 mAP IOU@0.7: 47.7 |