Wang Limin, Huang Bingkun, Zhao Zhiyu, Tong Zhan, He Yinan, Wang Yi, Wang Yali, Qiao Yu

Abstract
Scale is the primary factor for building a powerful foundation model that can generalize well to a variety of downstream tasks. However, it is still challenging to train video foundation models with billions of parameters. This paper shows that the video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models. We scale VideoMAE in both model and data with a core design. Specifically, we present a dual masking strategy for efficient pre-training, with an encoder operating on a subset of video tokens and a decoder processing another subset of video tokens. Although VideoMAE is already very efficient due to the high masking ratio in the encoder, masking the decoder further reduces the overall computational cost. This enables the efficient pre-training of billion-parameter models on video. We also use a progressive training paradigm that involves an initial pre-training on a diverse, multi-sourced unlabeled dataset, followed by a post-pre-training on a mixed labeled dataset. Finally, we successfully train a video ViT model with a billion parameters, which achieves new state-of-the-art performance on Kinetics (90.0% on K400 and 89.9% on K600) and Something-Something (68.7% on V1 and 77.0% on V2). In addition, we extensively verify the pre-trained video ViT models on a variety of downstream tasks, demonstrating their effectiveness as general video representation learners. The code and models are available at https://github.com/OpenGVLab/VideoMAEv2.
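The dual masking strategy is the core design mentioned in the abstract: the encoder runs only on the small visible subset of tokens, and the decoder reconstructs only a fraction of the masked tokens instead of all of them. The snippet below is a minimal, hypothetical PyTorch sketch of that idea; the toy encoder/decoder modules, the masking ratios, and the token-space MSE target are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

# Illustrative sketch of dual masking: the encoder sees only the visible tokens
# (high masking ratio), and the decoder reconstructs only a subset of the masked
# tokens rather than all of them. All module names, ratios, and the simple
# token-space MSE target are assumptions for illustration only.

class ToyEncoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)

    def forward(self, x):
        return self.layer(x)

class ToyDecoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.head = nn.Linear(dim, dim)

    def forward(self, latent, num_to_decode):
        # Append mask tokens only for the positions the decoder will reconstruct
        # (a real decoder would also add positional embeddings for those positions).
        masks = self.mask_token.expand(latent.size(0), num_to_decode, -1)
        x = self.layer(torch.cat([latent, masks], dim=1))
        return self.head(x[:, -num_to_decode:])

def dual_masking_loss(tokens, encoder, decoder,
                      encoder_mask_ratio=0.9, decoder_mask_ratio=0.5):
    """tokens: (B, N, C) video patch tokens; returns a reconstruction loss."""
    B, N, C = tokens.shape
    num_visible = int(N * (1.0 - encoder_mask_ratio))
    perm = torch.rand(B, N, device=tokens.device).argsort(dim=1)
    visible_idx, masked_idx = perm[:, :num_visible], perm[:, num_visible:]

    visible = torch.gather(tokens, 1, visible_idx.unsqueeze(-1).expand(-1, -1, C))
    latent = encoder(visible)                        # encoder sees ~10% of tokens

    # Decoder masking: predict only a fraction of the masked tokens.
    num_to_decode = int(masked_idx.size(1) * (1.0 - decoder_mask_ratio))
    decode_idx = masked_idx[:, :num_to_decode]
    target = torch.gather(tokens, 1, decode_idx.unsqueeze(-1).expand(-1, -1, C))

    pred = decoder(latent, num_to_decode)
    return (pred - target).pow(2).mean()

# Usage sketch: 1568 = 8x14x14 space-time tube tokens (illustrative shape).
tokens = torch.randn(2, 1568, 64)
loss = dual_masking_loss(tokens, ToyEncoder(), ToyDecoder())
```

Because the decoder in this scheme attends over far fewer tokens, its cost shrinks roughly in proportion to the decoder masking ratio, which is what makes pre-training billion-parameter video models tractable on top of the already aggressive encoder masking.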
Code Repositories
https://github.com/OpenGVLab/VideoMAEv2
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| action-classification-on-kinetics-400 | VideoMAE V2-g | Acc@1: 88.5 Acc@5: 98.1 |
| action-classification-on-kinetics-400 | VideoMAE V2-g (64x266x266) | Acc@1: 90.0 Acc@5: 98.4 |
| action-classification-on-kinetics-600 | VideoMAE V2-g | Top-1 Accuracy: 88.8 Top-5 Accuracy: 98.2 |
| action-classification-on-kinetics-600 | VideoMAE V2-g (64x266x266) | Top-1 Accuracy: 89.9 Top-5 Accuracy: 98.5 |
| action-recognition-in-videos-on-ava-v2-2 | VideoMAE V2 | mAP (Val): 18.24 |
| action-recognition-in-videos-on-hmdb-51 | VideoMAE V2-g | Average accuracy of 3 splits: 88.1 |
| action-recognition-in-videos-on-something | VideoMAE V2-g | GFLOPs: 2544x6 Parameters: 1013M Top-1 Accuracy: 77.0 Top-5 Accuracy: 95.9 |
| action-recognition-in-videos-on-something-1 | VideoMAE V2-g | Top 1 Accuracy: 68.7 Top 5 Accuracy: 91.9 |
| action-recognition-in-videos-on-ucf101 | VideoMAE V2-g | 3-fold Accuracy: 99.6 |
| action-recognition-on-ava-v2-2 | VideoMAE V2-g | mAP: 42.6 |
| self-supervised-action-recognition-on-ucf101 | VideoMAE V2-g | 3-fold Accuracy: 99.6 |
| spatio-temporal-action-localization-on-ava | VideoMAE V2-g | val mAP: 42.6 |
| temporal-action-localization-on-fineaction | VideoMAE V2-g | mAP: 18.24 mAP IOU@0.5: 29.07 mAP IOU@0.75: 17.66 mAP IOU@0.95: 5.07 |
| temporal-action-localization-on-thumos14 | ActionFormer (VideoMAE V2-g features) | Avg mAP (0.3:0.7): 69.6 mAP IOU@0.3: 84.0 mAP IOU@0.4: 79.6 mAP IOU@0.5: 73.0 mAP IOU@0.6: 63.5 mAP IOU@0.7: 47.7 |