VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, Yu Qiao


Abstract

Scale is the primary factor for building a powerful foundation model that can generalize well to a variety of downstream tasks. However, it is still challenging to train video foundation models with billions of parameters. This paper shows that the video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models. We scale VideoMAE in both model and data with a core design. Specifically, we present a dual masking strategy for efficient pre-training, with an encoder operating on a subset of video tokens and a decoder processing another subset of video tokens. Although VideoMAE is already efficient due to the high masking ratio in the encoder, masking the decoder further reduces the overall computational cost. This enables the efficient pre-training of billion-parameter models on video. We also use a progressive training paradigm that involves initial pre-training on a diverse, multi-sourced unlabeled dataset, followed by post-pre-training on a mixed labeled dataset. Finally, we successfully train a video ViT model with a billion parameters, which achieves new state-of-the-art performance on Kinetics (90.0% on K400 and 89.9% on K600) and Something-Something (68.7% on V1 and 77.0% on V2). In addition, we extensively verify the pre-trained video ViT models on a variety of downstream tasks, demonstrating their effectiveness as general video representation learners. The code and models are available at https://github.com/OpenGVLab/VideoMAEv2.
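The dual masking strategy amounts to two independent token selections: the encoder runs only on the small visible subset, while the decoder reconstructs only a sampled subset of the masked tokens rather than all of them, shrinking the decoder's sequence length as well. The sketch below illustrates that selection logic in PyTorch; the function name dual_masking, the uniform random sampling, and the default ratios are illustrative assumptions (the paper uses tube masking on the encoder side and a running-cell pattern on the decoder side).

```python
import torch

def dual_masking(num_tokens: int,
                 encoder_mask_ratio: float = 0.9,
                 decoder_mask_ratio: float = 0.5):
    """Split token indices for dual-masked pre-training (sketch).

    Returns the indices the encoder processes and the masked indices
    the decoder is asked to reconstruct. Uniform random sampling is a
    simplification; VideoMAE V2 uses tube masking for the encoder and
    a running-cell pattern for the decoder.
    """
    perm = torch.randperm(num_tokens)
    num_visible = round(num_tokens * (1 - encoder_mask_ratio))
    visible_idx = perm[:num_visible]    # tokens fed to the encoder
    masked_idx = perm[num_visible:]     # tokens hidden from the encoder
    # Dual masking: reconstruct only part of the masked set, which
    # shortens the decoder sequence and cuts pre-training cost further.
    num_recon = round(masked_idx.numel() * (1 - decoder_mask_ratio))
    keep = torch.randperm(masked_idx.numel())[:num_recon]
    recon_idx = masked_idx[keep]
    return visible_idx, recon_idx

# Example: 16 input frames with a 2x16x16 cube embedding on 224x224
# video gives 8 * 14 * 14 = 1568 tokens.
visible, recon = dual_masking(1568)
print(visible.numel(), recon.numel())  # -> 157 encoder tokens, 706 targets
```

Under these ratios the decoder attends over roughly half of the full token sequence (the encoded visible tokens plus the sampled reconstruction targets) instead of all of it, which is what makes billion-parameter pre-training tractable.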

Code Repositories

OpenGVLab/VideoMAEv2 (official, PyTorch): https://github.com/OpenGVLab/VideoMAEv2

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| Action Classification on Kinetics-400 | VideoMAE V2-g | Acc@1: 88.5, Acc@5: 98.1 |
| Action Classification on Kinetics-400 | VideoMAE V2-g (64x266x266) | Acc@1: 90.0, Acc@5: 98.4 |
| Action Classification on Kinetics-600 | VideoMAE V2-g | Top-1: 88.8, Top-5: 98.2 |
| Action Classification on Kinetics-600 | VideoMAE V2-g (64x266x266) | Top-1: 89.9, Top-5: 98.5 |
| Action Recognition in Videos on AVA v2.2 | VideoMAE V2 | mAP (Val): 18.24 |
| Action Recognition in Videos on HMDB-51 | VideoMAE V2-g | Avg. accuracy over 3 splits: 88.1 |
| Action Recognition in Videos on Something-Something V2 | VideoMAE V2-g | Top-1: 77.0, Top-5: 95.9, GFLOPs: 2544x6, Params (M): 1013 |
| Action Recognition in Videos on Something-Something V1 | VideoMAE V2-g | Top-1: 68.7, Top-5: 91.9 |
| Action Recognition in Videos on UCF101 | VideoMAE V2-g | 3-fold accuracy: 99.6 |
| Action Recognition on AVA v2.2 | VideoMAE V2-g | mAP: 42.6 |
| Self-Supervised Action Recognition on UCF101 | VideoMAE V2-g | 3-fold accuracy: 99.6 |
| Spatio-Temporal Action Localization on AVA | VideoMAE V2-g | val mAP: 42.6 |
| Temporal Action Localization on FineAction | VideoMAE V2-g | mAP: 18.24, mAP IoU@0.5: 29.07, mAP IoU@0.75: 17.66, mAP IoU@0.95: 5.07 |
| Temporal Action Localization on THUMOS14 | ActionFormer (VideoMAE V2-g features) | Avg mAP (0.3:0.7): 69.6, mAP IoU@0.3: 84.0, mAP IoU@0.4: 79.6, mAP IoU@0.5: 73.0, mAP IoU@0.6: 63.5, mAP IoU@0.7: 47.7 |
