Action Recognition on AVA v2.2

Evaluation Metric

mAP
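The metric reported below is mean Average Precision. In the AVA protocol this is frame-level mAP over person boxes at IoU 0.5; the following is a minimal sketch of the core computation only (per-class AP averaged over classes), omitting the box-matching/IoU step of the full evaluation. Function names and the input layout are illustrative, not from any official evaluation code:

```python
def average_precision(scores, labels):
    """AP for one class: `scores` are detection confidences,
    `labels` mark each detection as a true positive (1) or not (0)."""
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    total_pos = sum(labels)
    if total_pos == 0:
        return 0.0
    tp, ap = 0, 0.0
    for rank, (_, is_pos) in enumerate(ranked, start=1):
        if is_pos:
            tp += 1
            ap += tp / rank  # precision at each recall step
    return ap / total_pos

def mean_average_precision(per_class):
    """mAP: unweighted mean of the per-class APs.
    `per_class` maps class name -> (scores, labels)."""
    aps = [average_precision(s, l) for s, l in per_class.values()]
    return sum(aps) / len(aps)
```

Because the mean is unweighted, rare classes count as much as frequent ones, which is why AVA mAP is sensitive to performance on the long tail of action classes.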

Benchmark Results

Performance of each model on this benchmark.

| Model | mAP | Paper Title |
| --- | --- | --- |
| LART (Hiera-H, K700 PT+FT) | 45.1 | On the Benefits of 3D Pose and Tracking for Human Action Recognition |
| Hiera-H (K700 PT+FT) | 43.3 | Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles |
| VideoMAE V2-g | 42.6 | VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking |
| STAR/L | 41.7 | End-to-End Spatio-Temporal Action Localisation with Video Transformers |
| MVD (Kinetics400 pretrain+finetune, ViT-H, 16x4) | 41.1 | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning |
| InternVideo | 41.01 | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| MVD (Kinetics400 pretrain, ViT-H, 16x4) | 40.1 | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning |
| MaskFeat (Kinetics-600 pretrain, MViT-L) | 39.8 | Masked Feature Prediction for Self-Supervised Visual Pre-Training |
| UMT-L (ViT-L/16) | 39.8 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| VideoMAE (K400 pretrain+finetune, ViT-H, 16x4) | 39.5 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training |
| VideoMAE (K700 pretrain+finetune, ViT-L, 16x4) | 39.3 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training |
| MVD (Kinetics400 pretrain+finetune, ViT-L, 16x4) | 38.7 | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning |
| VideoMAE (K400 pretrain+finetune, ViT-L, 16x4) | 37.8 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training |
| MVD (Kinetics400 pretrain, ViT-L, 16x4) | 37.7 | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning |
| VideoMAE (K400 pretrain, ViT-H, 16x4) | 36.5 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training |
| VideoMAE (K700 pretrain, ViT-L, 16x4) | 36.1 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training |
| MeMViT-24 | 35.4 | MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition |
| MViTv2-L (IN21k, K700) | 34.4 | MViTv2: Improved Multiscale Vision Transformers for Classification and Detection |
| VideoMAE (K400 pretrain, ViT-L, 16x4) | 34.3 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training |
| MVD (Kinetics400 pretrain+finetune, ViT-B, 16x4) | 34.2 | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning |