| Model | Score | Paper | Code |
| --- | --- | --- | --- |
| LART (Hiera-H, K700 pretrain+finetune) | 45.1 | On the Benefits of 3D Pose and Tracking for Human Action Recognition | |
| Hiera-H (K700 pretrain+finetune) | 43.3 | Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | |
| VideoMAE V2-g | 42.6 | VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | |
| STAR/L | 41.7 | End-to-End Spatio-Temporal Action Localisation with Video Transformers | |
| MVD (K400 pretrain+finetune, ViT-H, 16x4) | 41.1 | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | |
| InternVideo | 41.01 | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | |
| MVD (K400 pretrain, ViT-H, 16x4) | 40.1 | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | |
| MaskFeat (K600 pretrain, MViT-L) | 39.8 | Masked Feature Prediction for Self-Supervised Visual Pre-Training | |
| UMT-L (ViT-L/16) | 39.8 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models | |
| VideoMAE (K400 pretrain+finetune, ViT-H, 16x4) | 39.5 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | |
| VideoMAE (K700 pretrain+finetune, ViT-L, 16x4) | 39.3 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | |
| MVD (K400 pretrain+finetune, ViT-L, 16x4) | 38.7 | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | |
| VideoMAE (K400 pretrain+finetune, ViT-L, 16x4) | 37.8 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | |
| MVD (K400 pretrain, ViT-L, 16x4) | 37.7 | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | |
| VideoMAE (K400 pretrain, ViT-H, 16x4) | 36.5 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | |
| VideoMAE (K700 pretrain, ViT-L, 16x4) | 36.1 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | |
| MeMViT-24 | 35.4 | MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition | |
| MViTv2-L (IN21k, K700) | 34.4 | MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | |
| VideoMAE (K400 pretrain, ViT-L, 16x4) | 34.3 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | |
| MVD (K400 pretrain+finetune, ViT-B, 16x4) | 34.2 | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | |