| MVD (Kinetics400 pretrain, ViT-H, 16 frame) | 77.3 | 95.7 | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | |
| InternVideo | 77.2 | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | |
| InternVideo2-1B | 77.1 | - | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | |
| VideoMAE V2-g | 77.0 | 95.9 | VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | |
| MVD (Kinetics400 pretrain, ViT-L, 16 frame) | 76.7 | 95.5 | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | |
| Hiera-L (no extra data) | 76.5 | - | Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | |
| TubeViT-L | 76.1 | 95.2 | Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | |
| VideoMAE (no extra data, ViT-L, 32x2) | 75.4 | 95.2 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | |
| Side4Video (EVA ViT-E/14) | 75.2 | 94.0 | Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning | |
| MaskFeat (Kinetics600 pretrain, MViT-L) | 75.0 | 95.0 | Masked Feature Prediction for Self-Supervised Visual Pre-Training | |
| MAR (50% mask, ViT-L, 16x4) | 74.7 | 94.9 | MAR: Masked Autoencoders for Efficient Action Recognition | |
| ATM | 74.6 | 94.4 | What Can Simple Arithmetic Operations Do for Temporal Modeling? | |
| MAWS (ViT-L) | 74.4 | - | The effectiveness of MAE pre-pretraining for billion-scale pretraining | |
| VideoMAE (no extra data, ViT-L, 16 frame) | 74.3 | 94.6 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | |
| MAR (75% mask, ViT-L, 16x4) | 73.8 | 94.4 | MAR: Masked Autoencoders for Efficient Action Recognition | |
| ViC-MAE (ViT-L) | 73.7 | - | ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders | |
| MVD (Kinetics400 pretrain, ViT-B, 16 frame) | 73.7 | 94.0 | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | |
| TAdaFormer-L/14 | 73.6 | - | Temporally-Adaptive Models for Efficient Video Understanding | |
| TDS-CLIP-ViT-L/14 (8 frames) | 73.4 | 93.8 | TDS-CLIP: Temporal Difference Side Network for Image-to-Video Transfer Learning | |
| AMD (ViT-B/16) | 73.3 | 94.0 | Asymmetric Masked Distillation for Pre-Training Small Foundation Models | |