| MaskFeat (no extra data, MViT-L) | 80.4 | 95.7 | Masked Feature Prediction for Self-Supervised Visual Pre-Training | |
| AIM (CLIP ViT-L/14, 32x224) | 80.4 | - | AIM: Adapting Image Models for Efficient Video Action Recognition | |
| MViTv2-L (ImageNet-21k pretrain) | 79.4 | 94.9 | MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | |