| DEEP-HAL with ODF+SDF (AssembleNet++) | 62.29 | Self-supervising Action Recognition by Statistical Moment and Subspace Descriptors | - |
| AdaFocus (weak supervision, MViT-B-24, 32x3) | 47.8 | Towards Weakly Supervised End-to-end Learning for Long-video Action Recognition | - |
| MViT-B-24, 32x3 (Kinetics-600 pretraining) | 47.7 | Multiscale Vision Transformers | |
| MViT-B, 32x3 (Kinetics-600 pretraining) | 47.1 | Multiscale Vision Transformers | |
| MViT-B-24, 32x3 (Kinetics-400 pretraining) | 46.3 | Multiscale Vision Transformers | |
| SlowFast (Kinetics-600 pretraining, NL) | 45.2 | SlowFast Networks for Video Recognition | |
| MViT-B, 32x3 (Kinetics-400 pretraining) | 44.3 | Multiscale Vision Transformers | |