| UniFormer-B (IN-1K + Kinetics400) | 259x3 | 50.1 | 60.9 | 87.3 | UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning | - |
| UniFormer-B (IN-1K + Kinetics600) | 41.8x3 | 21.4 | 57.6 | 84.9 | UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning | - |
| EAN ResNet50 (single clip, center crop, 8+16 ensemble, with sparse Transformer) | - | - | 57.2 | 83.9 | EAN: Event Adaptive Network for Enhanced Action Recognition | |
| BQNEn (ImageNet + K400 pretrained) | - | - | 57.1 | 84.2 | Busy-Quiet Video Disentangling for Video Classification | |
| TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only) | - | - | 56.8 | 84.1 | TDN: Temporal Difference Networks for Efficient Action Recognition | |
| CT-Net Ensemble (R50, 8+12+16+24) | - | - | 56.6 | - | CT-Net: Channel Tensorization Network for Video Classification | |