| BraVe:V-FA (TSM-50x2) | false | - | 70.5 | Broaden Your Views for Self-Supervised Video Learning | |
| CVRL (R3D-152 2x; K600) | false | Kinetics600 | 69.9 | Spatiotemporal Contrastive Video Representation Learning | |
| CVRL (R3D-50; K600) | false | Kinetics600 | 68.0 | Spatiotemporal Contrastive Video Representation Learning | |
| CVRL (R3D-50; K400) | false | Kinetics400 | 66.7 | Spatiotemporal Contrastive Video Representation Learning | |
| AVID+CMA (Modified R2+1D-18 on AudioSet) | false | AudioSet (Video+Audio) | 64.7 | Audio-Visual Instance Discrimination with Cross-Modal Agreement | |
| AVID (Modified R2+1D-18 on AudioSet) | false | AudioSet (Video+Audio) | 64.1 | Audio-Visual Instance Discrimination with Cross-Modal Agreement | |