| MMV TSM-50x2 | 95.2 | false | AudioSet + HowTo100M | Self-Supervised MultiModal Versatile Networks | |
| CVRL (R3D-152 2x; K600) | 93.9 | false | Kinetics600 | Spatiotemporal Contrastive Video Representation Learning | |
| CVRL (R3D-50; K600) | 93.4 | false | Kinetics600 | Spatiotemporal Contrastive Video Representation Learning | |
| BraVe:V-FA (TSM-50x2) | 93.1 | false | - | Broaden Your Views for Self-Supervised Video Learning | |
| CVRL (R3D-50; K400) | 92.2 | false | Kinetics400 | Spatiotemporal Contrastive Video Representation Learning | |
| AVID+CMA (Modified R2+1D-18 on AudioSet) | 91.5 | false | AudioSet (Audio+Video) | Audio-Visual Instance Discrimination with Cross-Modal Agreement | |