| MAViL (Audio-Visual, Single) | 0.533 | - | - |
| Audiovisual Masked Autoencoder (Audiovisual, Single) | 0.518 | Audiovisual Masked Autoencoders | |
| CAV-MAE (Audio-Visual) | 0.512 | Contrastive Audio-Visual Masked Autoencoder | |
| BEATs (Audio-only, Ensemble) | 0.506 | BEATs: Audio Pre-Training with Acoustic Tokenizers | |
| MBT (AS-500K training + Video) | 0.496 | Attention Bottlenecks for Multimodal Fusion | |
| BEATs (Audio-only, Single) | 0.486 | BEATs: Audio Pre-Training with Acoustic Tokenizers | |