Arsha Nagrani Shan Yang Anurag Arnab Aren Jansen Cordelia Schmid Chen Sun

Abstract
Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality ('late-fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers. Compared to traditional pairwise self-attention, our model forces information between different modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense the most relevant information in each modality and only share what is necessary. We find that such a strategy improves fusion performance, at the same time reducing computational cost. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including Audioset, Epic-Kitchens and VGGSound. All code and models will be released.
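The bottleneck idea described in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the function names, the identity query/key/value projections, the token counts, and in particular the choice to average the two per-modality bottleneck updates are all assumptions made for illustration. Each modality attends only over its own tokens plus a few shared bottleneck tokens, so cross-modal information can flow only through those tokens.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity projections (no learned weights)."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        outputs.append([sum(w * tok[j] for w, tok in zip(weights, tokens))
                        for j in range(dim)])
    return outputs


def bottleneck_fusion_layer(video, audio, bottleneck):
    """One fusion layer: each modality sees only itself plus the bottleneck tokens."""
    n_video, n_audio = len(video), len(audio)
    video_out = self_attention(video + bottleneck)
    audio_out = self_attention(audio + bottleneck)
    new_video, bn_from_video = video_out[:n_video], video_out[n_video:]
    new_audio, bn_from_audio = audio_out[:n_audio], audio_out[n_audio:]
    # Assumption of this sketch: merge the two bottleneck updates by averaging.
    new_bottleneck = [[(x + y) / 2.0 for x, y in zip(bv, ba)]
                      for bv, ba in zip(bn_from_video, bn_from_audio)]
    return new_video, new_audio, new_bottleneck


def attention_pairs(n_video, n_audio, n_bottleneck=None):
    """Query-key pairs per layer; full joint attention when n_bottleneck is None."""
    if n_bottleneck is None:
        return (n_video + n_audio) ** 2
    return (n_video + n_bottleneck) ** 2 + (n_audio + n_bottleneck) ** 2


def random_tokens(rng, count, dim):
    """Hypothetical stand-in for real patch or spectrogram embeddings."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]


if __name__ == "__main__":
    rng = random.Random(0)
    video = random_tokens(rng, 16, 8)
    audio = random_tokens(rng, 12, 8)
    bottleneck = random_tokens(rng, 4, 8)
    for _ in range(2):  # stack two fusion layers
        video, audio, bottleneck = bottleneck_fusion_layer(video, audio, bottleneck)
    print(attention_pairs(16, 12))     # full pairwise attention: 28 ** 2 = 784
    print(attention_pairs(16, 12, 4))  # bottleneck fusion: 20 ** 2 + 16 ** 2 = 656
```

Within a single layer of this sketch, the video outputs do not depend on the audio tokens at all. Audio information reaches the video stream only in later layers, after it has been condensed into the bottleneck tokens. The pair count also shows where the saving comes from: it grows with the square of each modality's own length rather than the square of their sum.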
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| action-classification-on-kinetics-400 | MBT (AV) | Acc@1: 80.8; Acc@5: 94.6 |
| action-classification-on-kinetics-sounds | MBT (AV) | Top 1 Accuracy: 85; Top 5 Accuracy: 96.8 |
| action-classification-on-moments-in-time | MBT (AV) | Top 1 Accuracy: 37.3; Top 5 Accuracy: 61.2 |
| action-recognition-on-epic-kitchens-100 | MBT | Action@1: 43.4; Noun@1: 58; Verb@1: 64.8 |
| audio-classification-on-audioset | MBT (AS-500K training + Video) | Test mAP: 0.496 |
| audio-classification-on-vggsound | MBT (AV) | Top 5 Accuracy: 85.6 |
| audio-classification-on-vggsound | MBT (A) | Top 1 Accuracy: 52.3; Top 5 Accuracy: 78.1 |
| audio-classification-on-vggsound | MBT (V) | Top 1 Accuracy: 51.2; Top 5 Accuracy: 72.6 |