Attention Bottlenecks for Multimodal Fusion

Arsha Nagrani Shan Yang Anurag Arnab Aren Jansen Cordelia Schmid Chen Sun

Abstract

Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality ('late-fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers. Compared to traditional pairwise self-attention, our model forces information between different modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense the most relevant information in each modality and only share what is necessary. We find that such a strategy improves fusion performance while also reducing computational cost. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including Audioset, Epic-Kitchens and VGGSound. All code and models will be released.
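
The bottleneck idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the attention below is single-head with identity projections (no learned weights), the function names are hypothetical, and averaging the per-modality bottleneck updates is an assumption of this sketch that the abstract does not state. It only shows how each modality attends to its own tokens plus a few shared bottleneck tokens, so that cross-modal information has to pass through those tokens.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    # Toy single-head self-attention with identity projections
    # (assumption: a real layer would use learned Q/K/V weights).
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    return softmax(scores) @ tokens


def bottleneck_fusion_layer(video, audio, bottleneck):
    # Each modality attends only over its own tokens plus the shared
    # bottleneck tokens; video and audio tokens never attend to each
    # other directly.
    n_v, n_a = len(video), len(audio)
    v_out = self_attention(np.concatenate([video, bottleneck]))
    a_out = self_attention(np.concatenate([audio, bottleneck]))
    new_video, b_from_video = v_out[:n_v], v_out[n_v:]
    new_audio, b_from_audio = a_out[:n_a], a_out[n_a:]
    # Assumption of this sketch: merge the two bottleneck updates by averaging.
    new_bottleneck = 0.5 * (b_from_video + b_from_audio)
    return new_video, new_audio, new_bottleneck


def attention_pair_counts(n_video, n_audio, n_bottleneck):
    # Number of query-key pairs per layer: full pairwise self-attention
    # over all tokens versus two bottlenecked per-modality attentions.
    full = (n_video + n_audio) ** 2
    bottlenecked = (n_video + n_bottleneck) ** 2 + (n_audio + n_bottleneck) ** 2
    return full, bottlenecked


rng = np.random.default_rng(0)
video = rng.normal(size=(8, 16))       # 8 video tokens, width 16 (illustrative sizes)
audio = rng.normal(size=(6, 16))       # 6 audio tokens
bottleneck = rng.normal(size=(2, 16))  # 2 shared bottleneck tokens

new_video, new_audio, new_bottleneck = bottleneck_fusion_layer(video, audio, bottleneck)
print(new_video.shape, new_audio.shape, new_bottleneck.shape)
print(attention_pair_counts(196, 100, 4))
```

In this sketch, the video output of a single layer does not depend on the audio tokens at all; audio information reaches the video stream only in the next layer, through the updated bottleneck tokens. The pair count also hints at why the abstract reports lower computational cost: the quadratic term is paid per modality rather than over the concatenation of all tokens.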

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| action-classification-on-kinetics-400 | MBT (AV) | Acc@1: 80.8; Acc@5: 94.6 |
| action-classification-on-kinetics-sounds | MBT (AV) | Top 1 Accuracy: 85; Top 5 Accuracy: 96.8 |
| action-classification-on-moments-in-time | MBT (AV) | Top 1 Accuracy: 37.3; Top 5 Accuracy: 61.2 |
| action-recognition-on-epic-kitchens-100 | MBT | Action@1: 43.4; Noun@1: 58; Verb@1: 64.8 |
| audio-classification-on-audioset | MBT (AS-500K training + Video) | Test mAP: 0.496 |
| audio-classification-on-vggsound | MBT (AV) | Top 5 Accuracy: 85.6 |
| audio-classification-on-vggsound | MBT (A) | Top 1 Accuracy: 52.3; Top 5 Accuracy: 78.1 |
| audio-classification-on-vggsound | MBT (V) | Top 1 Accuracy: 51.2; Top 5 Accuracy: 72.6 |
