3 months ago

AVT: Audio-Video Transformer for Multimodal Action Recognition

Mohamed Omar, Linda Liu, Xiang Hao, Xiaohang Sun, Kevin Hsu, Jingru Yi, Wentao Zhu

Abstract

Action recognition is an essential field for video understanding. To learn effectively from heterogeneous data sources, we propose a novel multimodal action recognition approach termed Audio-Video Transformer (AVT). AVT combines video and audio signals to improve action recognition accuracy, leveraging the effective spatio-temporal representation of the video Transformer. For multimodal fusion, simply concatenating multimodal tokens in a cross-modal Transformer requires large computational and memory resources; instead, we reduce the cross-modality complexity through an audio-video bottleneck Transformer. To improve the learning efficiency of the multimodal Transformer, we integrate self-supervised objectives, i.e., audio-video contrastive learning, audio-video matching, and masked audio and video learning, into AVT training, which maps diverse audio and video representations into a common multimodal representation space. We further propose a masked audio segment loss to learn semantic audio activities in AVT. Extensive experiments and ablation studies on three public datasets and two in-house datasets consistently demonstrate the effectiveness of the proposed AVT. Specifically, AVT outperforms its previous state-of-the-art counterparts on the Kinetics-Sounds and Epic-Kitchens-100 datasets by 8% and 1%, respectively, without external training data. AVT also surpasses one of the previous state-of-the-art video Transformers by 10% on the VGGSound dataset by leveraging the audio signal. Compared to one of the previous state-of-the-art multimodal Transformers, AVT is 1.3x more efficient in terms of FLOPs and improves accuracy by 4.2% on Epic-Kitchens-100. Visualization results further demonstrate that audio provides complementary and discriminative features, and that AVT can effectively understand an action from a combination of audio and video.
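The abstract's point that concatenating all tokens is costly can be illustrated with a rough count of attention query-key pairs. The sketch below is not the authors' implementation: it assumes a bottleneck design in which each modality attends only to its own tokens plus a small set of shared bottleneck tokens, which the abstract does not spell out. The function names and token counts are hypothetical, and the count ignores everything except self-attention pairs, so it does not reproduce the reported FLOPs figure.

```python
def concat_attention_pairs(n_video: int, n_audio: int) -> int:
    """Query-key pairs when video and audio tokens share one sequence."""
    total = n_video + n_audio
    return total * total


def bottleneck_attention_pairs(n_video: int, n_audio: int, n_bottleneck: int) -> int:
    """Query-key pairs when each modality attends only to its own tokens
    plus a small set of shared bottleneck tokens (assumed design)."""
    video_side = (n_video + n_bottleneck) ** 2
    audio_side = (n_audio + n_bottleneck) ** 2
    return video_side + audio_side


if __name__ == "__main__":
    # Hypothetical token counts, chosen only for illustration.
    n_video, n_audio, n_bottleneck = 1568, 512, 4
    full = concat_attention_pairs(n_video, n_audio)
    reduced = bottleneck_attention_pairs(n_video, n_audio, n_bottleneck)
    print(f"concatenated: {full} pairs")
    print(f"bottleneck:   {reduced} pairs")
    print(f"ratio:        {full / reduced:.2f}x")
```

Under these assumptions the saving comes from dropping the direct video-to-audio cross terms, so it grows as the two token counts become more balanced.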

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| action-recognition-on-epic-kitchens-100 | AVT | Action@1: 47.2; Noun@1: 59.3; Verb@1: 70.4 |
| audio-classification-on-vggsound | AVT (Audio-Visual) | Top 1 Accuracy: 63.9; Top 5 Accuracy: 85.0 |
| audio-classification-on-vggsound | AVT (V) | Top 1 Accuracy: 53.2; Top 5 Accuracy: 74.8 |
| multi-modal-classification-on-vgg-sound | AVT | Top-1 Accuracy: 63.9; Top-5 Accuracy: 85.0 |
