Command Palette
Search for a command to run...
{Mohamed Omar Linda Liu Xiang Hao Xiaohang Sun Kevin Hsu Jingru Yi Wentao Zhu}

Abstract
Action recognition is an essential field for video understanding. To learn from heterogeneous data sources effectively, in this work, we propose a novel multimodal action recognition approach termed Audio-Video Transformer (AVT). AVT uses a combination of video and audio signals to improve action recognition accuracy, leveraging the effective spatio-temporal representation by the video Transformer. For multimodal fusion, simply concatenating multimodal tokens in a cross-modal Transformer requires large computational and memory resources, instead we reduce the cross-modality complexity through an audio-video bottleneck Transformer. To improve the learning efficiency of multimodal Transformer, we integrate self-supervised objectives, i.e., audio-video contrastive learning, audio-video matching, and masked audio and video learning, into AVT training, which maps diverse audio and video representations into a common multimodal representation space. We further propose a masked audio segment loss to learn semantic audio activities in AVT. Extensive experiments and ablation studies on three public datasets and two in-house datasets consistently demonstrate the effectiveness of the proposed AVT. Specifically, AVT outperforms its previous state-of-the-art counterparts on Kinetics-Sounds and Epic-Kitchens-100 datasets by 8% and 1%, respectively, without external training data. AVT also surpasses one of the previous state-of-the-art video Transformers by 10% on the VGGSound dataset by leveraging the audio signal. Compared to one of the previous state-of-the-art multimodal Transformers, AVT is 1.3x more efficient in terms of FLOPs and improves the accuracy by 4.2% on Epic-Kitchens-100. Visualization results further demonstrate that the audio provides complementary and discriminative features, and our AVT can effectively understand the action from a combination of audio and video.
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| action-recognition-on-epic-kitchens-100 | AVT | Action@1: 47.2 Noun@1: 59.3 Verb@1: 70.4 |
| audio-classification-on-vggsound | AVT (Audio-Visual) | Top 1 Accuracy: 63.9 Top 5 Accuracy: 85.0 |
| audio-classification-on-vggsound | AVT (V) | Top 1 Accuracy: 53.2 Top 5 Accuracy: 74.8 |
| multi-modal-classification-on-vgg-sound | AVT | Top-1 Accuracy: 63.9 Top-5 Accuracy: 85.0 |
Build AI with AI
From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.