Abstract

Action recognition is an essential field for video understanding. To learn effectively from heterogeneous data sources, we propose a novel multimodal action recognition approach termed Audio-Video Transformer (AVT). AVT combines video and audio signals to improve action recognition accuracy, leveraging the effective spatio-temporal representation of the video Transformer. For multimodal fusion, simply concatenating multimodal tokens in a cross-modal Transformer requires large computational and memory resources; instead, we reduce the cross-modality complexity through an audio-video bottleneck Transformer. To improve the learning efficiency of the multimodal Transformer, we integrate self-supervised objectives, i.e., audio-video contrastive learning, audio-video matching, and masked audio and video learning, into AVT training, which maps diverse audio and video representations into a common multimodal representation space. We further propose a masked audio segment loss to learn semantic audio activities in AVT. Extensive experiments and ablation studies on three public datasets and two in-house datasets consistently demonstrate the effectiveness of the proposed AVT. Specifically, AVT outperforms its previous state-of-the-art counterparts on the Kinetics-Sounds and Epic-Kitchens-100 datasets by 8% and 1%, respectively, without external training data. AVT also surpasses one of the previous state-of-the-art video Transformers by 10% on the VGGSound dataset by leveraging the audio signal. Compared to one of the previous state-of-the-art multimodal Transformers, AVT is 1.3x more efficient in terms of FLOPs and improves accuracy by 4.2% on Epic-Kitchens-100. Visualization results further demonstrate that audio provides complementary and discriminative features, and that AVT can effectively understand the action from a combination of audio and video.
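
To make the audio-video contrastive objective mentioned in the abstract more concrete, here is a minimal NumPy sketch of a symmetric InfoNCE-style loss over a batch of paired clip embeddings. It is an illustration under stated assumptions, not the AVT implementation: the abstract does not give the loss formulation, so the function name `audio_video_contrastive_loss`, the temperature value of 0.07, and the use of cosine similarity are all hypothetical choices.

```python
import numpy as np


def audio_video_contrastive_loss(video_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss for paired embeddings (illustrative only).

    video_emb, audio_emb: arrays of shape (batch, dim); row i of each array
    is assumed to come from the same clip. The temperature is an assumption.
    """
    # L2-normalize so that dot products are cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)

    # (batch, batch) similarity matrix; the diagonal holds the matched pairs.
    logits = v @ a.T / temperature

    def cross_entropy_diagonal(scores):
        # Numerically stable log-softmax per row, then select the diagonal.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the video-to-audio and audio-to-video directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clips = rng.normal(size=(8, 16))
    # Matched pairs should score a lower loss than deliberately mismatched ones.
    print(audio_video_contrastive_loss(clips, clips))
    print(audio_video_contrastive_loss(clips, np.roll(clips, 1, axis=0)))
```

Minimizing a loss of this kind pulls the audio and video embeddings of the same clip together and pushes embeddings from different clips apart, which is one common way to obtain the shared multimodal representation space the abstract describes.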

AVT: Audio-Video Transformer for Multimodal Action Recognition

Mohamed Omar Linda Liu Xiang Hao Xiaohang Sun Kevin Hsu Jingru Yi Wentao Zhu
