Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection

Sharma Rahul ; Narayanan Shrikanth

Abstract

Active speaker detection in videos addresses associating a source face, visible in the video frames, with the underlying speech in the audio modality. The two primary sources of information to derive such a speech-face relationship are i) visual activity and its interaction with the speech signal and ii) co-occurrences of speakers' identities across modalities in the form of face and speech. The two approaches have their limitations: the audio-visual activity models get confused with other frequently occurring vocal activities, such as laughing and chewing, while the speakers' identity-based methods are limited to videos having enough disambiguating information to establish a speech-face association. Since the two approaches are independent, we investigate their complementary nature in this work. We propose a novel unsupervised framework to guide the speakers' cross-modal identity association with the audio-visual activity for active speaker detection. Through experiments on entertainment media videos from two benchmark datasets, the AVA active speaker (movies) and Visual Person Clustering Dataset (TV shows), we show that a simple late fusion of the two approaches enhances the active speaker detection performance.
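
The late fusion mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the fusion rule, so the weighted average below is an assumption, and the names `late_fusion`, `pick_active_speaker`, `weight`, and `threshold` are hypothetical and not taken from the paper or its repository. The example scores are invented for illustration only.

```python
from typing import List, Optional, Sequence


def late_fusion(
    activity_scores: Sequence[float],
    identity_scores: Sequence[float],
    weight: float = 0.5,
) -> List[float]:
    """Combine per-face scores from two independent models.

    Assumption: fusion is a weighted average; the paper may use a
    different rule. `weight` is the share given to the activity model.
    """
    if len(activity_scores) != len(identity_scores):
        raise ValueError("both models must score the same set of faces")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [
        weight * a + (1.0 - weight) * i
        for a, i in zip(activity_scores, identity_scores)
    ]


def pick_active_speaker(
    fused_scores: Sequence[float], threshold: float = 0.5
) -> Optional[int]:
    """Return the index of the best-scoring face, or None if no face
    reaches the (hypothetical) decision threshold."""
    if not fused_scores:
        return None
    best = max(range(len(fused_scores)), key=lambda k: fused_scores[k])
    return best if fused_scores[best] >= threshold else None


# Illustrative scores for three visible faces in one frame.
# Face 0 is laughing, so the activity model rates it highly,
# while the identity model favours face 1.
activity = [0.9, 0.8, 0.1]
identity = [0.2, 0.9, 0.1]

fused = late_fusion(activity, identity)
speaker = pick_active_speaker(fused)
```

In this made-up frame the activity scores alone would select face 0, while the fused scores select face 1, which is the kind of disagreement the abstract says the two cues can help resolve for each other.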

Code Repositories

rash1993/movie-asd (Official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
audio-visual-active-speaker-detection-on-ava | GSCMIA | validation mean average precision: 92.86%
audio-visual-active-speaker-detection-on-vpcd | GSCMIA | mean average precision: 83.90
