
Abstract

Active speaker detection in videos addresses associating a source face, visible in the video frames, with the underlying speech in the audio modality. The two primary sources of information to derive such a speech-face relationship are i) visual activity and its interaction with the speech signal and ii) co-occurrences of speakers' identities across modalities in the form of face and speech. The two approaches have their limitations: the audio-visual activity models get confused with other frequently occurring vocal activities, such as laughing and chewing, while the speakers' identity-based methods are limited to videos having enough disambiguating information to establish a speech-face association. Since the two approaches are independent, we investigate their complementary nature in this work. We propose a novel unsupervised framework to guide the speakers' cross-modal identity association with the audio-visual activity for active speaker detection. Through experiments on entertainment media videos from two benchmark datasets, the AVA active speaker (movies) and Visual Person Clustering Dataset (TV shows), we show that a simple late fusion of the two approaches enhances the active speaker detection performance.
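As a rough illustration of what a "simple late fusion" of two score sources can look like, the sketch below combines per-face scores from an audio-visual activity model with per-face scores from an identity-association model. The abstract does not specify the fusion rule, so the convex combination, the `weight` and `threshold` parameters, the function names, and the assumption that both score sets lie in [0, 1] are all hypothetical choices made here for illustration, not the authors' method.

```python
import numpy as np


def late_fusion(activity_scores, identity_scores, weight=0.5):
    """Combine two per-face score arrays with a convex combination.

    Assumption (not from the paper): both inputs hold one score per
    candidate face, scaled to [0, 1], and `weight` is the share given
    to the audio-visual activity scores.
    """
    activity = np.asarray(activity_scores, dtype=float)
    identity = np.asarray(identity_scores, dtype=float)
    if activity.shape != identity.shape:
        raise ValueError("score arrays must have the same shape")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * activity + (1.0 - weight) * identity


def pick_active_speaker(fused_scores, threshold=0.5):
    """Return the index of the best-scoring face, or None if no face
    reaches the (hypothetical) decision threshold."""
    fused = np.asarray(fused_scores, dtype=float)
    best = int(np.argmax(fused))
    return best if fused[best] >= threshold else None


# Two candidate faces in one frame: the activity model favours face 0,
# while the identity model is weakly in favour of face 1.
fused = late_fusion([0.9, 0.2], [0.3, 0.4], weight=0.5)
speaker = pick_active_speaker(fused)
```

Because the fusion happens on the output scores only, either model can be swapped or retrained without touching the other, which is the usual reason to prefer late fusion when the two information sources are independent.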
