Abstract

Recent advances in the Active Speaker Detection (ASD) problem build upon a two-stage process: feature extraction and spatio-temporal context aggregation. In this paper, we propose an end-to-end ASD workflow where feature learning and contextual predictions are jointly learned. Our end-to-end trainable network simultaneously learns multi-modal embeddings and aggregates spatio-temporal context. This results in more suitable feature representations and improved performance in the ASD task. We also introduce interleaved graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem. Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance. Finally, we design a weakly-supervised strategy, which demonstrates that the ASD problem can also be approached by utilizing audiovisual data but relying exclusively on audio annotations. We achieve this by modelling the direct relationship between the audio signal and the possible sound sources (speakers), as well as introducing a contrastive loss. All the resources of this project will be made available at: https://github.com/fuankarion/end-to-end-asd.
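
To make the idea of splitting message passing by context source concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the two sources of context are spatial (different speakers visible in the same frame) and temporal (the same speaker in consecutive frames); the abstract does not spell this out. The function names, the mean aggregation rule, and the absence of learned weights are all illustrative assumptions.

```python
import numpy as np

def build_adjacencies(frame_ids, speaker_ids):
    """Build two separate edge sets over face-crop nodes (hypothetical graph layout).

    spatial:  edges between different speakers in the same frame
    temporal: edges between the same speaker in consecutive frames
    """
    frame_ids = np.asarray(frame_ids)
    speaker_ids = np.asarray(speaker_ids)
    same_frame = frame_ids[:, None] == frame_ids[None, :]
    same_speaker = speaker_ids[:, None] == speaker_ids[None, :]
    consecutive = np.abs(frame_ids[:, None] - frame_ids[None, :]) == 1
    not_self = ~np.eye(len(frame_ids), dtype=bool)
    spatial = (same_frame & not_self).astype(float)
    temporal = (same_speaker & consecutive).astype(float)
    return spatial, temporal

def mean_aggregate(features, adjacency):
    """One round of mean message passing; each node also keeps a self-loop."""
    adj = adjacency + np.eye(adjacency.shape[0])
    degree = adj.sum(axis=1, keepdims=True)
    return (adj @ features) / degree

def interleaved_block(features, spatial, temporal):
    """Assumed interleaving: a spatial step followed by a temporal step.

    A real block would apply learned transformations and non-linearities;
    they are omitted here to keep the split visible.
    """
    hidden = mean_aggregate(features, spatial)
    return mean_aggregate(hidden, temporal)

# Toy example: two speakers observed in two consecutive frames.
frame_ids = [0, 0, 1, 1]
speaker_ids = [0, 1, 0, 1]
spatial, temporal = build_adjacencies(frame_ids, speaker_ids)
features = np.eye(4)  # one-hot features make the mixing easy to inspect
mixed = interleaved_block(features, spatial, temporal)
```

In this toy graph every node reaches every other node after one spatial and one temporal step, so a single block already mixes information across both speakers and both frames while keeping the two kinds of edges separate.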
