
Abstract

Audio-visual speech separation methods aim to integrate different modalities to generate high-quality separated speech, thereby enhancing the performance of downstream tasks such as speech recognition. Most existing state-of-the-art (SOTA) models operate in the time domain. However, their overly simplistic approach to modeling acoustic features often necessitates larger and more computationally intensive models in order to achieve SOTA performance. In this paper, we present a novel time-frequency domain audio-visual speech separation method: Recurrent Time-Frequency Separation Network (RTFS-Net), which applies its algorithms on the complex time-frequency bins yielded by the Short-Time Fourier Transform. We model and capture the time and frequency dimensions of the audio independently using a multi-layered RNN along each dimension. Furthermore, we introduce a unique attention-based fusion technique for the efficient integration of audio and visual information, and a new mask separation approach that takes advantage of the intrinsic spectral nature of the acoustic features for a clearer separation. RTFS-Net outperforms the prior SOTA method in both inference speed and separation quality while reducing the number of parameters by 90% and MACs by 83%. This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
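To make the phrase "complex time-frequency bins" concrete, the following minimal NumPy sketch computes a Short-Time Fourier Transform of a toy mixture and applies a mask by element-wise complex multiplication. It is an illustration under stated assumptions, not the RTFS-Net implementation: the function names `stft` and `apply_complex_mask`, the window and hop sizes, the two-tone test signal, and the hand-built low-pass mask are all hypothetical. In a learned system the mask would be predicted by the network rather than fixed by hand.

```python
import numpy as np


def stft(signal, n_fft=256, hop=128):
    """Return complex time-frequency bins, shape (frames, n_fft // 2 + 1).

    Hypothetical helper: Hann-windowed frames followed by a real FFT.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)


def apply_complex_mask(spec, mask):
    """Element-wise complex multiplication of spectrogram and mask."""
    return spec * mask


# Toy "mixture": a 440 Hz tone plus a quieter 2000 Hz tone (assumed data).
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
mixture = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 2000 * t)

spec = stft(mixture)  # complex array with axes (time frames, frequency bins)

# Hand-built stand-in for a predicted mask: keep only bins below 1 kHz.
freqs = np.fft.rfftfreq(256, d=1 / sample_rate)
mask = (freqs < 1000).astype(np.complex128)[None, :]

separated = apply_complex_mask(spec, mask)
print(spec.shape, spec.dtype)
```

The two axes of `spec` are the time and frequency dimensions that the abstract describes as being modeled independently; a sequence model could be run along either axis of this array.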
