5 months ago

RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation

Pegg Samuel ; Li Kai ; Hu Xiaolin

Abstract

Audio-visual speech separation methods aim to integrate different modalities to generate high-quality separated speech, thereby enhancing the performance of downstream tasks such as speech recognition. Most existing state-of-the-art (SOTA) models operate in the time domain. However, their overly simplistic approach to modeling acoustic features often necessitates larger and more computationally intensive models in order to achieve SOTA performance. In this paper, we present a novel time-frequency domain audio-visual speech separation method: Recurrent Time-Frequency Separation Network (RTFS-Net), which applies its algorithms on the complex time-frequency bins yielded by the Short-Time Fourier Transform. We model and capture the time and frequency dimensions of the audio independently using a multi-layered RNN along each dimension. Furthermore, we introduce a unique attention-based fusion technique for the efficient integration of audio and visual information, and a new mask separation approach that takes advantage of the intrinsic spectral nature of the acoustic features for a clearer separation. RTFS-Net outperforms the prior SOTA method in both inference speed and separation quality while reducing the number of parameters by 90% and MACs by 83%. This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
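
To make the idea of modelling the time and frequency dimensions independently more concrete, the following is a minimal NumPy sketch, not the RTFS-Net implementation. It computes complex time-frequency bins with a hand-rolled STFT, then scans a plain tanh RNN first along the frequency axis and then along the time axis. The function names (`stft`, `rnn_along_axis`), the frame length, hop size, hidden size, and the random untrained weights are all assumptions chosen for illustration; the official code is written in PyTorch and differs in every detail.

```python
import numpy as np

def stft(signal, frame_len=256, hop=128):
    """Return complex time-frequency bins with shape (frames, freq_bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)

def rnn_along_axis(x, w_in, w_rec, axis):
    """Scan a plain tanh RNN over `axis` of x, shaped (time, freq, channels).

    Each position on the other axis is treated as an independent sequence,
    so axis=1 models frequency within each frame and axis=0 models time
    within each frequency bin.
    """
    x = np.moveaxis(x, axis, 0)                  # (steps, batch, channels)
    h = np.zeros((x.shape[1], w_rec.shape[0]))   # one hidden state per sequence
    outputs = []
    for step in x:
        h = np.tanh(step @ w_in + h @ w_rec)
        outputs.append(h)
    return np.moveaxis(np.stack(outputs), 0, axis)

rng = np.random.default_rng(0)
mixture = rng.standard_normal(4000)              # stand-in for a mixed waveform

spec = stft(mixture)                             # complex, (frames, freq_bins)
# Real and imaginary parts become two input channels per time-frequency bin.
features = np.stack([spec.real, spec.imag], axis=-1)

hidden = 8                                       # illustrative hidden size
w_in_f = rng.standard_normal((2, hidden)) * 0.1
w_rec_f = rng.standard_normal((hidden, hidden)) * 0.1
w_in_t = rng.standard_normal((hidden, hidden)) * 0.1
w_rec_t = rng.standard_normal((hidden, hidden)) * 0.1

freq_out = rnn_along_axis(features, w_in_f, w_rec_f, axis=1)  # along frequency
time_out = rnn_along_axis(freq_out, w_in_t, w_rec_t, axis=0)  # along time
print(spec.shape, time_out.shape)
```

The point of the sketch is only the data layout: because each axis is scanned separately, the recurrent weights are shared across all frames (for the frequency pass) or all frequency bins (for the time pass), which keeps the parameter count independent of the spectrogram size.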

Code Repositories

spkgyk/RTFS-Net
Official
pytorch
Mentioned in GitHub

Benchmarks

Benchmark                        Methodology    Metrics
speech-separation-on-lrs2        RTFS-Net-12    SDRi: 15.1, SI-SNRi: 14.9
speech-separation-on-lrs2        RTFS-Net-6     SDRi: 14.8, SI-SNRi: 14.6
speech-separation-on-lrs2        RTFS-Net-4     SDRi: 14.3, SI-SNRi: 14.1
speech-separation-on-lrs3        RTFS-Net-6     SDRi: 17.1, SI-SNRi: 16.9
speech-separation-on-lrs3        RTFS-Net-4     SDRi: 15.6, SI-SNRi: 15.5
speech-separation-on-lrs3        RTFS-Net-12    SDRi: 17.6, SI-SNRi: 17.5
speech-separation-on-voxceleb2   RTFS-Net-4     SDRi: 12.4, SI-SNRi: 11.5
speech-separation-on-voxceleb2   RTFS-Net-12    SDRi: 13.6, SI-SNRi: 12.4
speech-separation-on-voxceleb2   RTFS-Net-6     SDRi: 12.8, SI-SNRi: 11.8
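
For readers unfamiliar with the SI-SNRi metric listed above, here is a minimal sketch of how scale-invariant SNR improvement is commonly computed, using the standard definition (conventionally reported in dB). The page does not say which evaluation code produced the benchmark numbers, so the function names `si_snr` and `si_snr_improvement` and the synthetic sine-wave signals are assumptions for illustration only and do not reproduce any benchmark result.

```python
import math

def si_snr(estimate, target, eps=1e-12):
    """Scale-invariant signal-to-noise ratio in dB (standard definition)."""
    n = len(target)
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]           # zero-mean estimate
    t = [x - mean_t for x in target]             # zero-mean target
    # Project the estimate onto the target to isolate the target component.
    scale = sum(a * b for a, b in zip(e, t)) / (sum(b * b for b in t) + eps)
    projection = [scale * b for b in t]
    noise = [a - p for a, p in zip(e, projection)]
    signal_energy = sum(p * p for p in projection)
    noise_energy = sum(r * r for r in noise)
    return 10 * math.log10(signal_energy / (noise_energy + eps))

def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: gain of the separated estimate over the unprocessed mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)

# Toy signals: two sinusoids that are orthogonal over the 1000-sample window.
n = 1000
target = [math.sin(2 * math.pi * 5 * k / n) for k in range(n)]
interferer = [math.sin(2 * math.pi * 13 * k / n) for k in range(n)]
mixture = [s + v for s, v in zip(target, interferer)]
# Hypothetical separator output that suppresses the interferer by a factor of 10.
estimate = [s + 0.1 * v for s, v in zip(target, interferer)]

improvement = si_snr_improvement(estimate, mixture, target)
print(round(improvement, 3))  # approximately 20 dB for this toy setup
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate by any non-zero constant leaves the score unchanged, which is what "scale-invariant" refers to.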
