Visual Speech Enhancement Without A Real Visual Stream

Sindhu B Hegde, K R Prajwal, Rudrabha Mukhopadhyay, Vinay Namboodiri, C.V. Jawahar

Abstract

In this work, we rethink the task of speech enhancement in unconstrained real-world environments. Current state-of-the-art methods use only the audio stream, and their performance is limited across a wide range of real-world noises. Recent works that use lip movements as additional cues improve the quality of the generated speech over "audio-only" methods. However, these methods cannot be used in several applications where the visual stream is unreliable or completely absent. We propose a new paradigm for speech enhancement that exploits recent breakthroughs in speech-driven lip synthesis. Using one such model as a teacher network, we train a robust student network to produce accurate lip movements that mask away the noise, thus acting as a "visual noise filter". The intelligibility of the speech enhanced by our pseudo-lip approach is comparable (< 3% difference) to that obtained using real lips. This implies that we can exploit the advantages of lip movements even in the absence of a real video stream. We rigorously evaluate our model using quantitative metrics as well as human evaluations. Additional ablation studies and a demo video containing qualitative comparisons and results, both available on our website, illustrate the effectiveness of our approach: http://cvit.iiit.ac.in/research/projects/cvit-projects/visual-speech-enhancement-without-a-real-visual-stream. The code and models are also released for future research: https://github.com/Sindhu-Hegde/pseudo-visual-speech-denoising.
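The teacher-student idea in the abstract can be illustrated with a deliberately tiny toy. This is only a sketch under strong assumptions, not the authors' implementation: speech and lip "features" are single numbers, the frozen teacher is a hypothetical linear function, and the student is a two-parameter linear model. It also assumes that the teacher sees clean speech while the student sees only a noisy version, which the abstract does not spell out. All names (`teacher`, `train_student`, `mse`) are made up for illustration.

```python
import random


def teacher(clean):
    """Hypothetical frozen teacher: clean speech feature -> 'lip' feature."""
    return 2.0 * clean + 1.0


def train_student(steps=2000, lr=0.05, noise_std=0.3, seed=0):
    """Fit a linear student on noisy input to match the teacher's output."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0  # student parameters
    for _ in range(steps):
        clean = rng.uniform(-1.0, 1.0)
        noisy = clean + rng.gauss(0.0, noise_std)
        target = teacher(clean)   # teacher is queried with clean speech
        pred = w * noisy + b      # student only ever sees noisy speech
        err = pred - target
        # One SGD step on the squared error 0.5 * err**2.
        w -= lr * err * noisy
        b -= lr * err
    return w, b


def mse(w, b, n=1000, noise_std=0.3, seed=1):
    """Mean squared error between student and teacher on fresh noisy samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        clean = rng.uniform(-1.0, 1.0)
        noisy = clean + rng.gauss(0.0, noise_std)
        total += (w * noisy + b - teacher(clean)) ** 2
    return total / n


if __name__ == "__main__":
    w, b = train_student()
    print("untrained student MSE:", mse(0.0, 0.0))
    print("trained student MSE:  ", mse(w, b))
```

In this toy, the trained student tracks the teacher's output far more closely than the untrained one even though it never receives clean input, which is the sense in which a student can learn to be robust to noise. The real system operates on audio and video frames with deep networks rather than scalars.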

Code Repositories

Sindhu-Hegde/pseudo-visual-speech-denoising (official, PyTorch, mentioned in GitHub)

Benchmarks

Benchmark | Methodology | Metrics
speech-denoising-on-lrs2-vggsound- | | CBAK: 2.41, COVL: 2.15, CSIG: 3.16, PESQ: 2.71, STOI: 0.87
speech-denoising-on-lrs3-vggsound- | | CBAK: 2.47, COVL: 2.25, CSIG: 3.18, PESQ: 2.72, STOI: 0.88
