3 months ago

Speech Reconstruction with Reminiscent Sound via Visual Voice Memory

Yong Man Ro, Se Jin Park, Minsu Kim, Joanna Hong

Abstract

The goal of this work is to reconstruct speech from silent video in both speaker-dependent and speaker-independent settings. Unlike previous works, which have been mostly restricted to the speaker-dependent setting, we propose Visual Voice memory, which restores essential auditory information so that proper speech can be generated for different speakers, including unseen ones. The proposed memory takes additional auditory information that corresponds to the input face movements and stores auditory contexts that can be recalled by the given input visual features. Specifically, the Visual Voice memory contains value and key memory slots: value memory slots save the audio features, and key memory slots store the visual features at the same locations as the saved audio features. By guiding each memory to properly save its respective features, the model can adequately produce speech. Hence, our method employs both video and audio information during training but does not require any additional auditory input during inference. Our key contributions are: (1) proposing the Visual Voice memory, which brings rich audio information that complements the visual features, thus producing high-quality speech from silent video, and (2) enabling multi-speaker and unseen-speaker training by memorizing auditory features and the corresponding visual features. We validate the proposed framework on the GRID and Lip2Wav datasets and show that our method surpasses the performance of previous works in both multi-speaker and speaker-independent settings. We also demonstrate that the Visual Voice memory contains meaningful information for reconstructing speech.
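The key/value read described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the addressing function, so cosine similarity followed by a softmax is an assumption here, and the names `recall_audio`, `key_memory`, `value_memory`, and `scale` are hypothetical. The toy vectors stand in for learned visual and audio features.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def softmax(scores, scale=1.0):
    """Numerically stable softmax over scaled scores."""
    peak = max(scores)
    exps = [math.exp(scale * (s - peak)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def recall_audio(visual_feature, key_memory, value_memory, scale=10.0):
    """Address the key slots with a visual feature, then read the value slots.

    Assumption: similarity is cosine and addressing is a softmax; the
    source only states that key slots hold visual features and value
    slots hold audio features at the same locations.
    """
    weights = softmax([cosine(visual_feature, k) for k in key_memory], scale)
    dim = len(value_memory[0])
    recalled = [
        sum(w * slot[d] for w, slot in zip(weights, value_memory))
        for d in range(dim)
    ]
    return recalled, weights


# Toy memory with three slots: keys are 2-D "visual" vectors,
# values are 3-D "audio" vectors stored at the same slot index.
key_memory = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
value_memory = [[0.9, 0.1, 0.0], [0.0, 0.8, 0.2], [0.1, 0.0, 0.9]]

# At inference only the visual feature is available; the audio
# context is recalled from the value slots through the key slots.
recalled, weights = recall_audio([0.9, 0.1], key_memory, value_memory)
```

In this sketch the query is closest to the first key, so the recalled vector is dominated by the first value slot. No audio input is needed at read time, which mirrors the inference setting the abstract describes.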

Benchmarks

Benchmark                                      Methodology          ESTOI  PESQ   STOI
speaker-specific-lip-to-speech-synthesis-on    Visual Voice Memory  0.579  1.984  0.738
speaker-specific-lip-to-speech-synthesis-on-3  Visual Voice Memory  0.304  1.362  0.463
speaker-specific-lip-to-speech-synthesis-on-4  Visual Voice Memory  0.337  1.366  0.504
speaker-specific-lip-to-speech-synthesis-on-5  Visual Voice Memory  0.402  1.612  0.576
speaker-specific-lip-to-speech-synthesis-on-6  Visual Voice Memory  0.334  1.503  0.506
speaker-specific-lip-to-speech-synthesis-on-7  Visual Voice Memory  0.429  1.529  0.566
