5 months ago

Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation

Andrew Rouditchenko; Yuan Gong; Samuel Thomas; Leonid Karlinsky; Hilde Kuehne; Rogerio Feris; James Glass

Abstract

Audio-Visual Speech Recognition (AVSR) uses lip-based video to improve performance in noise. Since videos are harder to obtain than audio, the video training data of AVSR models is usually limited to a few thousand hours. In contrast, speech models such as Whisper are trained on hundreds of thousands of hours of data, and thus learn a better speech-to-text decoder. This large difference in training data motivates us to adapt Whisper to handle video inputs. Inspired by Flamingo, which injects visual features into language models, we propose Whisper-Flamingo, which integrates visual features into the Whisper speech recognition and translation model with gated cross attention. Our models achieve state-of-the-art ASR WER (0.68%) and AVSR WER (0.76%) on LRS3, and state-of-the-art ASR WER (1.3%) and AVSR WER (1.4%) on LRS2. Audio-visual Whisper-Flamingo outperforms audio-only Whisper on English speech recognition and En-X translation for 6 languages in noisy conditions. Moreover, Whisper-Flamingo is versatile and performs all of these tasks using one set of parameters, while prior methods are trained separately on each language.
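
The gating idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes the Flamingo-style form, in which the cross-attention output is scaled by tanh of a learnable scalar before being added back to the decoder state. The abstract only names "gated cross attention", so the exact form, the function names, and the absence of learned projections and multiple heads below are all simplifying assumptions, not the authors' implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention.

    Assumption: no learned query/key/value projections, to keep the sketch short.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys_values
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(d)]
        )
    return out


def gated_cross_attention(decoder_states, visual_feats, alpha):
    """Residual cross-attention scaled by tanh(alpha) (hypothetical helper).

    With alpha = 0 the gate is closed and the layer is an identity map,
    so a pretrained decoder would behave as before at the start of training.
    """
    gate = math.tanh(alpha)
    attended = cross_attention(decoder_states, visual_feats)
    return [
        [x + gate * a for x, a in zip(row, att)]
        for row, att in zip(decoder_states, attended)
    ]


decoder_states = [[1.0, 0.0], [0.0, 1.0]]            # two decoder positions
visual_feats = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]  # three video frames

closed = gated_cross_attention(decoder_states, visual_feats, alpha=0.0)
opened = gated_cross_attention(decoder_states, visual_feats, alpha=1.0)
```

With `alpha=0.0` the output equals the input decoder states, while a non-zero `alpha` mixes in the attended visual features. In this design the gate lets the new layers start as a no-op and open gradually during fine-tuning.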

Code Repositories

roudimit/whisper-flamingo (Official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
audio-visual-speech-recognition-on-lrs2 | Whisper-Flamingo | Test WER: 1.4
audio-visual-speech-recognition-on-lrs3-ted | Whisper-Flamingo | Word Error Rate (WER): 0.76
automatic-speech-recognition-on-lrs2 | Whisper | Test WER: 1.3
speech-recognition-on-lrs3-ted | Whisper | Word Error Rate (WER): 0.68
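
For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below shows that standard definition. The function name is hypothetical, and the text normalisation used for the benchmark numbers above is not described on this page, so none is applied here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
example = word_error_rate("a b c d", "a x c d")
```

Under this definition, a WER of 0.76% corresponds to fewer than one word error per hundred reference words.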
