Audio-Visual Speech Recognition With A Hybrid CTC/Attention Architecture

Stavros Petridis; Themos Stafylakis; Pingchuan Ma; Georgios Tzimiropoulos; Maja Pantic

Abstract

Recent works in speech recognition rely either on connectionist temporal classification (CTC) or sequence-to-sequence models for character-level recognition. CTC assumes conditional independence of individual characters, whereas attention-based models can provide nonsequential alignments. Therefore, we could use a CTC loss in combination with an attention-based model in order to force monotonic alignments and at the same time get rid of the conditional independence assumption. In this paper, we use the recently proposed hybrid CTC/attention architecture for audio-visual recognition of speech in-the-wild. To the best of our knowledge, this is the first time that such a hybrid architecture is used for audio-visual recognition of speech. We use the LRS2 database and show that the proposed audio-visual model leads to a 1.3% absolute decrease in word error rate over the audio-only model and achieves the new state-of-the-art performance on the LRS2 database (7% word error rate). We also observe that the audio-visual model significantly outperforms the audio-based model (up to 32.9% absolute improvement in word error rate) for several different types of noise as the signal-to-noise ratio decreases.
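
The abstract does not spell out the training objective. In the commonly used formulation of hybrid CTC/attention training, the two losses are interpolated with a single weight, so the CTC term pushes the model toward monotonic alignments while the attention term avoids the conditional independence assumption. The sketch below illustrates only that interpolation; the function name, the parameter name `ctc_weight`, and the default value 0.2 are assumptions made for illustration and are not taken from this paper.

```python
def hybrid_loss(ctc_loss: float, attention_loss: float, ctc_weight: float = 0.2) -> float:
    """Interpolate a CTC loss and an attention (seq2seq) loss.

    ctc_weight is a hypothetical hyperparameter in [0, 1]:
      1.0 -> pure CTC objective
      0.0 -> pure attention objective
    The default of 0.2 is an illustrative value, not one reported in the abstract.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


if __name__ == "__main__":
    # Made-up per-utterance loss values, for demonstration only.
    print(hybrid_loss(ctc_loss=10.0, attention_loss=20.0, ctc_weight=0.5))
```

In a real system both inputs would be differentiable losses computed from a shared encoder, and the same kind of weighting is often applied to the two scores at decoding time; neither detail is confirmed by the abstract.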

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| audio-visual-speech-recognition-on-lrs2 | CTC/Attention | Test WER: 7.0 |
| automatic-speech-recognition-on-lrs2 | CTC/attention | Test WER: 8.2 |
| lipreading-on-lrs2 | Hybrid CTC / Attention | Word Error Rate (WER): 50 |
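
Word error rate (WER), the metric used in these benchmarks, is conventionally the word-level edit distance between a hypothesis and its reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the paper's evaluation script; the function name and the whitespace tokenization are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length.

    Words are obtained by whitespace splitting (an assumption; real
    evaluation scripts usually also normalize case and punctuation).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words to an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
```

The improvements quoted in the abstract are absolute, meaning they are differences between two WER percentages rather than ratios of them.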
