Sub-word Level Lip Reading With Visual Attention

K R Prajwal, Triantafyllos Afouras, Andrew Zisserman


Abstract

The goal of this paper is to learn strong lip reading models that can recognise speech in silent videos. Most prior works deal with the open-set visual speech recognition problem by adapting existing automatic speech recognition techniques on top of trivially pooled visual features. Instead, in this paper we focus on the unique challenges encountered in lip reading and propose tailored solutions. To this end, we make the following contributions: (1) we propose an attention-based pooling mechanism to aggregate visual speech representations; (2) we use sub-word units for lip reading for the first time and show that this allows us to better model the ambiguities of the task; (3) we propose a model for Visual Speech Detection (VSD), trained on top of the lip reading network. Following the above, we obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets, and even surpass models trained on large-scale industrial datasets by using an order of magnitude less data. Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models, significantly reducing the performance gap between lip reading and automatic speech recognition. Moreover, on the AVA-ActiveSpeaker benchmark, our VSD model surpasses all visual-only baselines and even outperforms several recent audio-visual methods.
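Contribution (1) replaces trivial pooling (e.g. averaging per-frame features over time) with attention-based pooling, where the model learns how much each frame should contribute to the clip-level representation. The sketch below is a minimal, hypothetical illustration of that idea in NumPy; the paper's actual VTP module and its parameters are not reproduced here.

```python
import numpy as np

def attention_pool(feats, w):
    """Attention-based pooling sketch (illustrative, not the paper's exact
    VTP module): score each frame with a learned vector w, softmax the
    scores over time, and return the weighted sum of frame features.
    feats: (T, D) per-frame visual features; w: (D,) score weights."""
    scores = feats @ w                               # (T,) one scalar per frame
    scores = scores - scores.max()                   # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over time
    return weights @ feats                           # (D,) clip-level feature

rng = np.random.default_rng(0)
feats = rng.normal(size=(25, 64))  # e.g. 25 video frames, 64-dim features
w = rng.normal(size=64)            # hypothetical learned score vector
pooled = attention_pool(feats, w)
print(pooled.shape)  # (64,)
```

Unlike mean pooling, the output can be dominated by the most informative frames (e.g. those where the lips are clearly articulating), since their softmax weights can grow at the expense of uninformative ones.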

Benchmarks

Benchmark | Methodology | Metrics
audio-visual-active-speaker-detection-on-ava | VTP (visual only) | validation mean average precision: 89.2%
lipreading-on-lrs2 | VTP (more data) | Word Error Rate (WER): 22.6
lipreading-on-lrs2 | VTP | Word Error Rate (WER): 28.9
lipreading-on-lrs3-ted | VTP (more data) | Word Error Rate (WER): 30.7
lipreading-on-lrs3-ted | VTP | Word Error Rate (WER): 40.6
visual-speech-recognition-on-lrs2 | VTP (more data) | Word Error Rate (WER): 22.6
visual-speech-recognition-on-lrs2 | VTP | Word Error Rate (WER): 28.9
visual-speech-recognition-on-lrs3-ted | VTP | Word Error Rate (WER): 40.6
visual-speech-recognition-on-lrs3-ted | VTP (more data) | Word Error Rate (WER): 30.7
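Contribution (2), decoding into sub-word units rather than characters or whole words, can be illustrated with a toy greedy longest-match segmenter over a wordpiece-style vocabulary. This is only a sketch of the general idea; the vocabulary below is invented, and the paper's tokenizer and vocabulary are learned from data.

```python
def subword_tokenize(word, vocab):
    """Greedy longest-match sub-word segmentation (toy illustration of
    sub-word units; not the paper's actual tokenizer). Continuation
    pieces are marked with a '##' prefix, wordpiece-style."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest candidate first
            piece = word[i:j] if i == 0 else "##" + word[i:j]
            if piece in vocab:
                pieces.append(piece)
                i = j
                break
        else:
            return ["[UNK]"]  # no vocabulary piece matched at position i
    return pieces

# Hypothetical toy vocabulary for illustration only.
vocab = {"lip", "read", "##ing", "speak", "##er"}
print(subword_tokenize("reading", vocab))  # ['read', '##ing']
print(subword_tokenize("speaker", vocab))  # ['speak', '##er']
```

Sub-word units keep the output space small while still letting the decoder compose unseen words, which helps express the visual ambiguity of lip shapes (many words share near-identical mouth movements) better than forcing a single whole-word decision.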
