Sub-word Level Lip Reading With Visual Attention

K R Prajwal, Triantafyllos Afouras, Andrew Zisserman


Abstract

The goal of this paper is to learn strong lip reading models that can recognise speech in silent videos. Most prior works deal with the open-set visual speech recognition problem by adapting existing automatic speech recognition techniques on top of trivially pooled visual features. Instead, in this paper we focus on the unique challenges encountered in lip reading and propose tailored solutions. To this end, we make the following contributions: (1) we propose an attention-based pooling mechanism to aggregate visual speech representations; (2) we use sub-word units for lip reading for the first time and show that this allows us to better model the ambiguities of the task; (3) we propose a model for Visual Speech Detection (VSD), trained on top of the lip reading network. Following the above, we obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets, and even surpass models trained on large-scale industrial datasets by using an order of magnitude less data. Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models, significantly reducing the performance gap between lip reading and automatic speech recognition. Moreover, on the AVA-ActiveSpeaker benchmark, our VSD model surpasses all visual-only baselines and even outperforms several recent audio-visual methods.
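Contribution (1) replaces trivial pooling (e.g. averaging per-frame features over time) with attention-based pooling, where the model learns how much each frame should contribute to the clip-level representation. The sketch below is a minimal, hypothetical illustration of that idea in NumPy; the paper's actual VTP module and its parameters are not reproduced here.

```python
import numpy as np

def attention_pool(feats, w):
    """Attention-based pooling sketch (illustrative, not the paper's exact
    VTP module): score each frame with a learned vector w, softmax the
    scores over time, and return the weighted sum of frame features.
    feats: (T, D) per-frame visual features; w: (D,) score weights."""
    scores = feats @ w                               # (T,) one scalar per frame
    scores = scores - scores.max()                   # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over time
    return weights @ feats                           # (D,) clip-level feature

rng = np.random.default_rng(0)
feats = rng.normal(size=(25, 64))  # e.g. 25 video frames, 64-dim features
w = rng.normal(size=64)            # hypothetical learned score vector
pooled = attention_pool(feats, w)
print(pooled.shape)  # (64,)
```

Unlike mean pooling, the output can be dominated by the most informative frames (e.g. those where the lips are clearly articulating), since their softmax weights can grow at the expense of uninformative ones.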

Benchmarks

Benchmark | Methodology | Metrics
audio-visual-active-speaker-detection-on-ava | VTP (visual only) | validation mean average precision: 89.2%
lipreading-on-lrs2 | VTP (more data) | Word Error Rate (WER): 22.6
lipreading-on-lrs2 | VTP | Word Error Rate (WER): 28.9
lipreading-on-lrs3-ted | VTP (more data) | Word Error Rate (WER): 30.7
lipreading-on-lrs3-ted | VTP | Word Error Rate (WER): 40.6
visual-speech-recognition-on-lrs2 | VTP (more data) | Word Error Rate (WER): 22.6
visual-speech-recognition-on-lrs2 | VTP | Word Error Rate (WER): 28.9
visual-speech-recognition-on-lrs3-ted | VTP | Word Error Rate (WER): 40.6
visual-speech-recognition-on-lrs3-ted | VTP (more data) | Word Error Rate (WER): 30.7
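Contribution (2), decoding into sub-word units rather than characters or whole words, can be illustrated with a toy greedy longest-match segmenter over a wordpiece-style vocabulary. This is only a sketch of the general idea; the vocabulary below is invented, and the paper's tokenizer and vocabulary are learned from data.

```python
def subword_tokenize(word, vocab):
    """Greedy longest-match sub-word segmentation (toy illustration of
    sub-word units; not the paper's actual tokenizer). Continuation
    pieces are marked with a '##' prefix, wordpiece-style."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest candidate first
            piece = word[i:j] if i == 0 else "##" + word[i:j]
            if piece in vocab:
                pieces.append(piece)
                i = j
                break
        else:
            return ["[UNK]"]  # no vocabulary piece matched at position i
    return pieces

# Hypothetical toy vocabulary for illustration only.
vocab = {"lip", "read", "##ing", "speak", "##er"}
print(subword_tokenize("reading", vocab))  # ['read', '##ing']
print(subword_tokenize("speaker", vocab))  # ['speak', '##er']
```

Sub-word units keep the output space small while still letting the decoder compose unseen words, which helps express the visual ambiguity of lip shapes (many words share near-identical mouth movements) better than forcing a single whole-word decision.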
