Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction
Bowen Shi, Wei-Ning Hsu, Kushal Lakhotia, Abdelrahman Mohamed

Abstract
Video recordings of speech contain correlated audio and visual information, providing a strong signal for speech representation learning from the speaker's lip movements and the produced sound. We introduce Audio-Visual Hidden Unit BERT (AV-HuBERT), a self-supervised representation learning framework for audio-visual speech, which masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units. AV-HuBERT learns powerful audio-visual speech representation benefiting both lip-reading and automatic speech recognition. On the largest public lip-reading benchmark LRS3 (433 hours), AV-HuBERT achieves 32.5% WER with only 30 hours of labeled data, outperforming the former state-of-the-art approach (33.6%) trained with a thousand times more transcribed video data (31K hours). The lip-reading WER is further reduced to 26.9% when using all 433 hours of labeled data from LRS3 and combined with self-training. Using our audio-visual representation on the same benchmark for audio-only speech recognition leads to a 40% relative WER reduction over the state-of-the-art performance (1.3% vs 2.3%). Our code and models are available at https://github.com/facebookresearch/av_hubert
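The core idea, masked prediction of discrete multimodal cluster targets, can be illustrated with a minimal sketch. The snippet below is a simplified, hypothetical illustration rather than the authors' implementation: it projects audio and lip-video features into a shared space, replaces masked frames with a learned embedding, and trains a Transformer encoder to classify the cluster assignment (hidden unit) of each masked frame. All module names, dimensions, and the fusion/masking details are illustrative assumptions.

```python
# Hypothetical minimal sketch of masked multimodal cluster prediction.
# NOT the AV-HuBERT implementation; dimensions, module names, and the
# fusion/masking scheme are simplifying assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleAVHuBERT(nn.Module):
    def __init__(self, audio_dim=104, video_dim=512, d_model=256,
                 n_layers=4, n_clusters=500):
        super().__init__()
        # Modality-specific front-ends project each stream to a common dimension.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)
        # Fused audio-visual features are processed by a Transformer encoder.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Learned embedding substituted at masked time steps.
        self.mask_emb = nn.Parameter(torch.randn(d_model))
        # Classifier over the discrete hidden units (cluster IDs).
        self.cluster_head = nn.Linear(d_model, n_clusters)

    def forward(self, audio, video, mask):
        # audio: (B, T, audio_dim), video: (B, T, video_dim), mask: (B, T) bool
        fused = self.audio_proj(audio) + self.video_proj(video)
        fused = torch.where(mask.unsqueeze(-1),
                            self.mask_emb.expand_as(fused), fused)
        hidden = self.encoder(fused)
        return self.cluster_head(hidden)  # (B, T, n_clusters) logits

def masked_cluster_loss(logits, cluster_targets, mask):
    # Cross-entropy against the cluster targets, computed on masked frames only.
    return F.cross_entropy(logits[mask], cluster_targets[mask])

if __name__ == "__main__":
    B, T = 2, 50
    model = SimpleAVHuBERT()
    audio = torch.randn(B, T, 104)
    video = torch.randn(B, T, 512)
    mask = torch.rand(B, T) < 0.3            # mask roughly 30% of the frames
    targets = torch.randint(0, 500, (B, T))  # cluster IDs from an offline clustering step
    loss = masked_cluster_loss(model(audio, video, mask), targets, mask)
    loss.backward()
    print(float(loss))
```

In the actual framework, the cluster targets come from automatically discovered units that are iteratively refined across training rounds; the sketch treats them as given labels to keep the objective visible.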
Code Repositories
https://github.com/facebookresearch/av_hubert
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| lipreading-on-lrs3-ted | AV-HuBERT Large | Word Error Rate (WER): 26.9 |
| speech-recognition-on-lrs3-ted | AV-HuBERT Large | Word Error Rate (WER): 1.3 |