5 months ago

Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction

Shi Bowen ; Hsu Wei-Ning ; Lakhotia Kushal ; Mohamed Abdelrahman

Abstract

Video recordings of speech contain correlated audio and visual information, providing a strong signal for speech representation learning from the speaker's lip movements and the produced sound. We introduce Audio-Visual Hidden Unit BERT (AV-HuBERT), a self-supervised representation learning framework for audio-visual speech, which masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units. AV-HuBERT learns powerful audio-visual speech representation benefiting both lip-reading and automatic speech recognition. On the largest public lip-reading benchmark LRS3 (433 hours), AV-HuBERT achieves 32.5% WER with only 30 hours of labeled data, outperforming the former state-of-the-art approach (33.6%) trained with a thousand times more transcribed video data (31K hours). The lip-reading WER is further reduced to 26.9% when using all 433 hours of labeled data from LRS3 and combined with self-training. Using our audio-visual representation on the same benchmark for audio-only speech recognition leads to a 40% relative WER reduction over the state-of-the-art performance (1.3% vs 2.3%). Our code and models are available at https://github.com/facebookresearch/av_hubert
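
The results above are reported as word error rate (WER) and as a relative WER reduction. The following is a minimal illustrative sketch of how these two quantities are conventionally computed, assuming the standard definition of WER as word-level edit distance divided by the number of reference words. The function names word_error_rate and relative_reduction are hypothetical and are not taken from the AV-HuBERT codebase.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers the error rate relative to `baseline`."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    # Audio-only figures quoted in the abstract (2.3% baseline, 1.3% AV-HuBERT).
    print(relative_reduction(2.3, 1.3))
```

Applied to the rounded figures quoted in the abstract (2.3% and 1.3%), relative_reduction gives roughly 0.43, in line with the approximately 40% relative reduction reported.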

Code Repositories

facebookresearch/av_hubert (Official, PyTorch, mentioned in GitHub)
guxm2021/MM_ALT (PyTorch, mentioned in GitHub)

Benchmarks

Benchmark | Methodology | Metrics
lipreading-on-lrs3-ted | AV-HuBERT Large | Word Error Rate (WER): 26.9
speech-recognition-on-lrs3-ted | AV-HuBERT Large | Word Error Rate (WER): 1.3
