5 months ago

Jointly Learning Visual and Auditory Speech Representations from Raw Data

Haliassos Alexandros ; Ma Pingchuan ; Mira Rodrigo ; Petridis Stavros ; Pantic Maja

Abstract

We present RAVEn, a self-supervised multi-modal approach to jointly learn visual and auditory speech representations. Our pre-training objective involves encoding masked inputs, and then predicting contextualised targets generated by slowly-evolving momentum encoders. Driven by the inherent differences between video and audio, our design is asymmetric w.r.t. the two modalities' pretext tasks: whereas the auditory stream predicts both the visual and auditory targets, the visual one predicts only the auditory targets. We observe strong results in low- and high-resource labelled data settings when fine-tuning the visual and auditory encoders resulting from a single pre-training stage, in which the encoders are jointly trained. Notably, RAVEn surpasses all self-supervised methods on visual speech recognition (VSR) on LRS3, and combining RAVEn with self-training using only 30 hours of labelled data even outperforms a recent semi-supervised method trained on 90,000 hours of non-public data. At the same time, we achieve state-of-the-art results in the LRS3 low-resource setting for auditory speech recognition (as well as for VSR). Our findings point to the viability of learning powerful speech representations entirely from raw video and audio, i.e., without relying on handcrafted features. Code and models are available at https://github.com/ahaliassos/raven.
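The two ideas in the abstract that are easiest to misread are the "slowly-evolving momentum encoders" and the asymmetric pretext tasks. The sketch below illustrates both in plain Python. It assumes the momentum encoders are updated as an exponential moving average (EMA) of the trained encoders, which is a common choice but is not stated in the abstract. The names `ema_update`, `PREDICTION_TARGETS`, and `total_loss`, and the momentum value, are hypothetical and are not taken from the official repository.

```python
from typing import Dict, List, Tuple


def ema_update(target: List[float], online: List[float],
               momentum: float = 0.999) -> List[float]:
    """Move the target (momentum) weights a small step toward the online weights.

    Assumption: an EMA rule. A momentum close to 1 makes the targets evolve slowly.
    """
    return [momentum * t + (1.0 - momentum) * o for t, o in zip(target, online)]


# Asymmetry described in the abstract: the auditory stream predicts both the
# visual and auditory targets, the visual stream predicts only auditory targets.
PREDICTION_TARGETS: Dict[str, List[str]] = {
    "audio": ["video", "audio"],
    "video": ["audio"],
}


def total_loss(pair_losses: Dict[Tuple[str, str], float]) -> float:
    """Sum losses over the (student, target) pairs that are actually trained.

    Pairs absent from PREDICTION_TARGETS (here: video -> video) are ignored.
    Equal weighting of the pairs is an assumption made for illustration.
    """
    return sum(
        pair_losses[(student, target)]
        for student, targets in PREDICTION_TARGETS.items()
        for target in targets
    )
```

With this layout, changing which targets a stream predicts is a one-line edit to `PREDICTION_TARGETS`, which makes the asymmetry explicit rather than buried in the training loop.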

Code Repositories

ahaliassos/raven (official, PyTorch)

Benchmarks

Benchmark                                      Methodology    Word Error Rate (WER)
audio-visual-speech-recognition-on-lrs3-ted    RAVEn Large    1.4
lipreading-on-lrs2                             RAVEn Large    18.6
lipreading-on-lrs3-ted                         RAVEn Large    23.4
speech-recognition-on-lrs2                     RAVEn Large    2.1
speech-recognition-on-lrs3-ted                 RAVEn Large    1.4
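
For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, insertions) between a hypothesis and a reference transcript, divided by the number of reference words. The following is a minimal sketch, assuming whitespace tokenisation and assuming the figures above are percentages. The function name `word_error_rate` is hypothetical, and this is not the evaluation code used for the listed results.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent using word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return 100.0 * prev[-1] / len(ref)
```

Because the denominator is the reference length, a hypothesis with many inserted words can produce a WER above 100.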
