HyperAI

Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition

Xichen Pan, Peiyu Chen, Yichen Gong, Helong Zhou, Xinbing Wang, Zhouhan Lin


Abstract

Training Transformer-based models demands a large amount of data, while obtaining aligned and labelled multimodal data is costly, especially for audio-visual speech recognition (AVSR). It therefore makes sense to exploit unlabelled unimodal data. On the other hand, although the effectiveness of large-scale self-supervised learning is well established in both the audio and visual modalities, how to integrate those pre-trained models into a multimodal scenario remains underexplored. In this work, we successfully leverage unimodal self-supervised learning to promote multimodal AVSR. In particular, audio and visual front-ends are trained on large-scale unimodal datasets; we then integrate components of both front-ends into a larger multimodal framework that learns to recognize parallel audio-visual data as characters through a combination of CTC and seq2seq decoding. We show that both components inherited from unimodal self-supervised learning cooperate well, and the resulting multimodal framework yields competitive results through fine-tuning. Our model is experimentally validated on both word-level and sentence-level tasks. Notably, even without an external language model, our proposed model improves on the state-of-the-art performance on the widely accepted Lip Reading Sentences 2 (LRS2) dataset by a large margin, with a relative improvement of 30%.
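The abstract mentions that the multimodal framework is trained to output characters through a combination of CTC and seq2seq decoding, i.e. a hybrid objective that interpolates a CTC loss on the encoder outputs with a cross-entropy loss on the attention decoder. The sketch below illustrates that interpolation in PyTorch; the tensor shapes, the helper name `hybrid_loss`, and the weight `ctc_weight` are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def hybrid_loss(encoder_log_probs, decoder_logits, targets,
                input_lengths, target_lengths,
                ctc_weight=0.3, blank_id=0, pad_id=-100):
    """Hypothetical hybrid CTC + seq2seq objective.

    encoder_log_probs: (T, B, V) log-probabilities from the encoder
    decoder_logits:    (B, L, V) raw logits from the attention decoder
    targets:           (B, L) character indices, padded with `pad_id`
    """
    # CTC branch: expects (T, B, V) log-probs and a flat target sequence
    ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)
    flat_targets = torch.cat([t[:l] for t, l in zip(targets, target_lengths)])
    loss_ctc = ctc(encoder_log_probs, flat_targets, input_lengths, target_lengths)

    # Seq2seq branch: token-level cross entropy, skipping padding positions
    loss_att = F.cross_entropy(
        decoder_logits.reshape(-1, decoder_logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_id)

    # Linear interpolation of the two branches
    return ctc_weight * loss_ctc + (1.0 - ctc_weight) * loss_att
```

A design point worth noting: the CTC branch gives the encoder a monotonic-alignment signal, while the attention decoder models character dependencies; interpolating the two losses is a common way to get the benefits of both during training.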

Benchmarks

Benchmark                                   Methodology                     Metric
audio-visual-speech-recognition-on-lrs2     MoCo + wav2vec (w/o extLM)      Test WER: 2.6
automatic-speech-recognition-on-lrs2        MoCo + wav2vec (w/o extLM)      Test WER: 2.7
lipreading-on-lip-reading-in-the-wild       MoCo + Wav2Vec by SJTU LUMIA    Top-1 Accuracy: 85.0
lipreading-on-lrs2                          MoCo + wav2vec (w/o extLM)      Word Error Rate (WER): 43.2

