HyperAI

Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition

Xichen Pan, Peiyu Chen, Yichen Gong, Helong Zhou, Xinbing Wang, Zhouhan Lin


Abstract

Training Transformer-based models demands a large amount of data, while obtaining aligned and labelled multimodal data is costly, especially for audio-visual speech recognition (AVSR). It therefore makes sense to exploit unlabelled unimodal data. On the other hand, although the effectiveness of large-scale self-supervised learning is well established in both the audio and visual modalities, how to integrate those pre-trained models into a multimodal scenario remains underexplored. In this work, we successfully leverage unimodal self-supervised learning to promote multimodal AVSR. In particular, audio and visual front-ends are trained on large-scale unimodal datasets; we then integrate components of both front-ends into a larger multimodal framework that learns to recognize parallel audio-visual data as characters through a combination of CTC and seq2seq decoding. We show that both components inherited from unimodal self-supervised learning cooperate well, and the resulting multimodal framework yields competitive results through fine-tuning. Our model is experimentally validated on both word-level and sentence-level tasks. Notably, even without an external language model, our proposed model improves on the state-of-the-art performance on the widely accepted Lip Reading Sentences 2 (LRS2) dataset by a large margin, with a relative improvement of 30%.
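The abstract mentions that the multimodal framework is trained to output characters through a combination of CTC and seq2seq decoding, i.e. a hybrid objective that interpolates a CTC loss on the encoder outputs with a cross-entropy loss on the attention decoder. The sketch below illustrates that interpolation in PyTorch; the tensor shapes, the helper name `hybrid_loss`, and the weight `ctc_weight` are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def hybrid_loss(encoder_log_probs, decoder_logits, targets,
                input_lengths, target_lengths,
                ctc_weight=0.3, blank_id=0, pad_id=-100):
    """Hypothetical hybrid CTC + seq2seq objective.

    encoder_log_probs: (T, B, V) log-probabilities from the encoder
    decoder_logits:    (B, L, V) raw logits from the attention decoder
    targets:           (B, L) character indices, padded with `pad_id`
    """
    # CTC branch: expects (T, B, V) log-probs and a flat target sequence
    ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)
    flat_targets = torch.cat([t[:l] for t, l in zip(targets, target_lengths)])
    loss_ctc = ctc(encoder_log_probs, flat_targets, input_lengths, target_lengths)

    # Seq2seq branch: token-level cross entropy, skipping padding positions
    loss_att = F.cross_entropy(
        decoder_logits.reshape(-1, decoder_logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_id)

    # Linear interpolation of the two branches
    return ctc_weight * loss_ctc + (1.0 - ctc_weight) * loss_att
```

A design point worth noting: the CTC branch gives the encoder a monotonic-alignment signal, while the attention decoder models character dependencies; interpolating the two losses is a common way to get the benefits of both during training.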

Benchmarks

Benchmark                                   Methodology                     Metric
audio-visual-speech-recognition-on-lrs2     MoCo + wav2vec (w/o extLM)      Test WER: 2.6
automatic-speech-recognition-on-lrs2        MoCo + wav2vec (w/o extLM)      Test WER: 2.7
lipreading-on-lip-reading-in-the-wild       MoCo + Wav2Vec by SJTU LUMIA    Top-1 Accuracy: 85.0
lipreading-on-lrs2                          MoCo + wav2vec (w/o extLM)      Word Error Rate (WER): 43.2

