Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels

Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, Maja Pantic

Abstract

Audio-visual speech recognition has received a lot of attention due to its robustness against acoustic noise. Recently, the performance of automatic, visual, and audio-visual speech recognition (ASR, VSR, and AV-ASR, respectively) has been substantially improved, mainly due to the use of larger models and training sets. However, accurate labelling of datasets is time-consuming and expensive. Hence, in this work, we investigate the use of automatically-generated transcriptions of unlabelled datasets to increase the training set size. For this purpose, we use publicly-available pre-trained ASR models to automatically transcribe unlabelled datasets such as AVSpeech and VoxCeleb2. Then, we train ASR, VSR and AV-ASR models on the augmented training set, which consists of the LRS2 and LRS3 datasets as well as the additional automatically-transcribed data. We demonstrate that increasing the size of the training set, a recent trend in the literature, leads to reduced WER despite using noisy transcriptions. The proposed model achieves new state-of-the-art performance on AV-ASR on LRS2 and LRS3. In particular, it achieves a WER of 0.9% on LRS3, a relative improvement of 30% over the current state-of-the-art approach, and outperforms methods that have been trained on non-publicly available datasets with 26 times more training data.
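
The data-augmentation idea in the abstract (transcribe unlabelled clips with a pre-trained ASR model, then merge them with the human-labelled LRS2/LRS3 data) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `Utterance` type, the `transcribe` callable, the clip identifiers, and the rule of skipping empty transcripts are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Utterance:
    clip_id: str          # hypothetical identifier for a video clip
    transcript: str
    auto_labelled: bool   # True if the transcript came from an ASR model


def pseudo_label(clip_ids: Iterable[str],
                 transcribe: Callable[[str], str]) -> List[Utterance]:
    """Label clips with a pre-trained ASR model passed in as `transcribe`.

    `transcribe` is a placeholder for whatever model is used; it is not
    an API of any particular library.
    """
    labelled = []
    for clip_id in clip_ids:
        text = transcribe(clip_id).strip()
        if text:  # assumption: drop clips with an empty transcript
            labelled.append(Utterance(clip_id, text, auto_labelled=True))
    return labelled


def build_training_set(human_labelled: Iterable[Utterance],
                       unlabelled_ids: Iterable[str],
                       transcribe: Callable[[str], str]) -> List[Utterance]:
    """Human-labelled data first, followed by automatically labelled data."""
    return list(human_labelled) + pseudo_label(unlabelled_ids, transcribe)


def fake_asr(clip_id: str) -> str:
    """Toy stand-in for a real ASR model, used only to run the sketch."""
    canned = {"avspeech/0001": "hello world", "voxceleb2/0001": ""}
    return canned.get(clip_id, "")


human = [Utterance("lrs3/0001", "thank you very much", auto_labelled=False)]
train = build_training_set(human, ["avspeech/0001", "voxceleb2/0001"], fake_asr)
```

Keeping an `auto_labelled` flag on each utterance is a design choice of this sketch; it makes it easy to report how much of the training set carries noisy, machine-generated transcriptions.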

Code Repositories

mpc001/auto_avsr (official, PyTorch, mentioned in GitHub)
umbertocappellazzo/llama-avsr (PyTorch, mentioned in GitHub)

Benchmarks

Benchmark | Methodology | Metrics
audio-visual-speech-recognition-on-lrs2 | CTC/Attention | Test WER: 1.5
audio-visual-speech-recognition-on-lrs3-ted | CTC/Attention | Word Error Rate (WER): 0.9
automatic-speech-recognition-asr-on-lrs3-ted | CTC/Attention | Word Error Rate (WER): 1
automatic-speech-recognition-on-lrs2 | CTC/Attention | Test WER: 1.5
lipreading-on-lrs2 | Auto-AVSR | Word Error Rate (WER): 14.6
lipreading-on-lrs3-ted | Auto-AVSR | Word Error Rate (WER): 19.1
visual-speech-recognition-on-lrs3-ted | CTC/Attention | Word Error Rate (WER): 19.1
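
For readers unfamiliar with the metric reported above, WER is conventionally the word-level edit distance (substitutions, deletions, insertions) between hypothesis and reference, divided by the number of reference words. The sketch below implements that standard definition together with the relative-improvement formula; it is not the authors' scoring script, which may apply text normalisation not shown here, and the function names and example sentences are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which `new_wer` reduces `baseline_wer`."""
    return (baseline_wer - new_wer) / baseline_wer


# One substitution (sat -> sit) and one deletion (the) over six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# Illustrative numbers only, not taken from the paper.
example_gain = relative_improvement(2.0, 1.5)
```

Here `example_wer` is 2/6 and `example_gain` is 0.25, i.e. a 25% relative improvement.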
