Ma Pingchuan; Haliassos Alexandros; Fernandez-Lopez Adriana; Chen Honglie; Petridis Stavros; Pantic Maja

Abstract
Audio-visual speech recognition has received a lot of attention due to its robustness against acoustic noise. Recently, the performance of automatic, visual, and audio-visual speech recognition (ASR, VSR, and AV-ASR, respectively) has been substantially improved, mainly due to the use of larger models and training sets. However, accurate labelling of datasets is time-consuming and expensive. Hence, in this work, we investigate the use of automatically-generated transcriptions of unlabelled datasets to increase the training set size. For this purpose, we use publicly-available pre-trained ASR models to automatically transcribe unlabelled datasets such as AVSpeech and VoxCeleb2. Then, we train ASR, VSR and AV-ASR models on the augmented training set, which consists of the LRS2 and LRS3 datasets as well as the additional automatically-transcribed data. We demonstrate that increasing the size of the training set, a recent trend in the literature, leads to reduced WER despite using noisy transcriptions. The proposed model achieves new state-of-the-art performance on AV-ASR on LRS2 and LRS3. In particular, it achieves a WER of 0.9% on LRS3, a relative improvement of 30% over the current state-of-the-art approach, and outperforms methods that have been trained on non-publicly available datasets with 26 times more training data.
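The data-augmentation idea in the abstract can be illustrated with a minimal sketch: unlabelled clips are transcribed by a pre-trained ASR model and merged with the labelled data. This is not the authors' code. The `Utterance` type, the function names, the clip IDs, and the `dummy_transcribe` stand-in for a real ASR model are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Utterance:
    clip_id: str
    transcript: str
    auto_labelled: bool  # True when the transcript came from an ASR model


def pseudo_label(clip_ids: Iterable[str],
                 transcribe: Callable[[str], str]) -> List[Utterance]:
    """Transcribe unlabelled clips with a pre-trained ASR model.

    `transcribe` is any callable mapping a clip ID to text; a real ASR
    model would be plugged in here (assumption, not the paper's API).
    """
    return [Utterance(cid, transcribe(cid), auto_labelled=True)
            for cid in clip_ids]


def build_training_set(labelled: List[Utterance],
                       unlabelled_ids: Iterable[str],
                       transcribe: Callable[[str], str]) -> List[Utterance]:
    """Combine human-labelled data with automatically transcribed data."""
    return list(labelled) + pseudo_label(unlabelled_ids, transcribe)


def dummy_transcribe(clip_id: str) -> str:
    """Hypothetical stand-in for a pre-trained ASR model."""
    return f"transcript for {clip_id}"


# Hypothetical clip IDs standing in for labelled and unlabelled data.
labelled = [Utterance("lrs3_0001", "hello world", auto_labelled=False)]
train = build_training_set(labelled,
                           ["vox2_0001", "avspeech_0001"],
                           dummy_transcribe)
```

Keeping an `auto_labelled` flag on each utterance is a design choice of this sketch: it makes it easy to measure how much of the training set carries noisy transcriptions.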
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| audio-visual-speech-recognition-on-lrs2 | CTC/Attention | Test WER: 1.5 |
| audio-visual-speech-recognition-on-lrs3-ted | CTC/Attention | Word Error Rate (WER): 0.9 |
| automatic-speech-recognition-asr-on-lrs3-ted | CTC/Attention | Word Error Rate (WER): 1 |
| automatic-speech-recognition-on-lrs2 | CTC/Attention | Test WER: 1.5 |
| lipreading-on-lrs2 | Auto-AVSR | Word Error Rate (WER): 14.6 |
| lipreading-on-lrs3-ted | Auto-AVSR | Word Error Rate (WER): 19.1 |
| visual-speech-recognition-on-lrs3-ted | CTC/Attention | Word Error Rate (WER): 19.1 |
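The WER figures in the table are percentages. As a reminder of how the metric is usually defined, the sketch below computes word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. It is a generic illustration, not the evaluation script used for these benchmarks; the function name and the example sentences are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One deleted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Because insertions count as errors, WER computed this way can exceed 100% when the hypothesis is much longer than the reference.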