8 months ago

Audio and Speech Processing

Audio Recognition

Ma Pingchuan ; Haliassos Alexandros ; Fernandez-Lopez Adriana ; Chen Honglie ; Petridis Stavros ; Pantic Maja

Abstract

Audio-visual speech recognition has received a lot of attention due to itsrobustness against acoustic noise. Recently, the performance of automatic,visual, and audio-visual speech recognition (ASR, VSR, and AV-ASR,respectively) has been substantially improved, mainly due to the use of largermodels and training sets. However, accurate labelling of datasets istime-consuming and expensive. Hence, in this work, we investigate the use ofautomatically-generated transcriptions of unlabelled datasets to increase thetraining set size. For this purpose, we use publicly-available pre-trained ASRmodels to automatically transcribe unlabelled datasets such as AVSpeech andVoxCeleb2. Then, we train ASR, VSR and AV-ASR models on the augmented trainingset, which consists of the LRS2 and LRS3 datasets as well as the additionalautomatically-transcribed data. We demonstrate that increasing the size of thetraining set, a recent trend in the literature, leads to reduced WER despiteusing noisy transcriptions. The proposed model achieves new state-of-the-artperformance on AV-ASR on LRS2 and LRS3. In particular, it achieves a WER of0.9% on LRS3, a relative improvement of 30% over the current state-of-the-artapproach, and outperforms methods that have been trained on non-publiclyavailable datasets with 26 times more training data.

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding

Ready-to-use GPUs

Best Pricing

Get Started View Pricing

HyperAI Newsletters

Subscribe to our latest updates

We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning

Powered by MailChimp

8 months ago

Audio and Speech Processing

Audio Recognition

Ma Pingchuan ; Haliassos Alexandros ; Fernandez-Lopez Adriana ; Chen Honglie ; Petridis Stavros ; Pantic Maja

Abstract

Audio-visual speech recognition has received a lot of attention due to itsrobustness against acoustic noise. Recently, the performance of automatic,visual, and audio-visual speech recognition (ASR, VSR, and AV-ASR,respectively) has been substantially improved, mainly due to the use of largermodels and training sets. However, accurate labelling of datasets istime-consuming and expensive. Hence, in this work, we investigate the use ofautomatically-generated transcriptions of unlabelled datasets to increase thetraining set size. For this purpose, we use publicly-available pre-trained ASRmodels to automatically transcribe unlabelled datasets such as AVSpeech andVoxCeleb2. Then, we train ASR, VSR and AV-ASR models on the augmented trainingset, which consists of the LRS2 and LRS3 datasets as well as the additionalautomatically-transcribed data. We demonstrate that increasing the size of thetraining set, a recent trend in the literature, leads to reduced WER despiteusing noisy transcriptions. The proposed model achieves new state-of-the-artperformance on AV-ASR on LRS2 and LRS3. In particular, it achieves a WER of0.9% on LRS3, a relative improvement of 30% over the current state-of-the-artapproach, and outperforms methods that have been trained on non-publiclyavailable datasets with 26 times more training data.

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding

Ready-to-use GPUs

Best Pricing

Get Started View Pricing

HyperAI Newsletters

Subscribe to our latest updates

We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning

Powered by MailChimp