8 months ago

Audio and Speech Processing

Audio Recognition

Pingchuan Ma, Stavros Petridis, Maja Pantic

Abstract

In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented transformer (Conformer) that can be trained in an end-to-end manner. In particular, the visual and audio encoders learn to extract features directly from raw pixels and audio waveforms, respectively. These features are fed to conformers, and fusion then takes place via a Multi-Layer Perceptron (MLP). The model learns to recognise characters using a combination of CTC and an attention mechanism. We show that three choices significantly improve the performance of our model: end-to-end training, instead of using pre-computed visual features, which is common in the literature; the use of a conformer, instead of a recurrent network; and the use of a transformer-based language model. We present results on the largest publicly available datasets for sentence-level speech recognition, Lip Reading Sentences 2 (LRS2) and Lip Reading Sentences 3 (LRS3). The results show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
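
The combination of CTC and attention mentioned in the abstract is commonly written as a weighted sum of two negative log-likelihoods. The sketch below illustrates that general idea in plain Python and is not the authors' implementation. The function names, the `ctc_weight` parameter and its default value are hypothetical placeholders; the abstract does not state how the two terms are weighted. The CTC term uses the standard forward algorithm in probability space, which is fine for toy inputs but would underflow on long sequences, where a log-space version is normally used.

```python
import math


def ctc_neg_log_likelihood(frame_probs, target, blank=0):
    """CTC loss for one utterance via the forward algorithm.

    frame_probs: list of T per-frame distributions over the vocabulary.
    target: list of label ids (without blanks).
    """
    # Extended target with blanks around every label: b, y1, b, y2, ..., b
    ext = [blank]
    for label in target:
        ext += [label, blank]
    size = len(ext)

    # alpha[s]: total probability of partial alignments ending at ext[s]
    alpha = [0.0] * size
    alpha[0] = frame_probs[0][ext[0]]
    if size > 1:
        alpha[1] = frame_probs[0][ext[1]]

    for t in range(1, len(frame_probs)):
        new_alpha = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # A blank may be skipped only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * frame_probs[t][ext[s]]
        alpha = new_alpha

    # Valid alignments end on the last label or on the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return math.inf if likelihood == 0.0 else -math.log(likelihood)


def attention_neg_log_likelihood(step_probs, target):
    """Cross-entropy of an attention decoder: one distribution per output step."""
    return -sum(math.log(step_probs[i][label]) for i, label in enumerate(target))


def hybrid_loss(frame_probs, step_probs, target, ctc_weight=0.5):
    """Weighted sum of both terms; ctc_weight=0.5 is a placeholder value."""
    ctc_term = ctc_neg_log_likelihood(frame_probs, target)
    att_term = attention_neg_log_likelihood(step_probs, target)
    return ctc_weight * ctc_term + (1.0 - ctc_weight) * att_term


# Toy example: 2 frames, vocabulary {0: blank, 1: "a"}, target "a".
# Three of the four alignments (a-blank, blank-a, a-a) collapse to "a",
# so the CTC likelihood is 0.75.
frames = [[0.5, 0.5], [0.5, 0.5]]
decoder_steps = [[0.2, 0.8]]
loss = hybrid_loss(frames, decoder_steps, target=[1])
```

The CTC term pushes the encoder towards a monotonic frame-to-character alignment, while the attention term models the character sequence as a whole; weighting the two lets a single set of encoder features serve both objectives.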
