6 months ago

Abstract

This article presents a research methodology for audio-visual speech recognition (AVSR) in driver assistive systems. These systems necessitate ongoing interaction with drivers while driving through voice control for safety reasons. The article introduces a novel audio-visual speech command recognition transformer (AVCRFormer) specifically designed for robust AVSR. We propose (i) a multimodal fusion strategy based on spatio-temporal fusion of audio and video feature matrices, (ii) a regulated transformer based on iterative model refinement module with multiple encoders, (iii) a classifier ensemble strategy based on multiple decoders. The spatio-temporal fusion strategy preserves contextual information of both modalities and achieves their synchronization. An iterative model refinement module can bridge the gap between acoustic and visual data by leveraging their impact on speech recognition accuracy. The proposed multi-prediction strategy demonstrates superior performance compared to traditional single-prediction strategy, showcasing the model’s adaptability across diverse audio-visual contexts. The transformer proposed has achieved the highest values of speech command recognition accuracy, reaching 98.87% and 98.81% on the RUSAVIC and LRW corpora, respectively. This research has significant implications for advancing human–computer interaction. The capabilities of AVCRFormer extend beyond AVSR, making it a valuable contribution to the intersection of audio-visual processing and artificial intelligence.

Source PDF View Code

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding

Ready-to-use GPUs

Best Pricing

Get Started View Pricing

HyperAI Newsletters

Subscribe to our latest updates

We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning

HyperAI

6 months ago

Multimodal

Audio and Speech Processing

Alexey Karpov Alexey Kashevnik Denis Ivanko Elena Ryumina Alexandr Axyonov Dmitry Ryumin

Abstract

Source PDF View Code

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding

Ready-to-use GPUs

Best Pricing

Get Started View Pricing

HyperAI Newsletters

Subscribe to our latest updates

We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning

HyperAI

6 months ago

Multimodal

Audio and Speech Processing

Alexey Karpov Alexey Kashevnik Denis Ivanko Elena Ryumina Alexandr Axyonov Dmitry Ryumin

Abstract

Source PDF View Code

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding

Ready-to-use GPUs

Best Pricing

Get Started View Pricing

HyperAI Newsletters

Subscribe to our latest updates

We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning

Command Palette

Audio-Visual Speech Recognition based on Regulated Transformer and Spatio-Temporal Fusion Strategy for Driver Assistive Systems

Alexey Karpov Alexey Kashevnik Denis Ivanko Elena Ryumina Alexandr Axyonov Dmitry Ryumin

Abstract

Build AI with AI

HyperAI Newsletters

Command Palette

Audio-Visual Speech Recognition based on Regulated Transformer and Spatio-Temporal Fusion Strategy for Driver Assistive Systems

Alexey Karpov Alexey Kashevnik Denis Ivanko Elena Ryumina Alexandr Axyonov Dmitry Ryumin

Abstract

Build AI with AI

HyperAI Newsletters

Command Palette

Audio-Visual Speech Recognition based on Regulated Transformer and Spatio-Temporal Fusion Strategy for Driver Assistive Systems

Alexey Karpov Alexey Kashevnik Denis Ivanko Elena Ryumina Alexandr Axyonov Dmitry Ryumin

Abstract

Build AI with AI

HyperAI Newsletters