
Lip Graph Assisted Audio-Visual Speech Recognition Using Bidirectional Synchronous Fusion

Bing Yang, Zhan Chen, Hong Liu

Abstract

Recent studies show that extracting representative visual features and efficiently fusing the audio and visual modalities are both vital for audio-visual speech recognition (AVSR), yet both remain challenging. To this end, we propose a lip graph assisted AVSR method with bidirectional synchronous fusion. First, a hybrid visual stream combines an image branch and a graph branch to capture discriminative visual features. Specifically, the lip graph exploits the natural and dynamic connections between lip key points to model the lip shape, and the temporal evolution of the lip graph is captured by graph convolutional networks followed by bidirectional gated recurrent units. Second, the hybrid visual stream is combined with the audio stream by an attention-based bidirectional synchronous fusion, which allows bidirectional information interaction to resolve the asynchrony between the two modalities during fusion. Experimental results on the LRW-BBC dataset show that our method outperforms the end-to-end AVSR baseline method in both clean and noisy conditions.
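The two mechanisms named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract gives no layer sizes, graph topology, or attention formulation, so everything below is an assumption made for illustration. The graph branch is approximated by a single graph convolution with the common symmetric normalization of the adjacency matrix, applied to each frame independently; the bidirectional GRU that would follow is omitted. The fusion step is approximated by plain scaled dot-product attention run in both directions, so that audio frames attend over visual frames and vice versa without requiring equal sequence lengths. The function names, the four-point ring graph, and all dimensions are hypothetical.

```python
import numpy as np


def normalized_adjacency(edges, num_nodes):
    """Return D^-1/2 (A + I) D^-1/2 for an undirected graph (assumed normalization)."""
    a = np.eye(num_nodes)  # self-loops
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def graph_conv(x, a_hat, w):
    """One graph convolution applied to every frame: ReLU(A_hat X W).

    x: (T, N, C_in) key-point features, a_hat: (N, N), w: (C_in, C_out).
    """
    return np.maximum(a_hat @ x @ w, 0.0)


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(query, context):
    """Scaled dot-product attention: each query frame gathers a weighted sum of context frames."""
    scores = query @ context.T / np.sqrt(query.shape[-1])
    weights = softmax(scores, axis=-1)  # (T_query, T_context), rows sum to 1
    return weights @ context, weights


def bidirectional_fusion(audio, visual):
    """Fuse in both directions. audio: (Ta, D), visual: (Tv, D)."""
    audio_ctx, _ = cross_attend(audio, visual)   # audio frames attend to visual frames
    visual_ctx, _ = cross_attend(visual, audio)  # visual frames attend to audio frames
    fused_audio = np.concatenate([audio, audio_ctx], axis=-1)
    fused_visual = np.concatenate([visual, visual_ctx], axis=-1)
    return fused_audio, fused_visual


# Toy data: 5 video frames, 4 lip key points with (x, y) coordinates,
# connected in a ring (hypothetical topology).
rng = np.random.default_rng(0)
num_frames, num_points = 5, 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
a_hat = normalized_adjacency(edges, num_points)

lip_points = rng.normal(size=(num_frames, num_points, 2))
w = rng.normal(size=(2, 8))
node_features = graph_conv(lip_points, a_hat, w)  # (5, 4, 8)
visual = node_features.mean(axis=1)               # pool over key points: (5, 8)

audio = rng.normal(size=(12, 8))                  # 12 audio frames, same feature size
fused_audio, fused_visual = bidirectional_fusion(audio, visual)
print(fused_audio.shape, fused_visual.shape)
```

Because the attention weights are computed between every pair of audio and visual frames, neither stream has to be resampled to the other's frame rate, which is the property the abstract relies on to handle asynchrony. In a trained model the projection `w` and any query, key, and value projections would be learned rather than random.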

Benchmarks

Benchmark	Methodology	Metrics
landmark-based-lipreading-on-lrw	Lip Graph Assisted	Top 1 Accuracy: 49.3
