3 months ago

Spatio-Temporal Fusion Based Convolutional Sequence Learning for Lip Reading

Shilin Wang, Feng Cheng, Xingxuan Zhang

Abstract

Current state-of-the-art approaches for lip reading are based on sequence-to-sequence architectures that were designed for neural machine translation and audio speech recognition. Hence, these methods do not fully exploit the characteristics of lip dynamics, which causes two main drawbacks. First, the short-range temporal dependencies, which are critical to the mapping from lip images to visemes, receive no extra attention. Second, local spatial information is discarded in the existing sequence models due to the use of global average pooling (GAP). To address these drawbacks, we propose a Temporal Focal block to describe short-range dependencies sufficiently and a Spatio-Temporal Fusion Module (STFM) to maintain the local spatial information while also reducing the feature dimensions. Experimental results show that our method achieves performance comparable to the state-of-the-art approach while using much less training data and a much lighter convolutional feature extractor. The training time is reduced by 12 days due to the convolutional structure and the local self-attention mechanism.
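
The second drawback named in the abstract, that GAP discards local spatial information, can be illustrated with a toy example. The sketch below is not the paper's STFM; the 2x2 feature maps, the function names and the flattening step are all assumptions chosen only to show that two maps with different spatial layouts become indistinguishable after global average pooling.

```python
def global_average_pool(feature_map):
    """Collapse an H x W feature map (a list of rows) into a single scalar."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


def flatten(feature_map):
    """Keep every spatial position by concatenating the rows (illustrative only)."""
    return [v for row in feature_map for v in row]


# Two hypothetical 2x2 activation maps: same values, different locations.
top_active = [[1.0, 1.0],
              [0.0, 0.0]]
bottom_active = [[0.0, 0.0],
                 [1.0, 1.0]]

gap_top = global_average_pool(top_active)
gap_bottom = global_average_pool(bottom_active)

# GAP maps both layouts to the same number, so the location is lost.
print(gap_top == gap_bottom)

# A representation that keeps spatial positions still tells them apart.
print(flatten(top_active) == flatten(bottom_active))
```

Flattening keeps the layout but does nothing to reduce the feature dimensions, which is the trade-off the abstract says the STFM is meant to handle.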

Benchmarks

Benchmark                Methodology     Metrics
lipreading-on-lrs2       Conv-seq2seq    Word Error Rate (WER): 51.7
lipreading-on-lrs3-ted   Conv-seq2seq    Word Error Rate (WER): 60.1
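
For readers unfamiliar with the metric, here is a minimal sketch of how a word error rate is commonly computed: the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words. It assumes the figures above are percentages and that whitespace tokenization is adequate; the function name and the example sentences are hypothetical, and the page does not describe the evaluation script actually used.

```python
def word_error_rate(reference, hypothesis):
    """Return the WER in percent between two whitespace-tokenized strings."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(wer, 1))
```

Because insertions count as errors, a WER computed this way can exceed 100 when the hypothesis is much longer than the reference.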
