Accurate and Resource-Efficient Lipreading with EfficientNetV2 and Transformers

Gerasimos Potamianos, Alexandros Koumparoulis

Abstract

We present a novel resource-efficient end-to-end architecture for lipreading that achieves state-of-the-art results on a popular and challenging benchmark. We make the following contributions. First, inspired by the recent success of the EfficientNet architecture in image classification and by our earlier work on resource-efficient lipreading models (MobiLipNet), we introduce EfficientNets to the lipreading task. Second, we show that the 3D front-end currently most popular in the literature contains a max-pool layer that prevents networks from reaching superior performance, and we propose its removal. Finally, we improve the robustness of our system's back-end by including a Transformer encoder. We evaluate the proposed system on the “Lipreading In-The-Wild” (LRW) corpus, a database of short video segments from BBC TV broadcasts. The proposed network (T-variant) attains 88.53% word accuracy, a 0.17% absolute improvement over the current state of the art, while being five times less computationally intensive. Further, an up-scaled version of our model (L-variant) achieves 89.52% word accuracy, a new state-of-the-art result on the LRW corpus.
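
To make the second contribution concrete, the following minimal sketch computes the feature-map size leaving a 3D front-end with and without its max-pool layer. The abstract gives no layer hyperparameters, so every number here is an assumption for illustration: a 5x7x7 convolution with stride 1x2x2 and padding 2x3x3, a 1x3x3 max-pool with stride 1x2x2 and padding 0x1x1, and a 29-frame clip of 88x88 mouth crops. The function names are hypothetical and are not taken from the authors' code.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def front_end_shape(frames, height, width, use_max_pool):
    """Return (time, height, width) of the front-end output feature map."""
    # Assumed 3D convolution: kernel 5x7x7, stride 1x2x2, padding 2x3x3.
    t = conv_out(frames, 5, 1, 2)
    h = conv_out(height, 7, 2, 3)
    w = conv_out(width, 7, 2, 3)
    if use_max_pool:
        # Assumed max-pool: kernel 1x3x3, stride 1x2x2, padding 0x1x1.
        # It leaves the time axis untouched and halves each spatial axis.
        h = conv_out(h, 3, 2, 1)
        w = conv_out(w, 3, 2, 1)
    return t, h, w


with_pool = front_end_shape(29, 88, 88, use_max_pool=True)
without_pool = front_end_shape(29, 88, 88, use_max_pool=False)
print("with max-pool:   ", with_pool)
print("without max-pool:", without_pool)
```

Under these assumed settings, dropping the max-pool doubles the height and width of the feature map handed to the 2D network, so four times as many spatial positions reach it per frame. This only illustrates the resolution trade-off; it does not reproduce the paper's experiments or explain its accuracy gains.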

Benchmarks

Benchmark: lipreading-on-lip-reading-in-the-wild
Methodology: 3D Conv + EfficientNetV2 + Transformer + TCN
Metrics: Top-1 Accuracy: 89.52
