5 months ago

Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing

Yeo Jeong Hun ; Han Seunghee ; Kim Minsu ; Ro Yong Man


Abstract

In visual speech processing, context modeling capability is one of the most important requirements due to the ambiguous nature of lip movements. For example, homophenes, words that share identical lip movements but produce different sounds, can be distinguished by considering the context. In this paper, we propose a novel framework, namely Visual Speech Processing incorporated with LLMs (VSP-LLM), to maximize the context modeling ability by bringing the overwhelming power of LLMs. Specifically, VSP-LLM is designed to perform multi-tasks of visual speech recognition and translation, where the given instructions control the type of task. The input video is mapped to the input latent space of an LLM by employing a self-supervised visual speech model. Focused on the fact that there is redundant information in input frames, we propose a novel deduplication method that reduces the embedded visual features by employing visual speech units. Through the proposed deduplication and Low Rank Adaptation (LoRA), VSP-LLM can be trained in a computationally efficient manner. In the translation dataset, the MuAViC benchmark, we demonstrate that VSP-LLM trained on just 30 hours of labeled data can more effectively translate lip movements compared to the recent model trained with 433 hours of data.
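The deduplication idea in the abstract can be illustrated with a small sketch. It assumes that each video frame already has a feature vector and a discrete visual speech unit id, and that runs of consecutive frames sharing the same unit are merged by averaging their features. The averaging rule, the function name `deduplicate`, and the toy data are illustrative assumptions, not details confirmed by the abstract.

```python
from itertools import groupby


def deduplicate(features, units):
    """Merge runs of consecutive frames that share the same unit id.

    features: list of per-frame feature vectors (lists of floats).
    units:    list of per-frame unit ids, same length as features.
    Returns one averaged feature vector per run (assumed merge rule).
    """
    if len(features) != len(units):
        raise ValueError("features and units must have the same length")

    reduced = []
    start = 0
    for _unit, run in groupby(units):
        run_len = len(list(run))
        chunk = features[start:start + run_len]
        dim = len(chunk[0])
        # Element-wise mean over the frames in this run.
        reduced.append([sum(f[d] for f in chunk) / run_len for d in range(dim)])
        start += run_len
    return reduced


# Toy example: frames 0 and 1 share unit 7, so they collapse into one vector.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
unit_ids = [7, 7, 3, 7]
print(deduplicate(frames, unit_ids))  # [[2.0, 3.0], [5.0, 6.0], [7.0, 8.0]]
```

Only adjacent repeats are merged here: the final frame keeps its own vector even though unit 7 appeared earlier, so the temporal order of the sequence passed to the LLM is preserved while its length shrinks.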

Code Repositories

sally-sh/vsp-llm (official implementation, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
lipreading-on-lrs3-ted | VSP-LLM | Word Error Rate (WER): 25.4
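
For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a hypothesis and a reference transcript, divided by the number of reference words. The sketch below computes it as a percentage, which is presumably how the 25.4 figure is expressed. The function name `word_error_rate` and the whitespace tokenization are assumptions; the text normalization used for the benchmark is not stated on this page.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent using word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("a b c d", "a x c d"))  # 25.0
```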
