MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens

Jeong Hun Yeo, Hyeongseop Rha, Se Jin Park, Yong Man Ro

Abstract

Audio-Visual Speech Recognition (AVSR) achieves robust speech recognition in noisy environments by combining auditory and visual information. However, recent Large Language Model (LLM)-based AVSR systems incur high computational costs due to the high temporal resolution of audio-visual speech processed by LLMs. In this work, we introduce an efficient multimodal speech LLM framework that minimizes token length while preserving essential linguistic content. Our approach employs an early AV-fusion module for streamlined feature integration, an audio-visual speech Q-Former that dynamically allocates tokens based on input duration, and a refined query allocation strategy with a speech rate predictor that adjusts token allocation according to the speaking speed of each audio sample. Extensive experiments on the LRS3 dataset show that our method achieves state-of-the-art performance with a WER of 0.72% while using only 3.5 tokens per second. Moreover, our approach not only reduces token usage by 86% compared to the previous multimodal speech LLM framework, but also improves computational efficiency by reducing FLOPs by 35.7%.
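
For scale, 3.5 tokens per second after an 86% reduction implies the previous multimodal speech LLM framework fed roughly 25 tokens per second to the LLM. The sketch below illustrates the duration- and rate-aware query allocation idea in PyTorch. The module names, layer sizes, and the softplus-based rate head are illustrative assumptions, not the authors' implementation; see the official JeongHun0716/MMS-LLaMA repository for the actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechRatePredictor(nn.Module):
    # Hypothetical rate head: pools fused AV features and emits a positive
    # scalar multiplier (near 1.0 for average-speed speech). An assumption
    # for illustration, not the paper's exact architecture.
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.ReLU(), nn.Linear(dim // 2, 1)
        )

    def forward(self, av_feats: torch.Tensor) -> torch.Tensor:
        # av_feats: (B, T, D) frame-level fused audio-visual features
        pooled = av_feats.mean(dim=1)        # (B, D)
        return F.softplus(self.mlp(pooled))  # (B, 1), strictly positive

class AVSpeechQFormer(nn.Module):
    # Sketch of duration-proportional query allocation: the number of
    # learned queries handed to the LLM grows with clip length (about
    # 3.5 tokens/s in the paper) and is scaled by the predicted rate.
    def __init__(self, dim: int, max_queries: int = 256,
                 n_heads: int = 8, tokens_per_sec: float = 3.5):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(max_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.tokens_per_sec = tokens_per_sec

    def forward(self, av_feats: torch.Tensor, duration_sec: float,
                rate_mult: float) -> torch.Tensor:
        # Allocate queries in proportion to duration and speaking speed,
        # clamped to [1, max_queries]. Scalar rate assumes batch size 1.
        n = int(round(self.tokens_per_sec * duration_sec * rate_mult))
        n = max(1, min(n, self.queries.size(0)))
        q = self.queries[:n].unsqueeze(0).expand(av_feats.size(0), -1, -1)
        compressed, _ = self.attn(q, av_feats, av_feats)  # cross-attention
        return compressed  # (B, n, D) compressed speech tokens for the LLM
```

A toy invocation, with an assumed feature rate and hidden size:

```python
# 4-second clip of fused AV features at 25 Hz, hidden size 512 (assumed).
av = torch.randn(1, 100, 512)
rate = SpeechRatePredictor(512)(av).item()
tokens = AVSpeechQFormer(512)(av, duration_sec=4.0, rate_mult=rate)
print(tokens.shape)  # (1, n, 512), with n near 3.5 tokens/s * 4 s
```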

Code Repositories

JeongHun0716/MMS-LLaMA (official, PyTorch)

Benchmarks

Benchmark: audio-visual-speech-recognition-on-lrs3-ted
Methodology: MMS-LLaMA
Metrics: Word Error Rate (WER): 0.74
