MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens
Jeong Hun Yeo, Hyeongseop Rha, Se Jin Park, Yong Man Ro

Abstract
Audio-Visual Speech Recognition (AVSR) achieves robust speech recognition in noisy environments by combining auditory and visual information. However, recent Large Language Model (LLM) based AVSR systems incur high computational costs due to the high temporal resolution of audio-visual speech processed by LLMs. In this work, we introduce an efficient multimodal speech LLM framework that minimizes token length while preserving essential linguistic content. Our approach employs an early AV-fusion module for streamlined feature integration, an audio-visual speech Q-Former that dynamically allocates tokens based on input duration, and a refined query allocation strategy with a speech rate predictor that adjusts token allocation according to the speaking speed of each audio sample. Extensive experiments on the LRS3 dataset show that our method achieves state-of-the-art performance with a WER of 0.72% while using only 3.5 tokens per second. Moreover, our approach not only reduces token usage by 86% compared to the previous multimodal speech LLM framework, but also improves computational efficiency by reducing FLOPs by 35.7%.
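To make the duration- and rate-aware token allocation concrete, below is a minimal PyTorch sketch of a Q-Former-style module that compresses fused audio-visual features into a variable number of query tokens, scaling the query count by input duration and a predicted speech-rate factor. The module structure, parameter names (e.g. `BASE_TOKENS_PER_SEC`, `max_queries`), and the rate-scaling formula are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

BASE_TOKENS_PER_SEC = 3.5  # target token rate reported in the abstract (assumption for this sketch)


class DurationAwareQFormer(nn.Module):
    """Hypothetical sketch: compress fused AV features into a few query tokens."""

    def __init__(self, dim=512, max_queries=256, num_layers=2, num_heads=8):
        super().__init__()
        # Learnable query bank; a variable-length prefix is used per sample.
        self.query_bank = nn.Parameter(torch.randn(max_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        # Simple speech-rate predictor: pooled AV features -> per-sample rate multiplier.
        self.rate_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, av_feats, duration_sec):
        """av_feats: (B, T, D) fused audio-visual features; duration_sec: (B,) in seconds."""
        # Predict a speech-rate multiplier in (0, 2); ~1.0 for an average speaking rate.
        rate = torch.sigmoid(self.rate_head(av_feats.mean(dim=1))).squeeze(-1) * 2.0
        # Allocate queries proportional to duration, adjusted by predicted speech rate.
        n_queries = (duration_sec * BASE_TOKENS_PER_SEC * rate).ceil().long()
        n_queries = n_queries.clamp(min=1, max=self.query_bank.size(0))

        outputs = []
        for b in range(av_feats.size(0)):
            k = int(n_queries[b])
            queries = self.query_bank[:k].unsqueeze(0)               # (1, k, D)
            outputs.append(self.decoder(queries, av_feats[b:b + 1])) # cross-attend to AV features
        return outputs  # list of (1, k_b, D) compressed token sequences, fed to the LLM
```

In this sketch, a longer or faster utterance receives more query tokens, while short or slow speech is compressed more aggressively, which is one way to realize the roughly 3.5 tokens-per-second budget described in the abstract.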
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| audio-visual-speech-recognition-on-lrs3-ted | MMS-LLaMA | Word Error Rate (WER): 0.74 |