MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens
Jeong Hun Yeo, Hyeongseop Rha, Se Jin Park, Yong Man Ro

Abstract
Audio-Visual Speech Recognition (AVSR) achieves robust speech recognition in noisy environments by combining auditory and visual information. However, recent Large Language Model (LLM) based AVSR systems incur high computational costs due to the high temporal resolution of audio-visual speech processed by LLMs. In this work, we introduce an efficient multimodal speech LLM framework that minimizes token length while preserving essential linguistic content. Our approach employs an early AV-fusion module for streamlined feature integration, an audio-visual speech Q-Former that dynamically allocates tokens based on input duration, and a refined query allocation strategy with a speech rate predictor that adjusts token allocation according to the speaking speed of each audio sample. Extensive experiments on the LRS3 dataset show that our method achieves state-of-the-art performance with a WER of 0.72% while using only 3.5 tokens per second. Moreover, our approach not only reduces token usage by 86% compared to the previous multimodal speech LLM framework, but also improves computational efficiency by reducing FLOPs by 35.7%.
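To make the duration- and rate-aware token allocation concrete, below is a minimal PyTorch sketch of a Q-Former-style module that compresses fused audio-visual features into a variable number of query tokens, scaling the query count by input duration and a predicted speech-rate factor. The module structure, parameter names (e.g. `BASE_TOKENS_PER_SEC`, `max_queries`), and the rate-scaling formula are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

BASE_TOKENS_PER_SEC = 3.5  # target token rate reported in the abstract (assumption for this sketch)


class DurationAwareQFormer(nn.Module):
    """Hypothetical sketch: compress fused AV features into a few query tokens."""

    def __init__(self, dim=512, max_queries=256, num_layers=2, num_heads=8):
        super().__init__()
        # Learnable query bank; a variable-length prefix is used per sample.
        self.query_bank = nn.Parameter(torch.randn(max_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        # Simple speech-rate predictor: pooled AV features -> per-sample rate multiplier.
        self.rate_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, av_feats, duration_sec):
        """av_feats: (B, T, D) fused audio-visual features; duration_sec: (B,) in seconds."""
        # Predict a speech-rate multiplier in (0, 2); ~1.0 for an average speaking rate.
        rate = torch.sigmoid(self.rate_head(av_feats.mean(dim=1))).squeeze(-1) * 2.0
        # Allocate queries proportional to duration, adjusted by predicted speech rate.
        n_queries = (duration_sec * BASE_TOKENS_PER_SEC * rate).ceil().long()
        n_queries = n_queries.clamp(min=1, max=self.query_bank.size(0))

        outputs = []
        for b in range(av_feats.size(0)):
            k = int(n_queries[b])
            queries = self.query_bank[:k].unsqueeze(0)               # (1, k, D)
            outputs.append(self.decoder(queries, av_feats[b:b + 1])) # cross-attend to AV features
        return outputs  # list of (1, k_b, D) compressed token sequences, fed to the LLM
```

In this sketch, a longer or faster utterance receives more query tokens, while short or slow speech is compressed more aggressively, which is one way to realize the roughly 3.5 tokens-per-second budget described in the abstract.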
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| audio-visual-speech-recognition-on-lrs3-ted | MMS-LLaMA | Word Error Rate (WER): 0.74 |