3 months ago

Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations

Jeong Hun Yeo, Minsu Kim, Chae Won Kim, Stavros Petridis, Yong Man Ro


Abstract

We explore a novel zero-shot Audio-Visual Speech Recognition (AVSR) framework, dubbed Zero-AVSR, which enables speech recognition in target languages without requiring any audio-visual speech data in those languages. Specifically, we introduce the Audio-Visual Speech Romanizer (AV-Romanizer), which learns language-agnostic speech representations by predicting Roman text. Then, by leveraging the strong multilingual modeling capabilities of Large Language Models (LLMs), we propose converting the predicted Roman text into language-specific graphemes, forming the proposed Cascaded Zero-AVSR. Taking it a step further, we explore a unified Zero-AVSR approach by directly integrating the audio-visual speech representations encoded by the AV-Romanizer into the LLM. This is achieved through finetuning the adapter and the LLM using our proposed multi-task learning scheme. To capture the wide spectrum of phonetic and linguistic diversity, we also introduce a Multilingual Audio-Visual Romanized Corpus (MARC) consisting of 2,916 hours of audio-visual speech data across 82 languages, along with transcriptions in both language-specific graphemes and Roman text. Extensive analysis and experiments confirm that the proposed Zero-AVSR framework has the potential to expand language support beyond the languages seen during the training of the AV-Romanizer.
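The two-stage data flow of the Cascaded Zero-AVSR described in the abstract can be sketched as below. This is only an illustration of the cascade's shape, not the authors' implementation: `cascaded_zero_avsr`, `toy_romanizer`, `toy_deromanizer`, the lookup table, and the `"kor"` language tag are all hypothetical stand-ins. In the paper, the first stage is a trained audio-visual model and the second stage is an LLM.

```python
from typing import Callable, Dict, Tuple

# Hypothetical type aliases for the two stages of the cascade.
Romanizer = Callable[[bytes], str]       # audio-visual input -> Roman text
Deromanizer = Callable[[str, str], str]  # (Roman text, language) -> graphemes


def cascaded_zero_avsr(
    av_input: bytes,
    language: str,
    romanize: Romanizer,
    deromanize: Deromanizer,
) -> str:
    """Run stage 1 (romanization), then stage 2 (de-romanization)."""
    roman_text = romanize(av_input)
    return deromanize(roman_text, language)


# Toy stand-ins so the sketch runs; neither resembles the real models.
def toy_romanizer(av_input: bytes) -> str:
    # Assumption: the "input" already carries its Roman text as UTF-8 bytes.
    return av_input.decode("utf-8")


# Assumption: a tiny lookup table plays the role of the LLM.
TOY_TABLE: Dict[Tuple[str, str], str] = {
    ("annyeonghaseyo", "kor"): "안녕하세요",
}


def toy_deromanizer(roman_text: str, language: str) -> str:
    # Fall back to the Roman text when the pair is unknown.
    return TOY_TABLE.get((roman_text, language), roman_text)


result = cascaded_zero_avsr(
    b"annyeonghaseyo", "kor", toy_romanizer, toy_deromanizer
)
print(result)
```

Because the first stage emits Roman text regardless of the spoken language, only the second stage needs to know the target language, which is what lets the cascade reach languages the first stage never saw.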

Code Repositories

JeongHun0716/zero-avsr (Official, PyTorch, mentioned in GitHub)

Benchmarks

Benchmark: audio-visual-speech-recognition-on-lrs3-ted
Methodology: Zero-AVSR
Metrics: Word Error Rate (WER): 1.5
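
For readers unfamiliar with the metric, Word Error Rate is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The listing does not state units; the 1.5 figure is presumably a percentage. A minimal sketch of the standard definition follows. The function name `word_error_rate` and the whitespace tokenization are assumptions for illustration, and the benchmark's own text normalization is not reproduced.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.4f}")  # prints 0.3333
```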
