HyperAI


Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, Yuki Mitsufuji

Abstract

We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples. Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves new video-to-audio state-of-the-art among public models in terms of audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance. Code and demo are available at: https://hkchengrex.github.io/MMAudio
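The flow matching objective mentioned in the abstract can be sketched as follows. This is a minimal illustration of conditional flow matching with a linear interpolation path; the toy model, tensor shapes, and the way conditioning is fed in are illustrative assumptions, not the paper's actual MMAudio architecture.

```python
import numpy as np

def flow_matching_loss(model, x1, cond, rng):
    """Conditional flow matching loss (sketch).

    x1: clean audio latents, shape (batch, dim).
    cond: conditioning features (e.g. video/text embeddings); how they
    are injected here is a hypothetical stand-in for the real model.
    """
    b, d = x1.shape
    x0 = rng.standard_normal((b, d))      # noise sample
    t = rng.uniform(size=(b, 1))          # random time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1          # point on the straight-line path
    target_v = x1 - x0                    # constant target velocity along it
    pred_v = model(xt, t, cond)           # model predicts the velocity field
    return float(np.mean((pred_v - target_v) ** 2))

class ToyModel:
    """Linear map over [xt, t, cond]; stands in for the real network."""
    def __init__(self, dim, cond_dim, rng):
        self.W = rng.standard_normal((dim + 1 + cond_dim, dim)) * 0.01
    def __call__(self, xt, t, cond):
        return np.concatenate([xt, t, cond], axis=1) @ self.W

rng = np.random.default_rng(0)
model = ToyModel(4, 2, rng)
x1 = rng.standard_normal((8, 4))    # fake "audio latents"
cond = rng.standard_normal((8, 2))  # fake conditioning features
loss = flow_matching_loss(model, x1, cond, rng)
```

At inference, the learned velocity field is integrated from noise at t=0 to a sample at t=1 with an ODE solver; the short generation time reported above reflects how few solver steps such straight-path objectives typically need.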

