Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, Yuki Mitsufuji

Abstract

We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework, MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples. Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves new video-to-audio state-of-the-art among public models in terms of audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance. Code and demo are available at: https://hkchengrex.github.io/MMAudio
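
For readers unfamiliar with the flow matching objective mentioned in the abstract, the following is a minimal sketch of a generic conditional flow matching loss, not MMAudio's actual training code. It assumes a straight-line path from Gaussian noise to the data latent and a mean-squared-error regression on the path velocity. The names `flow_matching_loss`, `velocity_fn`, and `cond` are hypothetical and chosen for illustration only; `cond` stands in for whatever video and text conditions a model would receive.

```python
import numpy as np


def flow_matching_loss(velocity_fn, x1, cond, rng):
    """Generic conditional flow matching loss (illustrative sketch only).

    velocity_fn: hypothetical model, maps (x_t, t, cond) to a predicted velocity.
    x1:          batch of data latents, shape (batch, ...).
    cond:        conditioning passed through to velocity_fn unchanged.
    rng:         a numpy.random.Generator.
    """
    batch = x1.shape[0]
    # Sample Gaussian noise as the starting point of each path.
    x0 = rng.standard_normal(x1.shape)
    # One timestep in [0, 1) per example, shaped to broadcast over x1.
    t = rng.uniform(size=(batch,) + (1,) * (x1.ndim - 1))
    # Assumed straight-line interpolation between noise and data.
    xt = (1.0 - t) * x0 + t * x1
    # Along a straight line the velocity is constant: data minus noise.
    target = x1 - x0
    pred = velocity_fn(xt, t, cond)
    return float(np.mean((pred - target) ** 2))
```

At sampling time, a model trained this way would be integrated from noise at t = 0 toward data at t = 1 with an ODE solver; the number of solver steps is one factor that determines inference time.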

Code Repositories

hkchengrex/MMAudio (official, PyTorch)

Benchmarks

Benchmark                                Methodology          FAD    FD
video-to-sound-generation-on-vgg-sound   MMAudio-S-16kHz      0.79   5.22
video-to-sound-generation-on-vgg-sound   MMAudio-L-44.1kHz    0.97   4.72
