5 months ago

Read, Watch and Scream! Sound Generation from Text and Video

Jeong Yujin ; Kim Yunji ; Chun Sanghyuk ; Lee Jiyoung

Abstract

Despite the impressive progress of multimodal generative models, video-to-audio generation still suffers from limited performance and offers little flexibility to prioritize sound synthesis for specific objects within the scene. Conversely, text-to-audio generation methods produce high-quality audio but struggle to ensure comprehensive scene depiction and time-varying control. To tackle these challenges, we propose a novel video-and-text-to-audio generation method, called ReWaS, where video serves as a conditional control for a text-to-audio generation model. Specifically, our method estimates the structural information of sound (namely, energy) from the video while receiving key content cues from a user prompt. We employ a well-performing text-to-audio model to consolidate the video control, which is much more efficient than training multimodal diffusion models with massive triplet-paired (audio-video-text) data. In addition, by separating the generative components of audio, it becomes a more flexible system that allows users to freely adjust the energy, surrounding environment, and primary sound source according to their preferences. Experimental results demonstrate that our method is superior in quality, controllability, and training efficiency. Code and demo are available at https://naver-ai.github.io/rewas.
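To make the notion of "energy" as a structural control signal more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that energy can be approximated by a smoothed frame-level RMS loudness curve; the abstract does not give the exact definition, and the method itself predicts this curve from video rather than measuring it from audio. The function names (`frame_energy`, `smooth`, `scale`) and all parameter values are hypothetical.

```python
import math


def frame_energy(samples, frame_len=1024, hop=512):
    """Return the RMS energy of each frame of a mono waveform (list of floats).

    Assumption: RMS per frame stands in for the paper's "energy"; the
    abstract does not specify how energy is actually computed.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    energies = []
    last_start = max(len(samples) - frame_len, 0)
    for start in range(0, last_start + 1, hop):
        frame = samples[start:start + frame_len]
        if not frame:
            break
        energies.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return energies


def smooth(values, window=3):
    """Centered moving average; edge frames use the part of the window that fits."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def scale(curve, gain):
    """Illustrates user-side adjustment of the energy control (hypothetical)."""
    return [gain * v for v in curve]


if __name__ == "__main__":
    # Synthetic waveform: silence followed by a constant-amplitude segment.
    wave = [0.0] * 2048 + [1.0] * 2048
    curve = smooth(frame_energy(wave, frame_len=1024, hop=1024))
    print(curve)              # time-varying control curve
    print(scale(curve, 0.5))  # the same curve at half intensity
```

In this picture, the text prompt would decide what the sound is, while a curve like the one above would decide when and how strongly it occurs; scaling or editing the curve corresponds to the user-adjustable energy mentioned in the abstract.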

Code Repositories

naver-ai/rewas (official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
video-to-sound-generation-on-vgg-sound | ReWaS | FAD: 2.16, FD: 15.24
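For readers unfamiliar with the two metrics, both FAD (Fréchet Audio Distance) and FD (Fréchet Distance) are commonly computed as the Fréchet distance between two Gaussians fitted to embeddings of reference and generated audio; lower is better. The formula below is the standard definition, given here as a reading aid. The symbols are introduced for illustration, and this page does not state which embedding model was used for either score.

```latex
% (\mu_r, \Sigma_r): mean and covariance of embeddings of reference audio
% (\mu_g, \Sigma_g): mean and covariance of embeddings of generated audio
\[
  d_F^2 \;=\; \lVert \mu_r - \mu_g \rVert_2^2
  \;+\; \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g
  - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
\]
```

The two metrics typically differ only in the audio embedding network used to obtain the statistics, so their values are not directly comparable with each other.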
