4 months ago

Tell What You Hear From What You See -- Video to Audio Generation Through Text

Liu Xiulong ; Su Kun ; Shlizerman Eli

Abstract

The content of visual and audio scenes is multi-faceted, such that a video can be paired with various audio and vice versa. Therefore, in the video-to-audio generation task, it is imperative to introduce steering approaches for controlling the generated audio. While video-to-audio generation is a well-established generative task, existing methods lack such controllability. In this work, we propose VATT, a multi-modal generative framework that takes a video and an optional text prompt as input, and generates audio and an optional textual description of the audio. Such a framework has two advantages: i) the video-to-audio generation process can be refined and controlled via text, which complements the context of the visual information, and ii) the model can suggest what audio to generate for the video by generating audio captions. VATT consists of two key modules: VATT Converter, an LLM that is fine-tuned for instructions and includes a projection layer that maps video features to the LLM vector space; and VATT Audio, a transformer that generates audio tokens from visual frames and from an optional text prompt using iterative parallel decoding. The audio tokens are converted to a waveform by a pretrained neural codec. Experiments show that, compared to existing video-to-audio generation methods on objective metrics, VATT achieves competitive performance when the audio caption is not provided. When the audio caption is provided as a prompt, VATT achieves even more refined performance (lowest KLD score of 1.41). Furthermore, subjective studies show that audio generated by VATT Audio is preferred over audio generated by existing methods. VATT enables controllable video-to-audio generation through text as well as suggesting text prompts for videos through audio captions, unlocking novel applications such as text-guided video-to-audio generation and video-to-audio captioning.
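
The iterative parallel decoding mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the decoding schedule or model interface, so the cosine unmasking schedule (borrowed from masked-token decoders such as MaskGIT), the `toy_predictor` stand-in for the audio transformer, and all parameter names are assumptions made for illustration only. The idea shown is that all audio tokens start masked, the model predicts every position in parallel at each step, and only the most confident predictions are committed until no masked positions remain.

```python
import math
import random

MASK = -1  # placeholder id for an audio token that has not been decoded yet


def toy_predictor(tokens, vocab_size, rng):
    """Hypothetical stand-in for the audio transformer.

    Returns one (token_id, confidence) pair per position. A real model
    would condition on video features, the optional text prompt, and the
    tokens already committed; this toy version is random.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_parallel_decode(seq_len, vocab_size, num_steps, seed=0):
    """Fill a fully masked token sequence over a fixed number of steps."""
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    for step in range(1, num_steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # Predict all positions in parallel.
        preds = toy_predictor(tokens, vocab_size, rng)
        # Assumed cosine schedule: how many positions stay masked after this step.
        keep_masked = int(seq_len * math.cos(math.pi / 2 * step / num_steps))
        if step == num_steps:
            keep_masked = 0  # the final step must commit everything
        num_to_fill = max(1, len(masked) - keep_masked)
        # Commit the most confident predictions among the masked positions.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:num_to_fill]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    decoded = iterative_parallel_decode(seq_len=16, vocab_size=1024, num_steps=4)
    print(decoded)
```

In a full pipeline, the resulting token ids would then be passed to a neural codec decoder to obtain the waveform; that step is omitted here.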

Benchmarks

Benchmark | Methodology | Metrics
video-to-sound-generation-on-vgg-sound | VATT-LLama | FAD: 2.38, KLD: 1.41
