HyperAI


V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models

Heng Wang, Jianbo Ma, Santiago Pascual, Richard Cartwright, Weidong Cai


Abstract

Building artificial intelligence (AI) systems on top of a set of foundation models (FMs) is becoming a new paradigm in AI research. Their representative and generative abilities learnt from vast amounts of data can be easily adapted and transferred to a wide range of downstream tasks without extra training from scratch. However, leveraging FMs in cross-modal generation remains under-researched when the audio modality is involved. On the other hand, automatically generating semantically-relevant sound from visual input is an important problem in cross-modal generation studies. To solve this vision-to-audio (V2A) generation problem, existing methods tend to design and build complex systems from scratch using modestly sized datasets. In this paper, we propose a lightweight solution to this problem by leveraging foundation models, specifically CLIP, CLAP, and AudioLDM. We first investigate the domain gap between the latent space of the visual CLIP and the auditory CLAP models. Then we propose a simple yet effective mapper mechanism (V2A-Mapper) to bridge the domain gap by translating the visual input between CLIP and CLAP spaces. Conditioned on the translated CLAP embedding, the pretrained audio generative FM AudioLDM is adopted to produce high-fidelity and visually-aligned sound. Compared to previous approaches, our method only requires a quick training of the V2A-Mapper. We further analyze and conduct extensive experiments on the choice of the V2A-Mapper and show that a generative mapper is better at fidelity and variability (FD) while a regression mapper is slightly better at relevance (CS). Both objective and subjective evaluation on two V2A datasets demonstrate the superiority of our proposed method compared to current state-of-the-art approaches - trained with 86% fewer parameters but achieving 53% and 19% improvement in FD and CS, respectively.
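The pipeline described in the abstract — encode the image with a frozen CLIP, translate that embedding into CLAP space with the (only trainable) V2A-Mapper, then condition frozen AudioLDM on the translated embedding — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 512-dimensional embedding sizes, the two-layer regression-mapper shape, and the `clip_encode`/`audioldm_generate` stubs are all assumptions standing in for the real foundation models.

```python
import numpy as np

rng = np.random.default_rng(0)

CLIP_DIM = 512   # assumed CLIP image-embedding size
CLAP_DIM = 512   # assumed CLAP audio-embedding size
HIDDEN = 1024    # assumed hidden width for the regression mapper

def clip_encode(image: np.ndarray) -> np.ndarray:
    """Stand-in for a frozen CLIP image encoder (assumption, not the real API)."""
    return rng.standard_normal(CLIP_DIM)

def audioldm_generate(clap_embedding: np.ndarray) -> np.ndarray:
    """Stand-in for frozen AudioLDM conditioned on a CLAP embedding."""
    return rng.standard_normal(16000)  # e.g. 1 s of 16 kHz audio, for illustration

class RegressionMapper:
    """Two-layer MLP translating CLIP space to CLAP space.
    Only this mapper would be trained; CLIP, CLAP, and AudioLDM stay frozen."""

    def __init__(self):
        self.w1 = rng.standard_normal((CLIP_DIM, HIDDEN)) * 0.02
        self.b1 = np.zeros(HIDDEN)
        self.w2 = rng.standard_normal((HIDDEN, CLAP_DIM)) * 0.02
        self.b2 = np.zeros(CLAP_DIM)

    def __call__(self, clip_emb: np.ndarray) -> np.ndarray:
        h = np.maximum(clip_emb @ self.w1 + self.b1, 0.0)   # ReLU hidden layer
        out = h @ self.w2 + self.b2
        # Unit-normalize, since contrastive embeddings such as CLAP's
        # typically live on the unit sphere.
        return out / (np.linalg.norm(out) + 1e-8)

mapper = RegressionMapper()
image = np.zeros((224, 224, 3))        # dummy input image
clap_emb = mapper(clip_encode(image))  # V2A-Mapper: CLIP space -> CLAP space
audio = audioldm_generate(clap_emb)    # frozen AudioLDM produces the sound
print(clap_emb.shape, audio.shape)
```

The paper's alternative is a generative (diffusion-style) mapper in place of this regression MLP, trading a little relevance (CS) for better fidelity and variability (FD).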

Code Repositories

heng-hw/V2A-Mapper
Official
Mentioned in GitHub

Benchmarks

Benchmark: video-to-sound-generation-on-vgg-sound
Methodology: V2A-Mapper
Metrics: FAD 0.841, FD 24.168
