
3 months ago

Taming Data and Transformers for Audio Generation

Moayed Haji-Ali Willi Menapace Aliaksandr Siarohin Guha Balakrishnan Vicente Ordonez

Abstract

The scalability of ambient sound generators is hindered by data scarcity, insufficient caption quality, and limited scalability in model architecture. This work addresses these challenges by advancing both data and model scaling. First, we propose an efficient and scalable dataset collection pipeline tailored for ambient audio generation, resulting in AutoReCap-XL, the largest ambient audio-text dataset with over 47 million clips. To provide high-quality textual annotations, we propose AutoCap, a high-quality automatic audio captioning model. By adopting a Q-Former module and leveraging audio metadata, AutoCap substantially enhances caption quality, reaching a CIDEr score of $83.2$, a $3.2\%$ improvement over previous captioning models. Finally, we propose GenAu, a scalable transformer-based audio generation architecture that we scale up to 1.25B parameters. We demonstrate its benefits from data scaling with synthetic captions as well as model size scaling. When compared to baseline audio generators trained at similar size and data scale, GenAu obtains significant improvements of $4.7\%$ in FAD score, $11.1\%$ in IS, and $13.5\%$ in CLAP score. Our code, model checkpoints, and dataset are publicly available.

Code Repositories

snap-research/GenAU (framework: PyTorch; mentioned in GitHub)

Benchmarks

Benchmark: audio-captioning-on-audiocaps
Methodology: AutoCap
Metrics: CIDEr: 0.832, METEOR: 0.253, ROUGE: 0.518, ROUGE-L: 0.518, SPICE: 0.182, SPIDEr: 0.507

Benchmark: audio-generation-on-audiocaps
Methodology: GenAu-Large
Metrics: CLAP_MS: 0.668, FAD: 1.21, FD: 16.51
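
As a quick sanity check on the AutoCap captioning metrics listed above: SPIDEr is conventionally defined as the arithmetic mean of CIDEr and SPICE, so the reported values can be cross-checked against each other. The sketch below is illustrative only; the helper name `spider` and the rounding tolerance are assumptions, not part of the GenAU code base. The CIDEr value of 0.832 here appears to correspond to the 83.2 quoted in the abstract, expressed on a 0-1 scale.

```python
def spider(cider: float, spice: float) -> float:
    """Return SPIDEr, the arithmetic mean of CIDEr and SPICE."""
    return (cider + spice) / 2.0


# AutoCap scores on AudioCaps, copied from the benchmark listing above.
reported = {"CIDEr": 0.832, "SPICE": 0.182, "SPIDEr": 0.507}

recomputed = spider(reported["CIDEr"], reported["SPICE"])

# Assumed tolerance: the listing rounds to three decimals, so allow
# half a unit in the last reported digit.
is_consistent = abs(recomputed - reported["SPIDEr"]) < 5e-4

print(f"recomputed SPIDEr = {recomputed:.3f}, consistent = {is_consistent}")
```

Under these assumptions the recomputed value agrees with the listed SPIDEr score to three decimals, which suggests the captioning row was transcribed consistently.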
