4 months ago

ETTA: Elucidating the Design Space of Text-to-Audio Models

Lee Sang-gil; Kong Zhifeng; Goel Arushi; Kim Sungwon; Valle Rafael; Catanzaro Bryan

Abstract

Recent years have seen significant progress in Text-To-Audio (TTA) synthesis, enabling users to enrich their creative workflows with synthetic audio generated from natural language prompts. Despite this progress, the effects of data, model architecture, training objective functions, and sampling strategies on target benchmarks are not well understood. With the purpose of providing a holistic understanding of the design space of TTA models, we set up a large-scale empirical experiment focused on diffusion and flow matching models. Our contributions include: 1) AF-Synthetic, a large dataset of high quality synthetic captions obtained from an audio understanding model; 2) a systematic comparison of different architectural, training, and inference design choices for TTA models; 3) an analysis of sampling methods and their Pareto curves with respect to generation quality and inference speed. We leverage the knowledge obtained from this extensive analysis to propose our best model dubbed Elucidated Text-To-Audio (ETTA). When evaluated on AudioCaps and MusicCaps, ETTA provides improvements over the baselines trained on publicly available data, while being competitive with models trained on proprietary data. Finally, we show ETTA's improved ability to generate creative audio following complex and imaginative captions, a task that is more challenging than current benchmarks.
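The abstract mentions Pareto curves relating generation quality to inference speed. As a minimal sketch of how such a curve could be extracted from a set of sampler runs, the Python below keeps only the configurations that no other configuration beats on both time and quality. The configuration labels, timings, and scores are hypothetical placeholders chosen for illustration, not results from the paper, and the quality score is assumed to be a lower-is-better distance such as FAD.

```python
from typing import List, Tuple

# (label, seconds per sample, quality score where LOWER is better)
Config = Tuple[str, float, float]

# Hypothetical values for illustration only; these are NOT results from the paper.
EXAMPLE_CONFIGS: List[Config] = [
    ("sampler-A/25-steps", 1.0, 3.0),
    ("sampler-A/50-steps", 2.0, 2.6),
    ("sampler-B/25-steps", 2.0, 2.4),
    ("sampler-B/50-steps", 4.0, 2.5),
    ("sampler-A/100-steps", 4.0, 2.3),
]


def pareto_front(configs: List[Config]) -> List[Config]:
    """Return the configs that are not dominated when both time and score are minimized."""
    front: List[Config] = []
    best_score = float("inf")
    # Sweep from fastest to slowest (ties broken by score) and keep a config
    # only if it strictly improves on the best score of everything at least as fast.
    for cfg in sorted(configs, key=lambda c: (c[1], c[2])):
        if cfg[2] < best_score:
            front.append(cfg)
            best_score = cfg[2]
    return front


if __name__ == "__main__":
    for label, seconds, score in pareto_front(EXAMPLE_CONFIGS):
        print(f"{label}: {seconds:.1f} s, score {score:.2f}")
```

With the placeholder data, the front contains the fastest run plus each slower run that buys a strictly better score; runs that are both slower and no better than an alternative are dropped.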

Benchmarks

| Benchmark | Methodology | CLAP_LAION | CLAP_MS | FAD | FD | FD_openl3 | IS | KL_passt |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| audio-generation-on-audiocaps | ETTA | 0.54 | 0.43 | 2.51 | 13.12 | 80.13 | 14.36 | 1.22 |
| audio-generation-on-audiocaps | ETTA-FT-AC-100k | 0.60 | 0.43 | 2.03 | 10.10 | 61.79 | 14.29 | 1.13 |
| text-to-music-generation-on-musiccaps | ETTA | 0.51 | 0.53 | 1.91 | 10.06 | 92.18 | 3.32 | 0.84 |
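
To make the two AudioCaps entries easier to compare, the following sketch computes the relative change of each metric from ETTA to ETTA-FT-AC-100k. The numbers are copied from the benchmark listing above. The direction of each metric (higher is better for the CLAP scores and IS, lower is better for FAD, FD, FD_openl3, and KL_passt) is not stated on this page and is assumed here from common usage of these metrics.

```python
from typing import Dict, Tuple

# AudioCaps values copied from the benchmark listing on this page.
ETTA: Dict[str, float] = {
    "CLAP_LAION": 0.54, "CLAP_MS": 0.43, "FAD": 2.51, "FD": 13.12,
    "FD_openl3": 80.13, "IS": 14.36, "KL_passt": 1.22,
}
ETTA_FT: Dict[str, float] = {
    "CLAP_LAION": 0.60, "CLAP_MS": 0.43, "FAD": 2.03, "FD": 10.10,
    "FD_openl3": 61.79, "IS": 14.29, "KL_passt": 1.13,
}

# Assumption (not stated on the page): similarity and inception scores are
# higher-is-better; every other metric here is a distance or divergence.
HIGHER_IS_BETTER = {"CLAP_LAION", "CLAP_MS", "IS"}


def compare(base: Dict[str, float], new: Dict[str, float]) -> Dict[str, Tuple[float, str]]:
    """Map each metric to (relative change in percent, verdict)."""
    result: Dict[str, Tuple[float, str]] = {}
    for metric, base_value in base.items():
        change = (new[metric] - base_value) / base_value
        if change == 0:
            verdict = "unchanged"
        elif (change > 0) == (metric in HIGHER_IS_BETTER):
            verdict = "better"
        else:
            verdict = "worse"
        result[metric] = (round(100 * change, 1), verdict)
    return result


if __name__ == "__main__":
    for metric, (percent, verdict) in compare(ETTA, ETTA_FT).items():
        print(f"{metric:>10}: {percent:+6.1f}%  {verdict}")
```

Under these assumptions, ETTA-FT-AC-100k improves on ETTA for CLAP_LAION, FAD, FD, FD_openl3, and KL_passt (for example, FAD falls by about 19%), leaves CLAP_MS unchanged, and is marginally lower on IS (about 0.5%).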
