Lee Sang-gil ; Kong Zhifeng ; Goel Arushi ; Kim Sungwon ; Valle Rafael ; Catanzaro Bryan

Abstract
Recent years have seen significant progress in Text-To-Audio (TTA) synthesis, enabling users to enrich their creative workflows with synthetic audio generated from natural language prompts. Despite this progress, the effects of data, model architecture, training objective functions, and sampling strategies on target benchmarks are not well understood. With the purpose of providing a holistic understanding of the design space of TTA models, we set up a large-scale empirical experiment focused on diffusion and flow matching models. Our contributions include: 1) AF-Synthetic, a large dataset of high quality synthetic captions obtained from an audio understanding model; 2) a systematic comparison of different architectural, training, and inference design choices for TTA models; 3) an analysis of sampling methods and their Pareto curves with respect to generation quality and inference speed. We leverage the knowledge obtained from this extensive analysis to propose our best model dubbed Elucidated Text-To-Audio (ETTA). When evaluated on AudioCaps and MusicCaps, ETTA provides improvements over the baselines trained on publicly available data, while being competitive with models trained on proprietary data. Finally, we show ETTA's improved ability to generate creative audio following complex and imaginative captions, a task that is more challenging than current benchmarks.
Benchmarks
| Benchmark | Methodology | CLAP_LAION | CLAP_MS | FAD | FD | FD_openl3 | IS | KL_passt |
|---|---|---|---|---|---|---|---|---|
| audio-generation-on-audiocaps | ETTA | 0.54 | 0.43 | 2.51 | 13.12 | 80.13 | 14.36 | 1.22 |
| audio-generation-on-audiocaps | ETTA-FT-AC-100k | 0.60 | 0.43 | 2.03 | 10.10 | 61.79 | 14.29 | 1.13 |
| text-to-music-generation-on-musiccaps | ETTA | 0.51 | 0.53 | 1.91 | 10.06 | 92.18 | 3.32 | 0.84 |
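
The two AudioCaps rows can be compared metric by metric. The following is a minimal sketch, not part of the original work: the values are copied from the table above, while the names `compare` and `LOWER_IS_BETTER` are illustrative. The direction of improvement is an assumption based on the usual convention (distance and divergence metrics such as FAD, FD, and KL are better when lower; CLAP similarity and Inception Score are better when higher), since the page itself does not state it.

```python
# Metric values copied from the two AudioCaps rows of the benchmark table.
etta = {"CLAP_LAION": 0.54, "CLAP_MS": 0.43, "FAD": 2.51, "FD": 13.12,
        "FD_openl3": 80.13, "IS": 14.36, "KL_passt": 1.22}
etta_ft = {"CLAP_LAION": 0.60, "CLAP_MS": 0.43, "FAD": 2.03, "FD": 10.10,
           "FD_openl3": 61.79, "IS": 14.29, "KL_passt": 1.13}

# Assumption: these metrics improve as they decrease; all others improve
# as they increase. This follows common convention, not the source page.
LOWER_IS_BETTER = {"FAD", "FD", "FD_openl3", "KL_passt"}


def compare(base, other):
    """Return {metric: (delta, verdict)} for `other` relative to `base`."""
    result = {}
    for name, base_value in base.items():
        delta = round(other[name] - base_value, 2)
        if delta == 0:
            verdict = "same"
        elif (delta < 0) == (name in LOWER_IS_BETTER):
            verdict = "better"
        else:
            verdict = "worse"
        result[name] = (delta, verdict)
    return result


comparison = compare(etta, etta_ft)
for name, (delta, verdict) in comparison.items():
    print(f"{name:>10}: {delta:+.2f} ({verdict})")
```

Under these assumptions, the fine-tuned ETTA-FT-AC-100k row improves on or matches the base ETTA row for every AudioCaps metric except IS, which drops slightly from 14.36 to 14.29.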