TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization
Chia-Yu Hung, Navonil Majumder, Zhifeng Kong, Ambuj Mehrish, Rafael Valle, Bryan Catanzaro, Soujanya Poria

Abstract
We introduce TangoFlux, an efficient Text-to-Audio (TTA) generative model with 515M parameters, capable of generating up to 30 seconds of 44.1kHz audio in just 3.7 seconds on a single A40 GPU. A key challenge in aligning TTA models lies in the difficulty of creating preference pairs, as TTA lacks structured mechanisms like verifiable rewards or gold-standard answers available for Large Language Models (LLMs). To address this, we propose CLAP-Ranked Preference Optimization (CRPO), a novel framework that iteratively generates and optimizes preference data to enhance TTA alignment. We demonstrate that the audio preference dataset generated using CRPO outperforms existing alternatives. With this framework, TangoFlux achieves state-of-the-art performance across both objective and subjective benchmarks. We open source all code and models to support further research in TTA generation.
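
The abstract does not spell out how CRPO turns CLAP rankings into preference pairs, so the following is only a minimal sketch of one plausible step. It assumes that several audio candidates are generated per prompt, that each candidate is scored by cosine similarity between a text embedding and an audio embedding, and that the top-scoring candidate is treated as "preferred" and the bottom-scoring one as "rejected". The function names and the toy two-dimensional vectors are hypothetical stand-ins; a real pipeline would obtain embeddings from a CLAP model.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_preference_pair(
    text_emb: Sequence[float],
    candidate_embs: List[Sequence[float]],
) -> Tuple[int, int, List[float]]:
    """Rank candidates by similarity to the prompt embedding.

    Returns (preferred_index, rejected_index, scores).
    ASSUMPTION: best-vs-worst pairing; the abstract does not confirm this rule.
    """
    scores = [cosine(text_emb, emb) for emb in candidate_embs]
    preferred = max(range(len(scores)), key=scores.__getitem__)
    rejected = min(range(len(scores)), key=scores.__getitem__)
    return preferred, rejected, scores


# Toy embeddings (hypothetical): one prompt, three generated candidates.
text_emb = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]

preferred, rejected, scores = build_preference_pair(text_emb, candidates)
print(f"preferred={preferred}, rejected={rejected}")
```

In an iterative setup like the one the abstract describes, pairs built this way would feed a preference-optimization update, after which the updated model would generate fresh candidates for the next round.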
Benchmarks
| Benchmark | Model | CLAP_LAION | FD_openl3 | IS | KL_passt |
|---|---|---|---|---|---|
| audio-generation-on-audiocaps | TangoFlux | 0.488 | 75.1 | 12.2 | 1.15 |
| audio-generation-on-audiocaps | TangoFlux-base | 0.438 | 79.7 | 10.7 | 1.23 |
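
To make the gap between the two rows easier to read, the short sketch below computes the relative change of TangoFlux over TangoFlux-base for each metric in the benchmark table. The metric directions are an assumption based on common convention rather than anything stated on this page: higher is taken as better for CLAP score and Inception Score (IS), and lower as better for Fréchet Distance (FD) and KL divergence.

```python
# Values copied from the benchmark table above.
metrics = {
    "TangoFlux": {"CLAP_LAION": 0.488, "FD_openl3": 75.1, "IS": 12.2, "KL_passt": 1.15},
    "TangoFlux-base": {"CLAP_LAION": 0.438, "FD_openl3": 79.7, "IS": 10.7, "KL_passt": 1.23},
}

# ASSUMPTION: conventional metric directions, not stated in the source page.
higher_is_better = {"CLAP_LAION": True, "FD_openl3": False, "IS": True, "KL_passt": False}


def relative_change(full: dict, base: dict) -> dict:
    """Percent change of `full` relative to `base`, per metric."""
    return {name: (full[name] - base[name]) / base[name] * 100.0 for name in base}


rel = relative_change(metrics["TangoFlux"], metrics["TangoFlux-base"])

# A metric improved if it moved in its preferred direction.
improved = {name: (delta > 0) == higher_is_better[name] for name, delta in rel.items()}

for name, delta in rel.items():
    print(f"{name}: {delta:+.1f}% ({'better' if improved[name] else 'worse'})")
```

Under these assumed directions, all four metrics move in the favorable direction from TangoFlux-base to TangoFlux.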