Zhifeng Kong, Sang-gil Lee, Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, Rafael Valle, Soujanya Poria, Bryan Catanzaro

Abstract
It is an open challenge to obtain high-quality training data, especially captions, for text-to-audio models. Although prior methods have leveraged *text-only language models* to augment and improve captions, such methods have limitations related to scale and coherence between audio and captions. In this work, we propose an audio captioning pipeline that uses an *audio language model* to synthesize accurate and diverse captions for audio at scale. We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named `AF-AudioSet`, and then evaluate the benefit of pre-training text-to-audio models on these synthetic captions. Through systematic evaluations on AudioCaps and MusicCaps, we find that leveraging our pipeline and synthetic captions leads to significant improvements in audio generation quality, achieving a new *state-of-the-art*.
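The pipeline described in the abstract pairs an audio language model (the captioner) with a filtering step so that only captions that actually match the audio are kept. The sketch below illustrates one plausible realization; `caption_clip`, `clap_similarity`, the candidate count, and the score threshold are hypothetical stand-ins, not the paper's exact recipe.

```python
# Hedged sketch of a synthetic-captioning pipeline for AudioSet-style clips.
# `caption_clip` and `clap_similarity` are hypothetical stand-ins for an
# audio language model and a CLAP-based audio-text similarity scorer.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CaptionedClip:
    audio_path: str
    caption: str
    clap_score: float  # audio-text similarity, roughly in [-1, 1]

def build_synthetic_captions(
    audio_paths: List[str],
    caption_clip: Callable[[str], List[str]],      # audio LM: path -> candidate captions
    clap_similarity: Callable[[str, str], float],  # scorer: (path, text) -> similarity
    num_candidates: int = 5,
    min_score: float = 0.45,                       # illustrative threshold
) -> List[CaptionedClip]:
    """Caption each clip, score candidates against the audio, and keep the
    best candidate only if it clears a similarity threshold (quality filter)."""
    kept: List[CaptionedClip] = []
    for path in audio_paths:
        candidates = caption_clip(path)[:num_candidates]
        if not candidates:
            continue
        scored = [(clap_similarity(path, c), c) for c in candidates]
        best_score, best_caption = max(scored)
        if best_score >= min_score:
            kept.append(CaptionedClip(path, best_caption, best_score))
    return kept
```

The filtered (audio, caption) pairs would then serve as pre-training data for the text-to-audio model before fine-tuning on AudioCaps or MusicCaps; the exact filtering rule and threshold used for `AF-AudioSet` are assumptions here.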
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| Audio generation on AudioCaps | Tango-AF&AC-FT-AC | CLAP_LAION: 0.527, FAD: 2.54, FD: 17.19, IS: 11.04 |
| Text-to-music generation on MusicCaps | TANGO-AF | CLAP_LAION: 0.51, CLAP_MS: 0.43, FAD: 2.21, FD: 22.69, FD_openl3: 270.32, IS: 2.79, KL_passt: 0.94 |
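CLAP_LAION in the table is a text-audio similarity score: higher means the generated audio better matches its prompt. A minimal way to compute such a score, assuming the `laion_clap` package and its `CLAP_Module` API (the default checkpoint and the mean-of-cosine-similarities aggregation are assumptions, not necessarily the benchmark's exact protocol), looks like this:

```python
# Hedged sketch: score generated audio against its text prompts with LAION-CLAP.
# Assumes the `laion_clap` package; checkpoint and aggregation are illustrative.
import numpy as np
import laion_clap

def clap_scores(audio_paths, prompts):
    """Return per-example cosine similarity between audio and text embeddings."""
    model = laion_clap.CLAP_Module(enable_fusion=False)
    model.load_ckpt()  # loads a default pretrained checkpoint

    audio_emb = model.get_audio_embedding_from_filelist(x=audio_paths, use_tensor=False)
    text_emb = model.get_text_embedding(prompts, use_tensor=False)

    # Normalize and compare paired audio/text embeddings row by row.
    audio_emb = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return np.sum(audio_emb * text_emb, axis=1)

# Usage: a dataset-level CLAP score would be the mean over the evaluation set.
# scores = clap_scores(["gen_0.wav", "gen_1.wav"], ["a dog barking", "rain on a tin roof"])
# print(scores.mean())
```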