Zach Evans, CJ Carr, Josiah Taylor, Scott H. Hawley, Jordi Pons

Abstract
Generating long-form 44.1kHz stereo audio from text prompts can be computationally demanding. Further, most previous works do not tackle that music and sound effects naturally vary in their duration. Our research focuses on the efficient generation of long-form, variable-length stereo music and sounds at 44.1kHz using text prompts with a generative model. Stable Audio is based on latent diffusion, with its latent defined by a fully-convolutional variational autoencoder. It is conditioned on text prompts as well as timing embeddings, allowing for fine control over both the content and length of the generated music and sounds. Stable Audio is capable of rendering stereo signals of up to 95 sec at 44.1kHz in 8 sec on an A100 GPU. Despite its compute efficiency and fast inference, it is one of the best in two public text-to-music and -audio benchmarks and, differently from state-of-the-art models, can generate music with structure and stereo sounds.
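To put the throughput figures in the abstract into concrete terms, the following is a small back-of-the-envelope sketch in Python. It uses only the numbers stated above (95 sec of stereo audio at 44.1kHz, rendered in 8 sec); the function and constant names are illustrative and do not come from the Stable Audio codebase.

```python
# Back-of-the-envelope arithmetic from the figures quoted in the abstract.
# All names here are illustrative, not part of any Stable Audio API.

SAMPLE_RATE_HZ = 44_100   # 44.1kHz, as stated in the abstract
CHANNELS = 2              # stereo
MAX_DURATION_S = 95       # longest signal quoted in the abstract
RENDER_TIME_S = 8         # reported render time on an A100 GPU


def samples_per_channel(duration_s: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of audio samples in one channel for a clip of the given length."""
    return int(round(duration_s * sample_rate_hz))


def real_time_factor(audio_s: float, render_s: float) -> float:
    """Seconds of audio produced per second of compute."""
    return audio_s / render_s


per_channel = samples_per_channel(MAX_DURATION_S)
total_samples = per_channel * CHANNELS
speedup = real_time_factor(MAX_DURATION_S, RENDER_TIME_S)

print(per_channel)    # 4189500
print(total_samples)  # 8379000
print(speedup)        # 11.875
```

Under these figures, a full-length output is roughly 4.2 million samples per channel, generated almost 12 times faster than real time.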
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| audio-generation-on-audiocaps | Stable Audio | CLAP_LAION: 0.41; FD_openl3: 103.66; KL_passt: 2.89 |
| text-to-music-generation-on-musiccaps | Stable Audio | FD_openl3: 108.69; KL_passt: 0.80 |