Fast Timing-Conditioned Latent Audio Diffusion

Zach Evans, CJ Carr, Josiah Taylor, Scott H. Hawley, Jordi Pons

Abstract

Generating long-form 44.1kHz stereo audio from text prompts can be computationally demanding. Further, most previous works do not tackle that music and sound effects naturally vary in their duration. Our research focuses on the efficient generation of long-form, variable-length stereo music and sounds at 44.1kHz using text prompts with a generative model. Stable Audio is based on latent diffusion, with its latent defined by a fully-convolutional variational autoencoder. It is conditioned on text prompts as well as timing embeddings, allowing for fine control over both the content and length of the generated music and sounds. Stable Audio is capable of rendering stereo signals of up to 95 sec at 44.1kHz in 8 sec on an A100 GPU. Despite its compute efficiency and fast inference, it is one of the best in two public text-to-music and -audio benchmarks and, differently from state-of-the-art models, can generate music with structure and stereo sounds.
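
The timing conditioning described in the abstract can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration rather than the paper's implementation: the module name `TimingConditioner`, the embedding width, and the normalization by a 95-second maximum are hypothetical choices, used only to show how a start offset and a total length could become extra conditioning tokens alongside the text-prompt embeddings of a latent diffusion backbone.

```python
import torch
import torch.nn as nn

class TimingConditioner(nn.Module):
    """Hypothetical sketch: map (seconds_start, seconds_total) to learned
    conditioning tokens. Names and sizes are illustrative, not the paper's."""

    def __init__(self, dim: int = 768, max_seconds: float = 95.0):
        super().__init__()
        self.max_seconds = max_seconds
        self.start_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.total_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, seconds_start: torch.Tensor, seconds_total: torch.Tensor) -> torch.Tensor:
        # Normalize timing values to [0, 1] before embedding (assumed scheme).
        s = (seconds_start / self.max_seconds).unsqueeze(-1)
        t = (seconds_total / self.max_seconds).unsqueeze(-1)
        # Two extra conditioning tokens: one for the start offset, one for the total length.
        return torch.stack([self.start_mlp(s), self.total_mlp(t)], dim=1)

# Usage sketch: concatenate the timing tokens with the text-prompt tokens and
# feed the result as cross-attention conditioning to the diffusion model that
# operates on the VAE latents.
text_tokens = torch.randn(2, 77, 768)  # placeholder for text-encoder output
timing_tokens = TimingConditioner()(torch.tensor([0.0, 10.0]), torch.tensor([95.0, 30.0]))
conditioning = torch.cat([text_tokens, timing_tokens], dim=1)  # (batch, 77 + 2, 768)
```

In the paper's framing, conditioning on the total length is what gives fine control over the duration of the generated audio, so variable-length music and sound effects can be requested at inference time.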

Benchmarks

Benchmark                              | Methodology  | Metrics
audio-generation-on-audiocaps          | Stable Audio | CLAP_LAION: 0.41; FD_openl3: 103.66; KL_passt: 2.89
text-to-music-generation-on-musiccaps  | Stable Audio | FD_openl3: 108.69; KL_passt: 0.80
