8 months ago

Diffusion Model

Method/Architecture

Zach Evans CJ Carr Josiah Taylor Scott H. Hawley Jordi Pons

Abstract

Generating long-form 44.1kHz stereo audio from text prompts can be computationally demanding. Further, most previous works do not address the fact that music and sound effects naturally vary in their duration. Our research focuses on the efficient generation of long-form, variable-length stereo music and sounds at 44.1kHz using text prompts with a generative model. Stable Audio is based on latent diffusion, with its latent defined by a fully-convolutional variational autoencoder. It is conditioned on text prompts as well as timing embeddings, allowing for fine control over both the content and length of the generated music and sounds. Stable Audio is capable of rendering stereo signals of up to 95 sec at 44.1kHz in 8 sec on an A100 GPU. Despite its compute efficiency and fast inference, it is among the best on two public text-to-music and text-to-audio benchmarks and, unlike state-of-the-art models, can generate music with structure and stereo sounds.
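To put the reported speed in perspective, the short sketch below works out how many output samples a 95 sec stereo clip at 44.1kHz contains and how much faster than real time an 8 sec render is. The figures come from the abstract; the function and constant names are illustrative assumptions and are not part of Stable Audio.

```python
# Back-of-the-envelope arithmetic for the figures quoted in the abstract.
# All names here are hypothetical helpers, not part of any Stable Audio API.

SAMPLE_RATE_HZ = 44_100   # 44.1kHz, from the abstract
CHANNELS = 2              # stereo, from the abstract
MAX_DURATION_S = 95       # longest clip quoted in the abstract
RENDER_TIME_S = 8         # render time on an A100 GPU quoted in the abstract


def output_samples(duration_s, sample_rate_hz=SAMPLE_RATE_HZ, channels=CHANNELS):
    """Total number of audio samples across all channels."""
    return int(duration_s * sample_rate_hz) * channels


def real_time_factor(duration_s, render_time_s):
    """Seconds of audio produced per second of compute."""
    return duration_s / render_time_s


if __name__ == "__main__":
    print("samples:", output_samples(MAX_DURATION_S))
    print("real-time factor:", real_time_factor(MAX_DURATION_S, RENDER_TIME_S))
```

Under these numbers the model produces roughly 8.4 million samples per clip at close to 12 times real-time speed, which is why the diffusion is run in a compressed latent space rather than directly on the waveform.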

