FLUX that Plays Music
Zhengcong Fei, Mingyuan Fan, Changqian Yu, Junshi Huang

Abstract
This paper explores a simple extension of diffusion-based rectified flow Transformers for text-to-music generation, termed FluxMusic. Generally, following the design of the advanced Flux model (https://github.com/black-forest-labs/flux), we transfer it into a latent VAE space of the mel-spectrogram. The model first applies a sequence of independent attention operations to the double text-music stream, followed by a stacked single music stream for denoised patch prediction. We employ multiple pre-trained text encoders to sufficiently capture caption semantic information as well as to allow inference flexibility. In between, coarse textual information, in conjunction with time step embeddings, is utilized in a modulation mechanism, while fine-grained textual details are concatenated with the music patch sequence as inputs. Through an in-depth study, we demonstrate that rectified flow training with an optimized architecture significantly outperforms established diffusion methods for the text-to-music task, as evidenced by various automatic metrics and human preference evaluations. Our experimental data, code, and model weights are made publicly available at: https://github.com/feizc/FluxMusic.
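As a concrete reading of the block layout described in the abstract, the sketch below mocks up the two stages in PyTorch: double-stream blocks that run joint attention over the concatenated text and music token sequences, then single-stream blocks over the music tokens alone, with the coarse (pooled) text embedding plus timestep embedding injected through an adaLN-style modulation. All class names, dimensions, and layer choices here are illustrative assumptions, not the released FluxMusic implementation.

```python
# Minimal sketch of a double-stream -> single-stream layout with adaLN-style
# modulation, assuming the structure described in the abstract. Names and
# hyperparameters are hypothetical, for illustration only.
import torch
import torch.nn as nn


class Modulation(nn.Module):
    """Maps the coarse text + timestep embedding to a scale and shift."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, 2 * dim)

    def forward(self, cond: torch.Tensor):
        scale, shift = self.proj(cond).chunk(2, dim=-1)
        return scale.unsqueeze(1), shift.unsqueeze(1)  # broadcast over tokens


class DoubleStreamBlock(nn.Module):
    """Joint attention over the concatenated text and music token streams."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm_txt = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm_mus = nn.LayerNorm(dim, elementwise_affine=False)
        self.mod_txt = Modulation(dim)
        self.mod_mus = Modulation(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, txt, mus, cond):
        s_t, b_t = self.mod_txt(cond)
        s_m, b_m = self.mod_mus(cond)
        t = self.norm_txt(txt) * (1 + s_t) + b_t
        m = self.norm_mus(mus) * (1 + s_m) + b_m
        joint = torch.cat([t, m], dim=1)      # attend across both streams
        out, _ = self.attn(joint, joint, joint)
        n_txt = txt.shape[1]
        return txt + out[:, :n_txt], mus + out[:, n_txt:]


class SingleStreamBlock(nn.Module):
    """Music-only attention block used after the double-stream stack."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.mod = Modulation(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, mus, cond):
        s, b = self.mod(cond)
        h = self.norm(mus) * (1 + s) + b
        out, _ = self.attn(h, h, h)
        return mus + out
```

Similarly, a minimal sketch of the rectified flow objective the abstract credits for the improvement over standard diffusion: sample a straight-line interpolant between noise and data, and regress the model's velocity prediction onto the constant direction. The interpolation convention is an assumption; implementations differ in sign and time parameterization.

```python
import torch


def rectified_flow_loss(model, x1, txt_tokens, cond):
    """x1: clean latent patches; model predicts the velocity field."""
    x0 = torch.randn_like(x1)                      # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform time samples
    t_ = t.view(-1, 1, 1)
    xt = (1 - t_) * x0 + t_ * x1                   # straight-line interpolant
    v_pred = model(xt, txt_tokens, cond, t)        # predicted velocity
    return ((v_pred - (x1 - x0)) ** 2).mean()      # regress onto x1 - x0
```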
Code Repositories
https://github.com/feizc/FluxMusic
Benchmarks
| Benchmark | Methodology | FAD ↓ | IS ↑ | KL_passt ↓ |
|---|---|---|---|---|
| Text-to-Music Generation on MusicCaps | FluxMusic | 1.43 | 2.98 | 1.25 |
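For reference, the FAD column is conventionally the Fréchet distance between Gaussian fits of embedding statistics from reference and generated audio clips. A minimal sketch of that distance computation, assuming the per-clip embeddings (e.g., from VGGish) have already been extracted upstream:

```python
# Fréchet distance between two Gaussians fit to audio embedding sets.
# The embedding extractor is assumed to exist upstream; this only shows
# the closed-form distance on precomputed embeddings.
import numpy as np
from scipy import linalg


def frechet_audio_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """ref_emb, gen_emb: (num_clips, embed_dim) arrays of clip embeddings."""
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(ref_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)   # matrix square root
    if np.iscomplexobj(covmean):            # discard numerical imaginary noise
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```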