FLUX that Plays Music
Zhengcong Fei, Mingyuan Fan, Changqian Yu, Junshi Huang

Abstract
This paper explores a simple extension of diffusion-based rectified flow Transformers for text-to-music generation, termed FluxMusic. Generally, following the design of the advanced Flux model (https://github.com/black-forest-labs/flux), we transfer it into a latent VAE space of the mel-spectrogram. The model first applies a sequence of independent attention operations to the double text-music stream, followed by a stacked single music stream for denoised patch prediction. We employ multiple pre-trained text encoders to sufficiently capture caption semantic information as well as to allow inference flexibility. In between, coarse textual information, in conjunction with time step embeddings, is utilized in a modulation mechanism, while fine-grained textual details are concatenated with the music patch sequence as inputs. Through an in-depth study, we demonstrate that rectified flow training with an optimized architecture significantly outperforms established diffusion methods for the text-to-music task, as evidenced by various automatic metrics and human preference evaluations. Our experimental data, code, and model weights are made publicly available at: https://github.com/feizc/FluxMusic.
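As a concrete reading of the block layout described in the abstract, the sketch below mocks up the two stages in PyTorch: double-stream blocks that run joint attention over the concatenated text and music token sequences, then single-stream blocks over the music tokens alone, with the coarse (pooled) text embedding plus timestep embedding injected through an adaLN-style modulation. All class names, dimensions, and layer choices here are illustrative assumptions, not the released FluxMusic implementation.

```python
# Minimal sketch of a double-stream -> single-stream layout with adaLN-style
# modulation, assuming the structure described in the abstract. Names and
# hyperparameters are hypothetical, for illustration only.
import torch
import torch.nn as nn


class Modulation(nn.Module):
    """Maps the coarse text + timestep embedding to a scale and shift."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, 2 * dim)

    def forward(self, cond: torch.Tensor):
        scale, shift = self.proj(cond).chunk(2, dim=-1)
        return scale.unsqueeze(1), shift.unsqueeze(1)  # broadcast over tokens


class DoubleStreamBlock(nn.Module):
    """Joint attention over the concatenated text and music token streams."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm_txt = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm_mus = nn.LayerNorm(dim, elementwise_affine=False)
        self.mod_txt = Modulation(dim)
        self.mod_mus = Modulation(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, txt, mus, cond):
        s_t, b_t = self.mod_txt(cond)
        s_m, b_m = self.mod_mus(cond)
        t = self.norm_txt(txt) * (1 + s_t) + b_t
        m = self.norm_mus(mus) * (1 + s_m) + b_m
        joint = torch.cat([t, m], dim=1)      # attend across both streams
        out, _ = self.attn(joint, joint, joint)
        n_txt = txt.shape[1]
        return txt + out[:, :n_txt], mus + out[:, n_txt:]


class SingleStreamBlock(nn.Module):
    """Music-only attention block used after the double-stream stack."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.mod = Modulation(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, mus, cond):
        s, b = self.mod(cond)
        h = self.norm(mus) * (1 + s) + b
        out, _ = self.attn(h, h, h)
        return mus + out
```

Similarly, a minimal sketch of the rectified flow objective the abstract credits for the improvement over standard diffusion: sample a straight-line interpolant between noise and data, and regress the model's velocity prediction onto the constant direction. The interpolation convention is an assumption; implementations differ in sign and time parameterization.

```python
import torch


def rectified_flow_loss(model, x1, txt_tokens, cond):
    """x1: clean latent patches; model predicts the velocity field."""
    x0 = torch.randn_like(x1)                      # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform time samples
    t_ = t.view(-1, 1, 1)
    xt = (1 - t_) * x0 + t_ * x1                   # straight-line interpolant
    v_pred = model(xt, txt_tokens, cond, t)        # predicted velocity
    return ((v_pred - (x1 - x0)) ** 2).mean()      # regress onto x1 - x0
```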
Code Repositories
https://github.com/feizc/FluxMusic
Benchmarks
| Benchmark | Methodology | FAD ↓ | IS ↑ | KL_passt ↓ |
|---|---|---|---|---|
| Text-to-Music Generation on MusicCaps | FluxMusic | 1.43 | 2.98 | 1.25 |
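For reference, the FAD column is conventionally the Fréchet distance between Gaussian fits of embedding statistics from reference and generated audio clips. A minimal sketch of that distance computation, assuming the per-clip embeddings (e.g., from VGGish) have already been extracted upstream:

```python
# Fréchet distance between two Gaussians fit to audio embedding sets.
# The embedding extractor is assumed to exist upstream; this only shows
# the closed-form distance on precomputed embeddings.
import numpy as np
from scipy import linalg


def frechet_audio_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """ref_emb, gen_emb: (num_clips, embed_dim) arrays of clip embeddings."""
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(ref_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)   # matrix square root
    if np.iscomplexobj(covmean):            # discard numerical imaginary noise
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```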