5 months ago

Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation

Jiawei Huang; Yi Ren; Rongjie Huang; Dongchao Yang; Zhenhui Ye; Chen Zhang; Jinglin Liu; Xiang Yin; Zejun Ma; Zhou Zhao

Abstract

Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks, but they often suffer from semantic misalignment and poor temporal consistency due to limited natural language understanding and data scarcity. Additionally, the 2D spatial structures widely used in T2A works lead to unsatisfactory audio quality when generating variable-length audio samples, since they do not adequately prioritize temporal information. To address these challenges, we propose Make-An-Audio 2, a latent diffusion-based T2A method that builds on the success of Make-An-Audio. Our approach includes several techniques to improve semantic alignment and temporal consistency. First, we use pre-trained large language models (LLMs) to parse the text into structured <event & order> pairs for better temporal information capture. We also introduce another structured-text encoder to aid in learning semantic alignment during the diffusion denoising process. To improve the performance of variable-length generation and enhance temporal information extraction, we design a feed-forward Transformer-based diffusion denoiser. Finally, we use LLMs to augment and transform a large amount of audio-label data into audio-text datasets to alleviate the scarcity of temporal data. Extensive experiments show that our method outperforms baseline models in both objective and subjective metrics, and achieves significant gains in temporal information understanding, semantic consistency, and sound quality.
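To make the structured <event & order> idea concrete, here is a minimal sketch of how such pairs might be represented, serialized into a string for a structured-text encoder, and parsed back. The abstract does not specify the format, so the order vocabulary ("start", "mid", "end", "all"), the "@" separator, and the names EventOrder, to_structured_caption, and parse_structured_caption are all assumptions made for illustration. The LLM parsing step itself is not shown.

```python
from dataclasses import dataclass

# Assumed order vocabulary; the abstract does not list the allowed values.
ORDERS = ("start", "mid", "end", "all")


@dataclass(frozen=True)
class EventOrder:
    """One hypothetical <event & order> pair."""
    event: str
    order: str


def to_structured_caption(pairs, sep="@"):
    """Serialize pairs as '<event & order>' chunks joined by sep (assumed)."""
    for p in pairs:
        if p.order not in ORDERS:
            raise ValueError(f"unknown order: {p.order!r}")
    return sep.join(f"<{p.event} & {p.order}>" for p in pairs)


def parse_structured_caption(text, sep="@"):
    """Inverse of to_structured_caption; events must not contain sep."""
    pairs = []
    for chunk in text.split(sep):
        chunk = chunk.strip()
        if not (chunk.startswith("<") and chunk.endswith(">")):
            raise ValueError(f"malformed chunk: {chunk!r}")
        # Split on the last '&' so an event name may itself contain '&'.
        event, _, order = chunk[1:-1].rpartition("&")
        pairs.append(EventOrder(event.strip(), order.strip()))
    return pairs


if __name__ == "__main__":
    pairs = [EventOrder("dog barking", "start"), EventOrder("bird chirping", "all")]
    caption = to_structured_caption(pairs)
    print(caption)
    print(parse_structured_caption(caption))
```

In this sketch a caption such as "a dog barks, with birds chirping throughout" would first be mapped by an LLM to the two pairs shown under the main guard. Only the data format after that step is illustrated here.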

Code Repositories

bytedance/make-an-audio-2
pytorch
Mentioned in GitHub

Benchmarks

Benchmark                      Methodology      Metrics
audio-generation-on-audiocaps  Make-An-Audio 2  FAD: 1.80, FD: 11.75
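For readers unfamiliar with the metrics, FAD (Fréchet Audio Distance) and FD (Fréchet Distance) are both commonly computed as the Fréchet distance between two Gaussians fitted to embeddings of reference and generated audio. Lower is better. This page does not state which embedding models were used, so the formula below is the general definition rather than a description of the exact evaluation setup. The symbols \mu_r, \Sigma_r (reference statistics) and \mu_g, \Sigma_g (generated statistics) are introduced here for illustration.

```latex
\[
  d^2\bigl(\mathcal{N}(\mu_r, \Sigma_r),\, \mathcal{N}(\mu_g, \Sigma_g)\bigr)
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\Bigl(\Sigma_r + \Sigma_g - 2\,\bigl(\Sigma_r \Sigma_g\bigr)^{1/2}\Bigr)
\]
```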
