8 months ago

Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang, Yuxuan Wang, Mark D. Plumbley

Abstract

Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called "language of audio" (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pretrained AudioMAE and latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech demonstrate state-of-the-art or competitive performance against previous approaches. Our code, pretrained model, and demo are available at https://audioldm.github.io/audioldm2.
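
The data flow described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the three functions below are hypothetical placeholders for AudioMAE, the GPT-2 translator, and the latent diffusion model, and the names, sizes (`LOA_LEN`, `LOA_DIM`), and update rules are assumptions chosen only to make the two paths into LOA visible.

```python
import random
from typing import List, Sequence

# Toy "language of audio": a short sequence of feature vectors.
LOA = List[List[float]]

LOA_LEN = 4  # hypothetical number of LOA frames
LOA_DIM = 2  # hypothetical feature size per frame


def audio_to_loa(waveform: Sequence[float]) -> LOA:
    """Placeholder for the AudioMAE encoder (audio -> LOA).

    Toy rule: split the waveform into LOA_LEN chunks and describe each
    chunk by its mean and its peak absolute value.
    """
    chunk = max(1, len(waveform) // LOA_LEN)
    frames = []
    for i in range(LOA_LEN):
        part = list(waveform[i * chunk:(i + 1) * chunk]) or [0.0]
        frames.append([sum(part) / len(part), max(abs(s) for s in part)])
    return frames


def condition_to_loa(prompt: str) -> LOA:
    """Placeholder for the GPT-2 translator (condition -> LOA).

    Toy rule: seed a random generator from the prompt so that the same
    prompt always maps to the same LOA.
    """
    rng = random.Random(sum(ord(c) for c in prompt))
    return [[rng.uniform(-1.0, 1.0) for _ in range(LOA_DIM)]
            for _ in range(LOA_LEN)]


def generate_from_loa(loa: LOA, num_samples: int = 16,
                      steps: int = 10, seed: int = 0) -> List[float]:
    """Placeholder for the LOA-conditioned latent diffusion model.

    Toy rule: start from Gaussian noise and, at each step, move halfway
    toward a target signal derived from the LOA frames.
    """
    rng = random.Random(seed)
    target = [loa[(i * len(loa)) // num_samples][0]
              for i in range(num_samples)]
    x = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]
    for _ in range(steps):
        x = [xi + 0.5 * (ti - xi) for xi, ti in zip(x, target)]
    return x


# Self-supervised path: the LOA comes from the audio itself, so no labels
# are needed to train the generator.
waveform = [0.5] * 16
reconstruction = generate_from_loa(audio_to_loa(waveform))

# Generation path: the LOA is predicted from a condition such as text.
sample = generate_from_loa(condition_to_loa("a dog barking"))
```

The point of the sketch is that both paths end in the same `generate_from_loa` call: the generator only ever sees LOA, which is what lets one model serve speech, music, and sound effects.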

