
Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model

Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, Soujanya Poria


Abstract

The immense scale of recent large language models (LLMs) enables many interesting properties, such as instruction- and chain-of-thought-based fine-tuning, which have significantly improved zero- and few-shot performance on many natural language processing (NLP) tasks. Inspired by these successes, we adopt the instruction-tuned LLM Flan-T5 as the text encoder for text-to-audio (TTA) generation, a task in which the goal is to generate audio from its textual description. Prior work on TTA either pre-trained a joint text-audio encoder or used a non-instruction-tuned model such as T5. Consequently, our latent diffusion model (LDM)-based approach, TANGO, outperforms the state-of-the-art AudioLDM on most metrics and stays comparable on the rest on the AudioCaps test set, despite training the LDM on a 63 times smaller dataset and keeping the text encoder frozen. The improvement might also be attributed to the adoption of audio-pressure-level-based sound mixing for training-set augmentation, whereas prior methods mix training samples at random.
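The conditioning recipe the abstract describes is straightforward to illustrate: encode the caption with a frozen instruction-tuned encoder and pass the per-token hidden states to the diffusion model's cross-attention. Below is a minimal sketch using Hugging Face transformers; the checkpoint name and the plain last-hidden-state conditioning are illustrative assumptions, not necessarily TANGO's exact configuration.

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Minimal sketch: frozen Flan-T5 text embeddings for conditioning an LDM.
# The checkpoint is an assumption; TANGO's actual pipeline lives in
# the declare-lab/tango repository.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
encoder = T5EncoderModel.from_pretrained("google/flan-t5-large")

# The abstract notes the text encoder stays frozen during LDM training.
encoder.eval()
for param in encoder.parameters():
    param.requires_grad = False

@torch.no_grad()
def encode_text(captions: list[str]) -> torch.Tensor:
    """Return per-token embeddings to use as cross-attention conditioning."""
    batch = tokenizer(captions, padding=True, truncation=True,
                      return_tensors="pt")
    return encoder(**batch).last_hidden_state  # (batch, seq_len, hidden)

embeddings = encode_text(["a dog barking in the distance"])
```

Keeping the encoder frozen means only the diffusion backbone is trained, which is part of why the approach remains competitive despite the much smaller training set.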

Code Repositories

declare-lab/tango (official, PyTorch)
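For reference, the repository exposes a high-level generation interface along the following lines. The class name, model identifier, and sample rate below follow the repository's README-style examples and should be treated as illustrative, since the API may change between releases.

```python
import soundfile as sf
from tango import Tango  # from the declare-lab/tango repository

# Illustrative usage; identifier and sample rate are assumptions based
# on the repository's documented examples.
tango = Tango("declare-lab/tango")

prompt = "An audience cheering and clapping"
audio = tango.generate(prompt)  # waveform samples as a numpy array
sf.write("cheering.wav", audio, samplerate=16000)
```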

Benchmarks

Benchmark: audio-generation-on-audiocaps
Methodology: TANGO
Metrics: FAD (Fréchet Audio Distance): 1.59; FD (Fréchet Distance): 24.52
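The abstract credits part of this gain to pressure-level-based sound mixing for training-set augmentation. A minimal sketch of such mixing is shown below, loosely following the between-class learning formulation of Tokozume et al. (2018), a standard reference for loudness-aware mixing; the exact weighting TANGO uses may differ.

```python
import numpy as np

def mix_by_pressure_level(x1: np.ndarray, x2: np.ndarray,
                          r: float) -> np.ndarray:
    """Mix two waveforms while accounting for their relative loudness.

    A sketch in the spirit of between-class learning (Tokozume et al.,
    2018); not necessarily TANGO's exact recipe. `r` in (0, 1) is the
    desired mixing ratio.
    """
    # Sound pressure level of each clip, from RMS amplitude (in dB).
    g1 = 20 * np.log10(np.sqrt(np.mean(x1 ** 2)) + 1e-8)
    g2 = 20 * np.log10(np.sqrt(np.mean(x2 ** 2)) + 1e-8)

    # Convert the ratio r into a gain p that compensates for the
    # loudness difference between the two clips.
    p = 1.0 / (1.0 + 10 ** ((g1 - g2) / 20) * (1 - r) / r)

    # Energy-normalized mix keeps the output at a comparable level.
    return (p * x1 + (1 - p) * x2) / np.sqrt(p ** 2 + (1 - p) ** 2)

rng = np.random.default_rng(0)
a, b = rng.standard_normal(16000), rng.standard_normal(16000)
out = mix_by_pressure_level(a, b, r=rng.uniform(0.1, 0.9))
```

Unlike uniformly random mixing, this weighting prevents a loud clip from drowning out a quiet one, so the synthetic mixture better matches its combined caption.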
