3 months ago

Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding

Jizhong Liu, Gang Li, Junbo Zhang, Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Yujun Wang, Bin Wang

Abstract

Automated audio captioning (AAC) is an audio-to-text task that describes audio content in natural language. Recent advances in large language models (LLMs), together with improved training approaches for audio encoders, have opened up possibilities for improving AAC. We therefore explore enhancing AAC from three aspects: 1) an audio encoder pre-trained via consistent ensemble distillation (CED) is used to improve the effectiveness of acoustic tokens, with a querying transformer (Q-Former) bridging the modality gap to the LLM and compressing the acoustic tokens; 2) we investigate the advantages of using Llama 2 with 7B parameters as the decoder; 3) another pre-trained LLM corrects text errors caused by insufficient training data and annotation ambiguities. Both the audio encoder and the text decoder are optimized with low-rank adaptation (LoRA). Experiments show that each of these enhancements is effective. Our method obtains a SPIDEr-FL score of 33.0, outperforming the winner of DCASE 2023 Task 6A.
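The abstract says that both the audio encoder and the text decoder are optimized with LoRA. As a rough illustration of the general idea only (not the authors' implementation), the sketch below applies a low-rank update to a single frozen linear layer in plain Python. The class name `LoRALinear`, the layer sizes, the rank `r`, and the scaling `alpha` are all hypothetical choices made for this example.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A.

    Hypothetical illustration; sizes and hyperparameters are assumptions.
    """

    def __init__(self, d_in, d_out, r=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        # Frozen "pre-trained" weight; random here, standing in for a checkpoint.
        self.W = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_out)]
        # A starts random and B starts at zero, so the update is zero before training.
        self.A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.W, x)                    # frozen path: W x
        update = matvec(self.B, matvec(self.A, x))  # low-rank path: B (A x)
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        # Only A (r x d_in) and B (d_out x r) would receive gradients.
        r = len(self.A)
        return r * len(self.A[0]) + len(self.B) * r


layer = LoRALinear(d_in=8, d_out=6, r=2)
x = [1.0] * 8
y = layer.forward(x)
print(len(y), layer.trainable_params(), 8 * 6)
```

Because only `A` and `B` are trained, the number of updated parameters grows with the rank `r` rather than with the full size of `W`, which is what makes adapting a 7B-parameter decoder tractable.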

Code Repositories

frankenliu/LOAE (official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
audio-captioning-on-audiocaps | LOAE | CIDEr: 0.816, FENSE: 0.664, METEOR: 0.267, SPICE: 0.193, SPIDEr: 0.505, Sentence-BERT: 0.664
audio-captioning-on-clotho | LOAE | CIDEr: 0.513, FENSE: 0.538, METEOR: 0.197, SPICE: 0.147, SPIDEr: 0.330, SPIDEr-FL: 0.330, Sentence-BERT: 0.538
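
SPIDEr is defined as the arithmetic mean of CIDEr and SPICE, so the SPIDEr values listed above can be sanity-checked against the CIDEr and SPICE values for the same benchmark. The short sketch below does that; the helper name `spider` and the dictionary layout are choices made for this example, and the numbers are copied from the benchmark listing.

```python
def spider(cider, spice):
    """SPIDEr is the arithmetic mean of CIDEr and SPICE."""
    return (cider + spice) / 2


# Scores copied from the benchmark listing on this page.
reported = {
    "audio-captioning-on-audiocaps": {"CIDEr": 0.816, "SPICE": 0.193, "SPIDEr": 0.505},
    "audio-captioning-on-clotho": {"CIDEr": 0.513, "SPICE": 0.147, "SPIDEr": 0.330},
}

computed = {
    name: spider(scores["CIDEr"], scores["SPICE"])
    for name, scores in reported.items()
}

for name, value in computed.items():
    # Agreement is expected only up to the three-decimal rounding of the listing.
    print(f"{name}: computed {value:.4f}, reported {reported[name]['SPIDEr']:.3f}")
```

Both computed values agree with the listed SPIDEr scores to within the three-decimal rounding of the page.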
