3 months ago

AUTOMATED AUDIO CAPTIONING BY FINE-TUNING BART WITH AUDIOSET TAGS

Christophe Cerisara, Romain Serizel, Félix Gontier

Abstract

Automated audio captioning is the multimodal task of describing environmental audio recordings with fluent natural language. Most current methods use pre-trained analysis models to extract relevant semantic content from the audio input. However, prior information on language modeling is rarely introduced, and the corresponding architectures are limited in capacity due to data scarcity. In this paper, we present a method that leverages the linguistic information contained in BART, a large-scale conditional language model with general-purpose pre-training. Caption generation is conditioned on sequences of textual AudioSet tags. This input is enriched with temporally aligned audio embeddings that allow the model to improve sound event recognition. The full BART architecture is fine-tuned with few additional parameters. Experimental results demonstrate that, beyond the scaling properties of the architecture, language-only pre-training improves text quality in the multimodal setting of audio captioning. The best model achieves state-of-the-art performance on AudioCaps with 46.5 SPIDEr.

Benchmarks

Benchmark | Methodology | Metrics
audio-captioning-on-audiocaps | BART + YAMNet + PANNs | CIDEr: 0.753; SPICE: 0.176; SPIDEr: 0.465
retrieval-augmented-few-shot-in-context-audio | Automated audio captioning by fine-tuning bart with audioset tags | CIDEr: 0.147
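
SPIDEr is conventionally defined as the arithmetic mean of CIDEr and SPICE, so the AudioCaps figures listed above can be sanity-checked with a few lines of Python. This is only a minimal sketch: the function name `spider` is illustrative and does not come from the paper or from any evaluation toolkit, and the inputs are the rounded scores as listed, so the result matches the reported 0.465 only up to rounding.

```python
def spider(cider: float, spice: float) -> float:
    """Return the SPIDEr score, taken here as the mean of CIDEr and SPICE."""
    return (cider + spice) / 2.0


# Rounded scores for BART + YAMNet + PANNs on AudioCaps, as listed above.
cider_score = 0.753
spice_score = 0.176

spider_score = spider(cider_score, spice_score)

# Show four decimals so the effect of rounding the inputs stays visible.
print(f"SPIDEr from listed CIDEr and SPICE: {spider_score:.4f}")
```

The abstract quotes the same score on a 0 to 100 scale (46.5), whereas the benchmark entries use a 0 to 1 scale.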
