
AUTOMATED AUDIO CAPTIONING BY FINE-TUNING BART WITH AUDIOSET TAGS

Christophe Cerisara, Romain Serizel, Félix Gontier

Abstract

Automated audio captioning is the multimodal task of describing environmental audio recordings with fluent natural language. Most current methods utilize pre-trained analysis models to extract relevant semantic content from the audio input. However, prior information on language modeling is rarely introduced, and corresponding architectures are limited in capacity due to data scarcity. In this paper, we present a method leveraging the linguistic information contained in BART, a large-scale conditional language model with general-purpose pre-training. Caption generation is conditioned on sequences of textual AudioSet tags. This input is enriched with temporally aligned audio embeddings, which allows the model to improve sound event recognition. The full BART architecture is fine-tuned with few additional parameters. Experimental results demonstrate that, beyond the scaling properties of the architecture, language-only pre-training improves the text quality in the multimodal setting of audio captioning. The best model achieves state-of-the-art performance on AudioCaps with 46.5 SPIDEr.
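The abstract describes enriching the textual tag input with temporally aligned audio embeddings before the BART encoder. A minimal NumPy sketch of that idea follows; it assumes (hypothetically, since the abstract gives no dimensions) that each tag token position has one aligned audio embedding, and that the "few additional parameters" are a single linear projection from the audio embedding space to the model dimension, summed with the token embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_audio = 768, 527   # assumed: BART-base hidden size; hypothetical audio embedding size
n_tokens = 12                 # length of the tokenized AudioSet tag sequence

# Token embeddings of the textual tag sequence (stand-in for BART's embedding layer).
tag_embeds = rng.standard_normal((n_tokens, d_model))

# One audio embedding per token position, temporally aligned (hypothetical alignment scheme).
audio_embeds = rng.standard_normal((n_tokens, d_audio))

# The few additional trainable parameters: a linear projection audio -> model dim.
W_proj = rng.standard_normal((d_audio, d_model)) * 0.02

# Enrich the tag token embeddings with projected audio information; the result
# would be fed to the (fully fine-tuned) BART encoder in place of plain text embeddings.
encoder_input = tag_embeds + audio_embeds @ W_proj

print(encoder_input.shape)  # (12, 768)
```

This is only an illustration of the fusion step named in the abstract; the actual alignment and projection details are specified in the full paper, not here.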

