7 months ago

Multimodal Representation

Audio and Speech Processing

Christophe Cerisara, Romain Serizel, Félix Gontier

Abstract

Automated audio captioning is the multimodal task of describing environmental audio recordings with fluent natural language. Most current methods utilize pre-trained analysis models to extract relevant semantic content from the audio input. However, prior information on language modeling is rarely introduced, and corresponding architectures are limited in capacity due to data scarcity. In this paper, we present a method leveraging the linguistic information contained in BART, a large-scale conditional language model with general-purpose pre-training. The caption generation is conditioned on sequences of textual AudioSet tags. This input is enriched with temporally aligned audio embeddings that allow the model to improve sound event recognition. The full BART architecture is fine-tuned with few additional parameters. Experimental results demonstrate that, beyond the scaling properties of the architecture, language-only pre-training improves the text quality in the multimodal setting of audio captioning. The best model achieves state-of-the-art performance on AudioCaps with 46.5 SPIDEr.
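
The conditioning described in the abstract can be illustrated with a small sketch. It is one plausible reading, not the authors' implementation: it assumes that each tag token is looked up in an embedding table and summed with a linear projection of the audio embedding from the same time frame. The function names (`linear`, `fuse_inputs`), the toy sizes, the random weights, and the tag token ids are all hypothetical, and the BART model itself is left out.

```python
import random


def linear(vec, weight, bias):
    """Apply y = W x + b to one vector (weight has shape out_dim x in_dim)."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def fuse_inputs(frame_tags, frame_audio, token_table, weight, bias):
    """Build one encoder input vector per tag token.

    frame_tags:  list over time frames, each a list of tag token ids
    frame_audio: list over time frames, each an audio embedding
    Every tag token is summed with the projected audio embedding of the
    frame it came from (the assumed "temporal alignment").
    """
    if len(frame_tags) != len(frame_audio):
        raise ValueError("tags and audio embeddings must cover the same frames")
    fused = []
    for tags, audio in zip(frame_tags, frame_audio):
        projected = linear(audio, weight, bias)  # audio dim -> model dim
        for token_id in tags:
            token_vec = token_table[token_id]
            fused.append([t + a for t, a in zip(token_vec, projected)])
    return fused


# Toy setup with hypothetical sizes and random parameters.
rng = random.Random(0)
D_MODEL, D_AUDIO, VOCAB = 8, 4, 10
token_table = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(VOCAB)]
weight = [[rng.gauss(0, 0.1) for _ in range(D_AUDIO)] for _ in range(D_MODEL)]
bias = [0.0] * D_MODEL

# Three time frames, each with the token ids of its detected tags.
frame_tags = [[3, 7], [1], [3, 5, 2]]
frame_audio = [[rng.gauss(0, 1) for _ in range(D_AUDIO)] for _ in frame_tags]

encoder_inputs = fuse_inputs(frame_tags, frame_audio, token_table, weight, bias)
print(len(encoder_inputs), len(encoder_inputs[0]))  # 6 tokens, 8 dimensions each
```

In this sketch the projection matrix and bias are the only parameters added on top of the language model, which is in line with the abstract's remark that fine-tuning needs few additional parameters. In a real system the fused vectors would be passed to the encoder in place of plain token embeddings.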
