8 months ago

Image Captioning

Audio and Speech Processing

Leonard Salewski Stefan Fauth A. Sophia Koepke Zeynep Akata

Abstract

Zero-shot audio captioning aims at automatically generating descriptive textual captions for audio content without prior training for this task. Different from speech recognition, which translates audio content that contains spoken language into text, audio captioning is commonly concerned with ambient sounds, or sounds produced by a human performing an action. Inspired by zero-shot image captioning methods, we propose ZerAuCap, a novel framework for summarising such general audio signals in a text caption without requiring task-specific training. In particular, our framework exploits a pre-trained large language model (LLM) for generating the text, which is guided by a pre-trained audio-language model to produce captions that describe the audio content. Additionally, we use audio context keywords that prompt the language model to generate text that is broadly relevant to sounds. Our proposed framework achieves state-of-the-art results in zero-shot audio captioning on the AudioCaps and Clotho datasets. Our code is available at https://github.com/ExplainableML/ZerAuCap.
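The idea of a language model guided by an audio-language model can be illustrated with a toy sketch. This is not the ZerAuCap implementation: the lookup-table "language model" (`TOY_LM`), the tag-overlap similarity function (`toy_audio_text_similarity`), the weight `beta`, and the fixed prompt are all hypothetical stand-ins chosen so the example runs without any pre-trained models. It only shows the general pattern of re-scoring each candidate next word by how well the resulting text matches the audio.

```python
import math

# Hypothetical stand-in for a pre-trained language model:
# maps the last word to (next word, probability) candidates.
TOY_LM = {
    "a": [("person", 0.5), ("dog", 0.3), ("car", 0.2)],
    "dog": [("sleeps", 0.6), ("barks", 0.4)],
    "person": [("talks", 0.7), ("walks", 0.3)],
    "car": [("drives", 1.0)],
}


def toy_audio_text_similarity(audio_tags, text):
    """Hypothetical stand-in for an audio-language model.

    Returns the fraction of words in `text` that appear in `audio_tags`,
    a set of words assumed to describe the audio clip.
    """
    words = text.split()
    hits = sum(1 for w in words if w in audio_tags)
    return hits / len(words)


def guided_caption(audio_tags, prompt="a", steps=2, beta=2.0):
    """Greedy decoding where each candidate word is scored by
    log p_LM(word) + beta * similarity(audio, text so far + word)."""
    words = prompt.split()
    for _ in range(steps):
        candidates = TOY_LM.get(words[-1])
        if not candidates:
            break  # the toy model has no continuation for this word

        def score(candidate):
            word, prob = candidate
            text = " ".join(words + [word])
            return math.log(prob) + beta * toy_audio_text_similarity(audio_tags, text)

        best_word, _ = max(candidates, key=score)
        words.append(best_word)
    return " ".join(words)
```

With `audio_tags = {"dog", "barks"}`, `guided_caption(audio_tags)` returns `"a dog barks"`, whereas `guided_caption(audio_tags, beta=0.0)` ignores the audio and returns the language model's own preference, `"a person talks"`.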

