Zero-Shot Audio Captioning via Audibility Guidance

Tal Shaharabany; Ariel Shaulov; Lior Wolf


Abstract

The task of audio captioning is similar in essence to tasks such as image and video captioning. However, it has received much less attention. We propose three desiderata for captioning audio -- (i) fluency of the generated text, (ii) faithfulness of the generated text to the input audio, and the somewhat related (iii) audibility, which is the quality of being able to be perceived based only on audio. Our method is a zero-shot method, i.e., we do not learn to perform captioning. Instead, captioning occurs as an inference process that involves three networks that correspond to the three desired qualities: (i) a Large Language Model, in our case, for reasons of convenience, GPT-2, (ii) a model that provides a matching score between an audio file and a text, for which we use a multimodal matching network called ImageBind, and (iii) a text classifier, trained using a dataset we collected automatically by instructing GPT-4 with prompts designed to direct the generation of both audible and inaudible sentences. We present our results on the AudioCaps dataset, demonstrating that audibility guidance significantly enhances performance compared to the baseline, which lacks this objective.
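The abstract describes inference as combining three signals: language-model fluency, audio-text matching, and audibility. A minimal sketch of that idea is shown below, assuming candidates are re-ranked by a weighted sum of the three scores; the scoring values and weight names (`w_match`, `w_aud`) are illustrative stand-ins, not the paper's actual GPT-2, ImageBind, or classifier models.

```python
# Hypothetical sketch: re-rank candidate captions by a weighted sum of the
# three desiderata from the paper. Real implementations would obtain
# fluency from an LM (GPT-2), audio_match from ImageBind, and audibility
# from the trained text classifier; here they are given as plain numbers.

def guided_score(fluency, audio_match, audibility, w_match=1.0, w_aud=0.5):
    """Combine fluency (LM log-prob), audio-text matching, and audibility
    into a single ranking score. Weights are illustrative assumptions."""
    return fluency + w_match * audio_match + w_aud * audibility

def pick_best(candidates):
    """candidates: list of (text, fluency, audio_match, audibility)."""
    return max(candidates, key=lambda c: guided_score(c[1], c[2], c[3]))[0]

# Toy example: an audible caption that matches the audio should win over
# a fluent but visual (inaudible) caption with a low matching score.
candidates = [
    ("a dog barks in the distance", -1.2, 0.8, 0.95),
    ("a red car is parked outside", -1.0, 0.1, 0.10),
]
best = pick_best(candidates)  # selects the audible, audio-matched caption
```

In the paper this guidance is applied during zero-shot decoding rather than as a post-hoc re-ranking, but the score-combination principle is the same.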

Benchmarks

Benchmark                                   Methodology          Metrics
zero-shot-audio-captioning-on-audiocaps     Shaharabany et al.   BLEU-4: 9.8; CIDEr: 9.2; METEOR: 8.6; ROUGE-L: 8.2
