Zero-Shot Audio Captioning via Audibility Guidance

Tal Shaharabany; Ariel Shaulov; Lior Wolf


Abstract

The task of audio captioning is similar in essence to tasks such as image and video captioning. However, it has received much less attention. We propose three desiderata for captioning audio -- (i) fluency of the generated text, (ii) faithfulness of the generated text to the input audio, and the somewhat related (iii) audibility, which is the quality of being able to be perceived based only on audio. Our method is a zero-shot method, i.e., we do not learn to perform captioning. Instead, captioning occurs as an inference process that involves three networks that correspond to the three desired qualities: (i) a Large Language Model, in our case, for reasons of convenience, GPT-2, (ii) a model that provides a matching score between an audio file and a text, for which we use a multimodal matching network called ImageBind, and (iii) a text classifier, trained using a dataset we collected automatically by instructing GPT-4 with prompts designed to direct the generation of both audible and inaudible sentences. We present our results on the AudioCaps dataset, demonstrating that audibility guidance significantly enhances performance compared to the baseline, which lacks this objective.
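The abstract describes inference as combining three signals: language-model fluency, audio-text matching, and audibility. A minimal sketch of that idea is shown below, assuming candidates are re-ranked by a weighted sum of the three scores; the scoring values and weight names (`w_match`, `w_aud`) are illustrative stand-ins, not the paper's actual GPT-2, ImageBind, or classifier models.

```python
# Hypothetical sketch: re-rank candidate captions by a weighted sum of the
# three desiderata from the paper. Real implementations would obtain
# fluency from an LM (GPT-2), audio_match from ImageBind, and audibility
# from the trained text classifier; here they are given as plain numbers.

def guided_score(fluency, audio_match, audibility, w_match=1.0, w_aud=0.5):
    """Combine fluency (LM log-prob), audio-text matching, and audibility
    into a single ranking score. Weights are illustrative assumptions."""
    return fluency + w_match * audio_match + w_aud * audibility

def pick_best(candidates):
    """candidates: list of (text, fluency, audio_match, audibility)."""
    return max(candidates, key=lambda c: guided_score(c[1], c[2], c[3]))[0]

# Toy example: an audible caption that matches the audio should win over
# a fluent but visual (inaudible) caption with a low matching score.
candidates = [
    ("a dog barks in the distance", -1.2, 0.8, 0.95),
    ("a red car is parked outside", -1.0, 0.1, 0.10),
]
best = pick_best(candidates)  # selects the audible, audio-matched caption
```

In the paper this guidance is applied during zero-shot decoding rather than as a post-hoc re-ranking, but the score-combination principle is the same.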

Benchmarks

Benchmark                                   Methodology          Metrics
zero-shot-audio-captioning-on-audiocaps     Shaharabany et al.   BLEU-4: 9.8; CIDEr: 9.2; METEOR: 8.6; ROUGE-L: 8.2
