Audio Captioning Transformer

Xinhao Mei, Xubo Liu, Qiushi Huang, Mark D. Plumbley, Wenwu Wang

Abstract

Audio captioning aims to automatically generate a natural language description of an audio clip. Most captioning models follow an encoder-decoder architecture, where the decoder predicts words based on the audio features extracted by the encoder. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are often used as the audio encoder. However, CNNs can be limited in modelling temporal relationships among the time frames in an audio signal, while RNNs can be limited in modelling the long-range dependencies among the time frames. In this paper, we propose an Audio Captioning Transformer (ACT), which is a full Transformer network based on an encoder-decoder architecture and is totally convolution-free. The proposed method has a better ability to model the global information within an audio signal as well as capture temporal relationships between audio events. We evaluate our model on AudioCaps, which is the largest audio captioning dataset publicly available. Our model shows competitive performance compared to other state-of-the-art approaches.
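The abstract describes a convolution-free encoder-decoder Transformer: the encoder attends over audio time frames, and the decoder autoregressively predicts caption tokens. As a rough sketch of that architecture (not the paper's exact model — the patch-embedding details, layer sizes, and vocabulary here are hypothetical simplifications), one could wire it up in PyTorch as follows:

```python
import torch
import torch.nn as nn

class AudioCaptioningSketch(nn.Module):
    """Minimal convolution-free encoder-decoder captioner.

    Hypothetical simplification of the ACT idea: each log-mel time
    frame is linearly embedded (no CNN front-end), a Transformer
    encoder models relationships among frames, and a Transformer
    decoder predicts the next caption token at every position.
    """

    def __init__(self, n_mels=64, d_model=128, vocab_size=1000,
                 nhead=4, num_layers=2):
        super().__init__()
        self.frame_proj = nn.Linear(n_mels, d_model)      # embed each time frame
        self.token_emb = nn.Embedding(vocab_size, d_model)  # embed caption tokens
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)     # per-token word logits

    def forward(self, mel, tokens):
        # mel: (batch, time, n_mels); tokens: (batch, seq) of word ids
        src = self.frame_proj(mel)
        tgt = self.token_emb(tokens)
        # causal mask so position i only attends to tokens < i
        causal = self.transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.transformer(src, tgt, tgt_mask=causal)
        return self.lm_head(out)                          # (batch, seq, vocab)

model = AudioCaptioningSketch()
mel = torch.randn(2, 50, 64)              # two clips of 50 log-mel frames
tokens = torch.randint(0, 1000, (2, 7))   # partial captions, 7 tokens each
logits = model(mel, tokens)
print(logits.shape)  # torch.Size([2, 7, 1000])
```

Training would minimize cross-entropy between `logits` and the shifted ground-truth captions; at inference the decoder is run token by token (greedy or beam search) from a start-of-sentence symbol.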

Code Repositories

XinhaoMei/ACT (official, PyTorch)

Benchmarks

Benchmark: audio-captioning-on-audiocaps
Methodology: CNN+Transformer
Metrics: CIDEr 0.693, SPICE 0.159, SPIDEr 0.426

Benchmark: retrieval-augmented-few-shot-in-context-audio
Methodology: Audio Captioning Transformer
Metrics: CIDEr 0.149
