Audio Captioning Transformer

Xinhao Mei, Xubo Liu, Qiushi Huang, Mark D. Plumbley, Wenwu Wang

Abstract

Audio captioning aims to automatically generate a natural language description of an audio clip. Most captioning models follow an encoder-decoder architecture, where the decoder predicts words based on the audio features extracted by the encoder. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are often used as the audio encoder. However, CNNs can be limited in modelling temporal relationships among the time frames in an audio signal, while RNNs can be limited in modelling the long-range dependencies among the time frames. In this paper, we propose an Audio Captioning Transformer (ACT), which is a full Transformer network based on an encoder-decoder architecture and is totally convolution-free. The proposed method has a better ability to model the global information within an audio signal as well as capture temporal relationships between audio events. We evaluate our model on AudioCaps, which is the largest audio captioning dataset publicly available. Our model shows competitive performance compared to other state-of-the-art approaches.
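The abstract describes a convolution-free encoder-decoder Transformer: the encoder attends over audio time frames, and the decoder autoregressively predicts caption tokens. As a rough sketch of that architecture (not the paper's exact model — the patch-embedding details, layer sizes, and vocabulary here are hypothetical simplifications), one could wire it up in PyTorch as follows:

```python
import torch
import torch.nn as nn

class AudioCaptioningSketch(nn.Module):
    """Minimal convolution-free encoder-decoder captioner.

    Hypothetical simplification of the ACT idea: each log-mel time
    frame is linearly embedded (no CNN front-end), a Transformer
    encoder models relationships among frames, and a Transformer
    decoder predicts the next caption token at every position.
    """

    def __init__(self, n_mels=64, d_model=128, vocab_size=1000,
                 nhead=4, num_layers=2):
        super().__init__()
        self.frame_proj = nn.Linear(n_mels, d_model)      # embed each time frame
        self.token_emb = nn.Embedding(vocab_size, d_model)  # embed caption tokens
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)     # per-token word logits

    def forward(self, mel, tokens):
        # mel: (batch, time, n_mels); tokens: (batch, seq) of word ids
        src = self.frame_proj(mel)
        tgt = self.token_emb(tokens)
        # causal mask so position i only attends to tokens < i
        causal = self.transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.transformer(src, tgt, tgt_mask=causal)
        return self.lm_head(out)                          # (batch, seq, vocab)

model = AudioCaptioningSketch()
mel = torch.randn(2, 50, 64)              # two clips of 50 log-mel frames
tokens = torch.randint(0, 1000, (2, 7))   # partial captions, 7 tokens each
logits = model(mel, tokens)
print(logits.shape)  # torch.Size([2, 7, 1000])
```

Training would minimize cross-entropy between `logits` and the shifted ground-truth captions; at inference the decoder is run token by token (greedy or beam search) from a start-of-sentence symbol.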

Code Repositories

XinhaoMei/ACT (official, PyTorch)

Benchmarks

Benchmark: audio-captioning-on-audiocaps
Methodology: CNN+Transformer
Metrics: CIDEr 0.693, SPICE 0.159, SPIDEr 0.426

Benchmark: retrieval-augmented-few-shot-in-context-audio
Methodology: Audio Captioning Transformer
Metrics: CIDEr 0.149
