Rethinking Transfer and Auxiliary Learning for Improving Audio Captioning Transformer

Hyun Joon Park, WooSeok Shin, Jin Sob Kim, Dongwon Kim, Seungjin Lee, Sung Won Han

Abstract

The performance of automated audio captioning (AAC) has improved considerably through transformer-based encoders and transfer learning. However, further gains are constrained by two problems: (1) a discrepancy in input patch size between the pretraining and fine-tuning steps, and (2) a lack of local-level relations between inputs and captions. In this paper, we propose a simple transfer learning scheme that, unlike previous methods, maintains the input patch size to avoid input discrepancies. Furthermore, we propose a patch-wise keyword estimation branch that uses an attention pooling method to effectively represent both global- and local-level information. Results on the AudioCaps dataset show that the proposed learning scheme and method contribute considerably to performance gains. Finally, visualization results demonstrate that the proposed attention pooling method effectively detects local-level information in the AAC system.
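The abstract does not specify the attention pooling used in the patch-wise keyword estimation branch; below is a minimal sketch of generic learned-score attention pooling over patch embeddings, with hypothetical names (`attention_pool`, `w`) not taken from the paper, to illustrate how per-patch (local) weights can produce a single (global) representation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(patches, w):
    """Pool patch embeddings [N, D] into one vector [D].

    A learned vector `w` scores each patch; the softmax weights expose
    local-level importance, and the weighted sum gives a global summary.
    """
    scores = patches @ w           # [N] one attention logit per patch
    weights = softmax(scores)      # [N] normalized over patches
    pooled = weights @ patches     # [D] attention-weighted sum
    return pooled, weights
```

The per-patch `weights` are what a visualization of local-level information would inspect, while `pooled` feeds the global prediction head.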

Benchmarks

Benchmark: audio-captioning-on-audiocaps
Methodology: Rethink-ACT (AST + TF + MIL)
Metrics:
  BLEU-4:  0.285
  CIDEr:   0.764
  METEOR:  0.242
  ROUGE-L: 0.504
  SPICE:   0.180
  SPIDEr:  0.472
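The SPIDEr score is, by definition, the arithmetic mean of SPICE and CIDEr, which the reported numbers are consistent with:

```python
cider = 0.764
spice = 0.180

# SPIDEr = (SPICE + CIDEr) / 2
spider = (cider + spice) / 2
print(round(spider, 3))  # 0.472, matching the table
```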
