Dynamic Convolutional Neural Networks as Efficient Pre-trained Audio Models

Florian Schmid, Khaled Koutini, Gerhard Widmer

Abstract

The introduction of large-scale audio datasets, such as AudioSet, paved the way for Transformers to conquer the audio domain and replace CNNs as the state-of-the-art neural network architecture for many tasks. Audio Spectrogram Transformers are excellent at exploiting large datasets, creating powerful pre-trained models that surpass CNNs when fine-tuned on downstream tasks. However, current popular Audio Spectrogram Transformers are demanding in terms of computational complexity compared to CNNs. Recently, we have shown that, by employing Transformer-to-CNN Knowledge Distillation, efficient CNNs can catch up with and even outperform Transformers on large datasets. In this work, we extend this line of research and increase the capacity of efficient CNNs by introducing dynamic CNN blocks, constructed of dynamic non-linearities, dynamic convolutions and attention mechanisms. We show that these dynamic CNNs outperform traditional efficient CNNs, in terms of the performance-complexity trade-off and parameter efficiency, at the task of audio tagging on the large-scale AudioSet. Our experiments further indicate that the introduced dynamic CNNs achieve better performance on downstream tasks and scale up well, attaining Transformer performance and even outperforming them on AudioSet and several downstream tasks.
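To make the term "dynamic convolution" concrete, here is a minimal, generic sketch of the idea: several candidate kernels are mixed with input-dependent attention weights, and the mixed kernel is then applied as an ordinary convolution. It is not taken from the authors' code. The function name `dynamic_conv1d`, the single-channel 1D setting, and the scalar attention parameters `attn_weights` and `attn_bias` are assumptions made purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_conv1d(signal, kernels, attn_weights, attn_bias):
    """Hypothetical single-channel dynamic convolution (valid padding).

    signal:       list of floats, the input sequence
    kernels:      list of K candidate kernels, all of the same length
    attn_weights: K floats mapping the pooled input to a score (assumed form)
    attn_bias:    K floats added to the scores (assumed form)
    """
    # 1. Summarise the input with global average pooling.
    context = sum(signal) / len(signal)

    # 2. Turn the summary into one attention weight per candidate kernel.
    scores = [w * context + b for w, b in zip(attn_weights, attn_bias)]
    attention = softmax(scores)

    # 3. Mix the candidate kernels into a single input-dependent kernel.
    size = len(kernels[0])
    mixed = [sum(a * k[i] for a, k in zip(attention, kernels)) for i in range(size)]

    # 4. Apply the mixed kernel as a plain sliding-window correlation.
    out = [
        sum(mixed[j] * signal[i + j] for j in range(size))
        for i in range(len(signal) - size + 1)
    ]
    return out, attention


if __name__ == "__main__":
    kernels = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
    out, attention = dynamic_conv1d([1.0, 2.0, 3.0, 4.0], kernels, [0.5, -0.5], [0.0, 0.0])
    print("attention:", attention)
    print("output:", out)
```

Because the kernels are mixed before the convolution, only one convolution is computed per input regardless of how many candidate kernels there are; the extra cost lies in the small attention computation and in storing the additional kernels.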

Code Repositories

fschmid56/efficientat (official implementation, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
audio-classification-on-audioset | DyMN-L (Audio-Only, Single) | Test mAP: 0.490
audio-classification-on-esc-50 | DyMN-L | Accuracy (5-fold): 97.4; Top-1 Accuracy: 97.4; Pre-training dataset: AudioSet
audio-classification-on-fsd50k | MN | mAP: 65.6
audio-classification-on-fsd50k | DyMN-L | mAP: 65.5
audio-tagging-on-audioset | DyMN-L (Audio-Only, Single) | mean average precision: 0.490
instrument-recognition-on-openmic-2018 | DyMN-L | mean average precision: 0.855
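
Several of the metrics listed above are mean average precision (mAP), reported sometimes on a 0 to 1 scale (0.490) and sometimes as a percentage (65.5). The sketch below shows one common way to compute it for multi-label tagging: average precision per class from ranked scores, then the mean over classes. The exact evaluation protocol behind the listed numbers is not stated here, so the tie handling and the skipping of classes without positives are assumptions, and the function names are illustrative.

```python
def average_precision(labels, scores):
    """Average precision for one class.

    labels: list of 0/1 ground-truth values, one per clip
    scores: list of predicted scores, one per clip
    Returns None when the class has no positive clip.
    """
    # Rank clips from highest to lowest score (ties keep input order).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            # Precision at the rank of each positive clip.
            precision_sum += hits / rank
    if hits == 0:
        return None
    return precision_sum / hits


def mean_average_precision(labels_per_class, scores_per_class):
    """Mean of per-class average precision, skipping classes without positives."""
    values = []
    for labels, scores in zip(labels_per_class, scores_per_class):
        ap = average_precision(labels, scores)
        if ap is not None:
            values.append(ap)
    if not values:
        raise ValueError("no class has a positive example")
    return sum(values) / len(values)


if __name__ == "__main__":
    labels = [[1, 0, 1, 0], [0, 1, 0, 0]]
    scores = [[0.9, 0.8, 0.7, 0.1], [0.2, 0.9, 0.1, 0.3]]
    print("mAP:", mean_average_precision(labels, scores))
```

Multiplying the result by 100 gives the percentage form used in some of the entries.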
