HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection

Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, Shlomo Dubnov

Abstract

Audio classification is the important task of mapping audio samples to their corresponding labels. Recently, transformer models with self-attention mechanisms have been adopted in this field. However, existing audio transformers require large amounts of GPU memory and long training times, while relying on pretrained vision models to achieve high performance, which limits their scalability in audio tasks. To address these problems, we introduce HTS-AT: an audio transformer with a hierarchical structure that reduces the model size and training time. It is further combined with a token-semantic module that maps the final outputs into class featuremaps, thus enabling the model to perform audio event detection (i.e., localization in time). We evaluate HTS-AT on three audio classification datasets, where it achieves new state-of-the-art (SOTA) results on AudioSet and ESC-50 and equals the SOTA on Speech Command V2. It also achieves better performance in event localization than previous CNN-based models. Moreover, HTS-AT requires only 35% of the model parameters and 15% of the training time of the previous audio transformer. These results demonstrate the high performance and high efficiency of HTS-AT.
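The idea of mapping final outputs into class featuremaps can be illustrated with a small sketch. This is not the authors' implementation: it assumes a hypothetical setup in which the model emits one token vector per time frame, and a single linear projection (the `weights` argument, invented here for illustration) turns each token into per-class scores. The function names, shapes, and the 0.5 threshold are all assumptions. The sketch only shows why a per-frame class map supports both clip-level classification (average over time) and localization in time (threshold per frame).

```python
import math
import random


def sigmoid(x):
    """Squash a raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def class_featuremap(tokens, weights):
    """Project each per-frame token onto per-class scores.

    tokens:  list of T vectors, each of length D (hypothetical model outputs)
    weights: list of C vectors, each of length D (hypothetical projection)
    returns: T x C nested list of per-frame class probabilities
    """
    return [
        [sigmoid(sum(t * w for t, w in zip(token, class_w))) for class_w in weights]
        for token in tokens
    ]


def clip_scores(featuremap):
    """Average per-frame scores over time: one score per class for the clip."""
    num_frames = len(featuremap)
    num_classes = len(featuremap[0])
    return [
        sum(frame[c] for frame in featuremap) / num_frames
        for c in range(num_classes)
    ]


def active_frames(featuremap, class_index, threshold=0.5):
    """Frame indices where the given class exceeds the (assumed) threshold."""
    return [t for t, frame in enumerate(featuremap) if frame[class_index] > threshold]


# Toy data: 8 time frames, 4-dimensional tokens, 3 classes (all arbitrary).
random.seed(0)
T, D, C = 8, 4, 3
tokens = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(T)]
weights = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(C)]

fmap = class_featuremap(tokens, weights)   # T x C class featuremap
scores = clip_scores(fmap)                 # clip-level classification
frames = active_frames(fmap, class_index=0)  # localization in time for class 0
```

In this toy version the same featuremap serves both tasks, which is the property the abstract attributes to the token-semantic module; how the actual module computes the map is described in the paper and the official repository.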

Code Repositories

retrocirce/hts-audio-transformer (official, PyTorch)

Benchmarks

| Benchmark | Methodology | Metrics |
|---|---|---|
| audio-classification-on-audioset | HTS-AT (Ensemble) | Test mAP: 0.487 |
| audio-classification-on-esc-50 | HTS-AT | Accuracy (5-fold): 97.0; Pre-training dataset: AudioSet; Top-1 Accuracy: 97.0 |
| keyword-spotting-on-google-speech-commands | HTS-AT | Google Speech Commands V2 35: 98.0 |
| sound-event-detection-on-desed | HTS-AT | event-based F1 score: 50.7 |
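For readers unfamiliar with the mAP metric reported for AudioSet, the following is a minimal sketch of mean average precision for multi-label classification: average precision (AP) is computed per class from the ranked predictions, then averaged across classes. The function names and the toy data are illustrative assumptions, and details such as tie handling may differ from the evaluation code actually used for the benchmark.

```python
def average_precision(scores, labels):
    """AP for one class: mean of precision values at each positive's rank."""
    # Rank clips from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    # A class with no positive clips contributes 0.0 in this sketch.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(score_matrix, label_matrix):
    """mAP: average the per-class AP over all classes.

    score_matrix, label_matrix: N x C nested lists (clips x classes).
    """
    num_classes = len(score_matrix[0])
    aps = []
    for c in range(num_classes):
        class_scores = [row[c] for row in score_matrix]
        class_labels = [row[c] for row in label_matrix]
        aps.append(average_precision(class_scores, class_labels))
    return sum(aps) / num_classes


# Toy example: 4 clips, 2 classes.
toy_scores = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6], [0.1, 0.4]]
toy_labels = [[1, 0], [0, 1], [1, 1], [0, 0]]

ap_class0 = average_precision([r[0] for r in toy_scores], [r[0] for r in toy_labels])
toy_map = mean_average_precision(toy_scores, toy_labels)
```

In the toy data, class 0 has positives at ranks 1 and 3, giving an AP of (1/1 + 2/3) / 2 = 5/6, while class 1 ranks both positives first for an AP of 1, so the mAP is 11/12.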
