Chen Wenxi, Liang Yuzhe, Ma Ziyang, Zheng Zhisheng, Chen Xie

Abstract
Audio self-supervised learning (SSL) pre-training, which aims to learn good representations from unlabeled audio, has made remarkable progress. However, the extensive computational demands during pre-training pose a significant barrier to the potential application and optimization of audio SSL models. In this paper, inspired by the success of data2vec 2.0 in the image modality and Audio-MAE in the audio modality, we introduce the Efficient Audio Transformer (EAT) to further improve the effectiveness and efficiency of audio SSL. The proposed EAT adapts the bootstrap self-supervised training paradigm to the audio domain. A novel Utterance-Frame Objective (UFO) is designed to enhance the modeling capability of acoustic events. Furthermore, we reveal that the masking strategy is critical in audio SSL pre-training, and that superior audio representations can be obtained with large inverse block masks. Experimental results demonstrate that EAT achieves state-of-the-art (SOTA) performance on a range of audio-related tasks, including AudioSet (AS-2M, AS-20K), ESC-50, and SPC-2, along with a significant pre-training speedup of up to ~15x compared to existing audio SSL models.
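The abstract highlights large inverse block masking as a key ingredient: instead of dropping random blocks of spectrogram patches, a few contiguous blocks are kept visible and everything else is masked. The following is a minimal sketch of that idea on a 2-D patch grid; the grid size, block size, and keep ratio are illustrative assumptions, not EAT's actual configuration.

```python
import numpy as np

def inverse_block_mask(h, w, block=5, keep_ratio=0.2, rng=None):
    """Inverse block masking sketch: repeatedly unmask square blocks of
    patches until roughly `keep_ratio` of the grid is visible; the rest
    stays masked. Returns a boolean (h, w) array, True = masked."""
    rng = np.random.default_rng(rng)
    mask = np.ones((h, w), dtype=bool)          # start fully masked
    target_visible = int(h * w * keep_ratio)
    while (~mask).sum() < target_visible:
        top = rng.integers(0, h - block + 1)    # random block position
        left = rng.integers(0, w - block + 1)
        mask[top:top + block, left:left + block] = False  # keep visible
    return mask

# Example: an 8x64 patch grid (sizes are hypothetical), ~20% visible.
m = inverse_block_mask(8, 64, block=5, keep_ratio=0.2, rng=0)
```

Because the visible patches form large contiguous regions rather than scattered points, the encoder sees coherent time-frequency context while still reconstructing the majority of the input.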
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| audio-classification-on-audioset | EAT | Test mAP: 0.486 |
| audio-classification-on-balanced-audio-set | EAT | Mean AP: 40.3 |
| audio-classification-on-esc-50 | EAT | Accuracy (5-fold): 96.0 (pre-trained on AudioSet) |
| audio-classification-on-speech-commands-1 | EAT | Accuracy: 98.3±0.04 |