Chen Wenxi, Liang Yuzhe, Ma Ziyang, Zheng Zhisheng, Chen Xie

Abstract
Audio self-supervised learning (SSL) pre-training, which aims to learn good representations from unlabeled audio, has made remarkable progress. However, the extensive computational demands during pre-training pose a significant barrier to the potential application and optimization of audio SSL models. In this paper, inspired by the success of data2vec 2.0 in the image modality and Audio-MAE in the audio modality, we introduce the Efficient Audio Transformer (EAT) to further improve the effectiveness and efficiency of audio SSL. The proposed EAT adapts the bootstrap self-supervised training paradigm to the audio domain. A novel Utterance-Frame Objective (UFO) is designed to enhance the modeling capability of acoustic events. Furthermore, we reveal that the masking strategy is critical in audio SSL pre-training, and that superior audio representations can be obtained with large inverse block masks. Experimental results demonstrate that EAT achieves state-of-the-art (SOTA) performance on a range of audio-related tasks, including AudioSet (AS-2M, AS-20K), ESC-50, and SPC-2, along with a significant pre-training speedup of up to ~15x compared to existing audio SSL models.
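The abstract highlights large inverse block masking as a key ingredient: instead of dropping random blocks of spectrogram patches, a few contiguous blocks are kept visible and everything else is masked. The following is a minimal sketch of that idea on a 2-D patch grid; the grid size, block size, and keep ratio are illustrative assumptions, not EAT's actual configuration.

```python
import numpy as np

def inverse_block_mask(h, w, block=5, keep_ratio=0.2, rng=None):
    """Inverse block masking sketch: repeatedly unmask square blocks of
    patches until roughly `keep_ratio` of the grid is visible; the rest
    stays masked. Returns a boolean (h, w) array, True = masked."""
    rng = np.random.default_rng(rng)
    mask = np.ones((h, w), dtype=bool)          # start fully masked
    target_visible = int(h * w * keep_ratio)
    while (~mask).sum() < target_visible:
        top = rng.integers(0, h - block + 1)    # random block position
        left = rng.integers(0, w - block + 1)
        mask[top:top + block, left:left + block] = False  # keep visible
    return mask

# Example: an 8x64 patch grid (sizes are hypothetical), ~20% visible.
m = inverse_block_mask(8, 64, block=5, keep_ratio=0.2, rng=0)
```

Because the visible patches form large contiguous regions rather than scattered points, the encoder sees coherent time-frequency context while still reconstructing the majority of the input.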
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| audio-classification-on-audioset | EAT | Test mAP: 0.486 |
| audio-classification-on-balanced-audio-set | EAT | Mean AP: 40.3 |
| audio-classification-on-esc-50 | EAT | Accuracy (5-fold): 96.0 (pre-trained on AudioSet) |
| audio-classification-on-speech-commands-1 | EAT | Accuracy: 98.3±0.04 |