MAT-SED: A Masked Audio Transformer with Masked-Reconstruction Based Pre-training for Sound Event Detection

Pengfei Cai, Yan Song, Kang Li, Haoyu Song, Ian McLoughlin


Abstract

Sound event detection (SED) methods that leverage a large pre-trained Transformer encoder network have shown promising performance in recent DCASE challenges. However, they still rely on an RNN-based context network to model temporal dependencies, largely due to the scarcity of labeled data. In this work, we propose a pure Transformer-based SED model with masked-reconstruction based pre-training, termed MAT-SED. Specifically, a Transformer with relative positional encoding is first designed as the context network, pre-trained by the masked-reconstruction task on all available target data in a self-supervised way. Both the encoder and the context network are jointly fine-tuned in a semi-supervised manner. Furthermore, a global-local feature fusion strategy is proposed to enhance the localization capability. Evaluation of MAT-SED on DCASE2023 task4 surpasses state-of-the-art performance, achieving 0.587/0.896 PSDS1/PSDS2 respectively.
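To make the masked-reconstruction idea concrete, the snippet below is a minimal PyTorch sketch, not the authors' implementation: it masks a random subset of frame-level features produced by a pre-trained encoder and trains a Transformer context network to reconstruct the masked frames. The class name, the 0.75 mask ratio, the layer sizes, and the use of PyTorch's standard encoder (which lacks the relative positional encoding used in the paper) are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class MaskedReconstructionPretrainer(nn.Module):
    """Minimal sketch of masked-reconstruction pre-training for a
    Transformer context network (illustrative, not the MAT-SED code)."""

    def __init__(self, feat_dim=768, n_layers=3, n_heads=8, mask_ratio=0.75):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.context_net = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.mask_token = nn.Parameter(torch.zeros(feat_dim))
        self.mask_ratio = mask_ratio
        self.loss_fn = nn.MSELoss()

    def forward(self, feats):
        # feats: (batch, time, feat_dim) frame-level features from a
        # pre-trained encoder (e.g. a large audio Transformer).
        b, t, _ = feats.shape
        # Randomly choose frames to mask.
        mask = torch.rand(b, t, device=feats.device) < self.mask_ratio
        masked = feats.clone()
        masked[mask] = self.mask_token
        # The context network predicts the original features of masked frames.
        recon = self.context_net(masked)
        # Reconstruction loss is computed only on masked positions.
        return self.loss_fn(recon[mask], feats[mask])

# Example usage with dummy frame embeddings (shapes are assumptions):
model = MaskedReconstructionPretrainer()
feats = torch.randn(4, 250, 768)
loss = model(feats)
loss.backward()
```

In this sketch the loss is restricted to masked positions, so the context network must infer missing frames from surrounding context, which is the self-supervised signal the pre-training stage relies on.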

Code Repositories

cai525/transformer4sed (Official, PyTorch)

Benchmarks

Benchmark                        Methodology    Metrics
sound-event-detection-on-desed   MAT-SED        PSDS1: 0.587, PSDS2: 0.896
