
Abstract

Self-supervised learning (SSL) has emerged as a popular approach for learning audio representations. One goal of audio self-supervised pre-training is to transfer knowledge to downstream audio tasks, which generally include clip-level and frame-level tasks. While frame-level tasks are important for fine-grained acoustic scene/event understanding, prior studies primarily evaluate on clip-level downstream tasks. To tackle both clip-level and frame-level tasks, this paper proposes the Audio Teacher-Student Transformer (ATST), with a clip-level version (named ATST-Clip) and a frame-level version (named ATST-Frame), responsible for learning clip-level and frame-level representations, respectively. Both methods use a Transformer encoder and a teacher-student training scheme. We have carefully designed the view creation strategy for ATST-Clip and ATST-Frame. Specifically, ATST-Clip uses segment-wise data augmentations, and ATST-Frame integrates frame-wise data augmentations and masking. Experimental results show that our ATST-Frame model obtains state-of-the-art (SOTA) performance on most of the clip-level and frame-level downstream tasks. In particular, it outperforms other models by a large margin on the frame-level sound event detection task. In addition, the performance can be further improved by combining the two models through knowledge distillation. Our code is available online.
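
The abstract does not state how the teacher network is updated. As a minimal sketch, assuming the teacher tracks the student through an exponential moving average (EMA) of its weights, a common choice in teacher-student SSL, one update step could look like the following. The function name `ema_update`, the `decay` value, and the flat weight lists are hypothetical and chosen for illustration only; they are not taken from the ATST code.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights as an EMA of the student weights.

    Assumption: weights are flat lists of floats; a real model would
    apply the same rule to every parameter tensor.
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


# Toy example: the teacher drifts toward a fixed student over three steps.
teacher = [0.0, 0.0]
student = [1.0, -1.0]
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.5)
print(teacher)  # [0.875, -0.875]
```

In such a scheme only the student receives gradients; the slowly moving teacher supplies the targets the student is trained to match.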
