3 months ago

MossFormer: Pushing the Performance Limit of Monaural Speech Separation using Gated Single-Head Transformer with Convolution-Augmented Joint Self-Attentions

Shengkui Zhao Bin Ma

Abstract

Transformer-based models have provided significant performance improvements in monaural speech separation. However, there is still a performance gap compared to a recently proposed upper bound. The major limitation of current dual-path Transformer models is the inefficient modelling of long-range elemental interactions and local feature patterns. In this work, we achieve the upper bound by proposing a gated single-head transformer architecture with convolution-augmented joint self-attentions, named MossFormer (Monaural speech separation TransFormer). To effectively solve the indirect elemental interactions across chunks in the dual-path architecture, MossFormer employs a joint local and global self-attention architecture that simultaneously performs a full-computation self-attention on local chunks and a linearised low-cost self-attention over the full sequence. The joint attention enables MossFormer to model full-sequence elemental interactions directly. In addition, we employ a powerful attentive gating mechanism with simplified single-head self-attentions. Besides the attentive long-range modelling, we also augment MossFormer with convolutions for position-wise local pattern modelling. As a consequence, MossFormer significantly outperforms the previous models and achieves state-of-the-art results on the WSJ0-2/3mix and WHAM!/WHAMR! benchmarks. Our model achieves the SI-SDRi upper bound of 21.2 dB on WSJ0-3mix and is only 0.3 dB below the upper bound of 23.1 dB on WSJ0-2mix.
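The joint local and global attention described in the abstract can be illustrated with a small pure-Python sketch. This is not the authors' implementation. It assumes softmax attention inside each chunk, a global term computed as Q(KᵀV)/n so that cost grows linearly with sequence length, and a plain sum to combine the two. The function names (`local_attention`, `global_linear_attention`, `joint_attention`) are hypothetical, and the gating and convolution modules are left out.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns of a list-of-rows matrix."""
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]

def local_attention(q, k, v, chunk):
    """Full (quadratic) softmax attention computed independently per chunk."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for start in range(0, len(q), chunk):
        qs, ks, vs = (m[start:start + chunk] for m in (q, k, v))
        scores = [[s * scale for s in row] for row in matmul(qs, transpose(ks))]
        weights = [softmax(row) for row in scores]
        out.extend(matmul(weights, vs))
    return out

def global_linear_attention(q, k, v):
    """Linearised attention over the full sequence: Q (K^T V) / n.

    K^T V is a d x d summary, so no n x n score matrix is ever formed.
    (Assumed form for illustration; the paper's exact variant may differ.)
    """
    n = len(q)
    summary = matmul(transpose(k), v)
    return [[x / n for x in row] for row in matmul(q, summary)]

def joint_attention(q, k, v, chunk):
    """Sum of chunk-local attention and full-sequence linear attention."""
    local = local_attention(q, k, v, chunk)
    glob = global_linear_attention(q, k, v)
    return [[a + b for a, b in zip(lr, gr)] for lr, gr in zip(local, glob)]
```

In this sketch every position receives information from the whole sequence through the global term, while the local term keeps exact attention within its own chunk. That is the sense in which the joint attention avoids the indirect cross-chunk interaction of a dual-path design.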

Code Repositories

Repository	Framework	Notes
alibabasglab/mossformer	pytorch	Mentioned in GitHub
modelscope/ClearerVoice-Studio	pytorch	Official; mentioned in GitHub

Benchmarks

Benchmark	Methodology	Metrics
speech-separation-on-wham	MossFormer (L) + DM	SI-SDRi (dB): 17.3
speech-separation-on-whamr	MossFormer (L) + DM	SI-SDRi (dB): 16.3
speech-separation-on-wsj0-2mix	MossFormer (L) + DM	MACs (G): 86.1; Number of parameters (M): 42.1; SI-SDRi (dB): 22.8
speech-separation-on-wsj0-2mix	MossFormer (M) + DM	SI-SDRi (dB): 22.5
speech-separation-on-wsj0-2mix-16k	MossFormer2	SI-SDRi (dB): 20.5
speech-separation-on-wsj0-3mix	MossFormer (M) + DM	SI-SDRi (dB): 20.8
speech-separation-on-wsj0-3mix	MossFormer (L) + DM	SI-SDRi (dB): 21.2
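
The SI-SDRi values in the table are improvements in scale-invariant signal-to-distortion ratio, measured in dB, of the separated output over the unprocessed mixture. The sketch below uses the standard zero-mean SI-SDR definition. The function names `si_sdr` and `si_sdr_improvement` are hypothetical, signals are plain Python lists of equal length, and nothing here reproduces the evaluation scripts behind the reported numbers.

```python
import math

def si_sdr(estimate, target):
    """Scale-invariant SDR in dB between two equal-length signals.

    Both signals are mean-centred, the estimate is projected onto the
    target, and the ratio of projected power to residual power is returned.
    A perfect estimate has zero residual and is not handled here.
    """
    est_mean = sum(estimate) / len(estimate)
    tgt_mean = sum(target) / len(target)
    est = [x - est_mean for x in estimate]
    tgt = [x - tgt_mean for x in target]
    # Optimal scaling of the target that best explains the estimate.
    alpha = sum(e * t for e, t in zip(est, tgt)) / sum(t * t for t in tgt)
    projection = [alpha * t for t in tgt]
    residual = [e - p for e, p in zip(est, projection)]
    signal_power = sum(p * p for p in projection)
    residual_power = sum(r * r for r in residual)
    return 10.0 * math.log10(signal_power / residual_power)

def si_sdr_improvement(estimate, mixture, target):
    """SI-SDRi: SI-SDR of the estimate minus SI-SDR of the raw mixture."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)
```

Because the estimate is rescaled onto the target before the ratio is taken, multiplying the separated signal by any non-zero constant leaves the score unchanged, so a model cannot raise its score by changing output gain.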
