MossFormer: Pushing the Performance Limit of Monaural Speech Separation using Gated Single-Head Transformer with Convolution-Augmented Joint Self-Attentions
Shengkui Zhao, Bin Ma

Abstract
Transformer-based models have provided significant performance improvements in monaural speech separation. However, there is still a performance gap compared to a recently proposed upper bound. The major limitation of current dual-path Transformer models is the inefficient modelling of long-range elemental interactions and local feature patterns. In this work, we achieve the upper bound by proposing a gated single-head transformer architecture with convolution-augmented joint self-attentions, named MossFormer (Monaural speech separation TransFormer). To effectively solve the indirect elemental interactions across chunks in the dual-path architecture, MossFormer employs a joint local and global self-attention architecture that simultaneously performs a full-computation self-attention on local chunks and a linearised low-cost self-attention over the full sequence. The joint attention enables MossFormer to model full-sequence elemental interactions directly. In addition, we employ a powerful attentive gating mechanism with simplified single-head self-attentions. Besides the attentive long-range modelling, we also augment MossFormer with convolutions for position-wise local pattern modelling. As a consequence, MossFormer significantly outperforms the previous models and achieves state-of-the-art results on the WSJ0-2/3mix and WHAM!/WHAMR! benchmarks. Our model achieves the SI-SDRi upper bound of 21.2 dB on WSJ0-3mix and is only 0.3 dB below the upper bound of 23.1 dB on WSJ0-2mix.
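
The joint local and global self-attention described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the softmax scoring inside chunks, the ReLU feature map for the linearised branch, the sigmoid output gate, the simple summation of the two branches, and all function and weight names are assumptions chosen for illustration, and the convolutional modules are omitted.

```python
import numpy as np


def local_chunk_attention(q, k, v, chunk_size):
    """Full (quadratic) softmax attention inside non-overlapping chunks."""
    seq_len, dim = q.shape
    assert seq_len % chunk_size == 0, "pad the sequence to a multiple of chunk_size"
    n_chunks = seq_len // chunk_size
    # Reshape to (n_chunks, chunk_size, dim) so each chunk attends only to itself.
    qc = q.reshape(n_chunks, chunk_size, dim)
    kc = k.reshape(n_chunks, chunk_size, dim)
    vc = v.reshape(n_chunks, chunk_size, v.shape[-1])
    scores = qc @ kc.transpose(0, 2, 1) / np.sqrt(dim)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return (weights @ vc).reshape(seq_len, v.shape[-1])


def global_linear_attention(q, k, v, eps=1e-6):
    """Linearised attention over the full sequence, linear in seq_len."""
    # Assumed feature map: ReLU plus a small constant keeps every weight positive.
    phi_q = np.maximum(q, 0.0) + eps
    phi_k = np.maximum(k, 0.0) + eps
    kv = phi_k.T @ v                        # (dim, dim_v) summary of all keys and values
    normaliser = phi_q @ phi_k.sum(axis=0)  # (seq_len,)
    return (phi_q @ kv) / normaliser[:, None]


def joint_attention(x, w_q, w_k, w_v, w_gate, chunk_size):
    """Single-head joint attention followed by an element-wise sigmoid gate."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attended = local_chunk_attention(q, k, v, chunk_size) + global_linear_attention(q, k, v)
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate)))  # values in (0, 1)
    return gate * attended


rng = np.random.default_rng(0)
seq_len, dim, chunk_size = 16, 8, 4
x = rng.standard_normal((seq_len, dim))
w_q, w_k, w_v, w_gate = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(4))
out = joint_attention(x, w_q, w_k, w_v, w_gate, chunk_size)
print(out.shape)  # (16, 8)
```

In this sketch the local branch costs O(seq_len × chunk_size) score evaluations, while the global branch summarises all keys and values in a single dim × dim matrix, so every position still receives information from the whole sequence without a full seq_len × seq_len attention map.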
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| speech-separation-on-wham | MossFormer (L) + DM | SI-SDRi: 17.3 |
| speech-separation-on-whamr | MossFormer (L) + DM | SI-SDRi: 16.3 |
| speech-separation-on-wsj0-2mix | MossFormer (L) + DM | SI-SDRi: 22.8; MACs (G): 86.1; Number of parameters (M): 42.1 |
| speech-separation-on-wsj0-2mix | MossFormer (M) + DM | SI-SDRi: 22.5 |
| speech-separation-on-wsj0-2mix-16k | MossFormer2 | SI-SDRi: 20.5 |
| speech-separation-on-wsj0-3mix | MossFormer (M) + DM | SI-SDRi: 20.8 |
| speech-separation-on-wsj0-3mix | MossFormer (L) + DM | SI-SDRi: 21.2 |
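
The SI-SDRi values above are improvements in scale-invariant signal-to-distortion ratio, in dB, of the separated signal over the unprocessed mixture. The sketch below follows the commonly used SI-SDR definition; it is not taken from the paper's evaluation code, the function names and synthetic signals are hypothetical, and benchmark scripts additionally resolve the speaker permutation, which is omitted here.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB."""
    # Zero-mean both signals (a common convention, assumed here).
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the optimally scaled target.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))


def si_sdr_improvement(estimate, reference, mixture):
    """SI-SDRi: gain of the estimate over the raw mixture, in dB."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


# Synthetic example: two random "speakers" and an estimate that
# attenuates the interfering speaker to 10% of its amplitude.
rng = np.random.default_rng(0)
speaker_1 = rng.standard_normal(4000)
speaker_2 = rng.standard_normal(4000)
mixture = speaker_1 + speaker_2
estimate = speaker_1 + 0.1 * speaker_2
improvement = si_sdr_improvement(estimate, speaker_1, mixture)
print(f"SI-SDRi: {improvement:.1f} dB")
```

Because the target is rescaled by projection, multiplying the estimate by any non-zero constant leaves the score unchanged, which is what makes the metric scale-invariant.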