MossFormer: Pushing the Performance Limit of Monaural Speech Separation using Gated Single-Head Transformer with Convolution-Augmented Joint Self-Attentions

Shengkui Zhao, Bin Ma


Abstract

Transformer-based models have provided significant performance improvements in monaural speech separation. However, there is still a performance gap compared to a recently proposed upper bound. The major limitation of current dual-path Transformer models is the inefficient modelling of long-range elemental interactions and local feature patterns. In this work, we achieve the upper bound by proposing a gated single-head transformer architecture with convolution-augmented joint self-attentions, named MossFormer (Monaural speech separation TransFormer). To effectively solve the indirect elemental interactions across chunks in the dual-path architecture, MossFormer employs a joint local and global self-attention architecture that simultaneously performs a full-computation self-attention on local chunks and a linearised low-cost self-attention over the full sequence. The joint attention enables MossFormer to model full-sequence elemental interactions directly. In addition, we employ a powerful attentive gating mechanism with simplified single-head self-attentions. Besides the attentive long-range modelling, we also augment MossFormer with convolutions for position-wise local pattern modelling. As a consequence, MossFormer significantly outperforms previous models and achieves state-of-the-art results on the WSJ0-2/3mix and WHAM!/WHAMR! benchmarks. Our model achieves the SI-SDRi upper bound of 21.2 dB on WSJ0-3mix and is only 0.3 dB below the upper bound of 23.1 dB on WSJ0-2mix.
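The joint local and global attention described in the abstract can be pictured with a minimal PyTorch sketch: full softmax self-attention inside fixed-size chunks, plus a linearised attention over the whole sequence, combined additively. This is not the official MossFormer implementation; the chunk size, the ELU+1 kernel for the linearised branch, the additive combination, and all module names are assumptions, and the gating mechanism and convolution modules are omitted.

```python
# Minimal sketch of joint local (chunk-wise) and global (linearised) attention.
# Not the official MossFormer code; names and design details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointLocalGlobalAttention(nn.Module):
    def __init__(self, dim: int, chunk_size: int = 256):
        super().__init__()
        self.chunk_size = chunk_size
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); pad so the sequence splits evenly into chunks
        b, n, d = x.shape
        c = self.chunk_size
        pad = (c - n % c) % c
        x = F.pad(x, (0, 0, 0, pad))
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)

        # Local branch: quadratic softmax attention within each chunk
        def split(t):
            return t.reshape(b, -1, c, d)  # (batch, n_chunks, chunk, dim)

        ql, kl, vl = map(split, (q, k, v))
        local = F.softmax(ql @ kl.transpose(-1, -2) / d ** 0.5, dim=-1) @ vl
        local = local.reshape(b, -1, d)

        # Global branch: linearised attention over the full sequence,
        # O(n) via the phi(q) (phi(k)^T v) factorisation with an ELU+1 kernel.
        # Padding is not masked here, for simplicity.
        qg, kg = F.elu(q) + 1, F.elu(k) + 1
        kv = kg.transpose(-1, -2) @ v                                   # (batch, dim, dim)
        norm = qg @ kg.sum(dim=1, keepdim=True).transpose(-1, -2) + 1e-6
        global_ = (qg @ kv) / norm

        out = self.to_out(local + global_)
        return out[:, :n]                                               # drop padding


if __name__ == "__main__":
    layer = JointLocalGlobalAttention(dim=64, chunk_size=16)
    y = layer(torch.randn(2, 100, 64))
    print(y.shape)  # torch.Size([2, 100, 64])
```

The local branch captures fine-grained interactions within each chunk at full resolution, while the linearised global branch lets every position attend to the whole sequence at linear cost, which is the combination the paper uses to avoid the indirect cross-chunk interactions of dual-path models.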

Code Repositories

alibabasglab/mossformer (PyTorch)
modelscope/ClearerVoice-Studio (Official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
speech-separation-on-wham | MossFormer (L) + DM | SI-SDRi: 17.3
speech-separation-on-whamr | MossFormer (L) + DM | SI-SDRi: 16.3
speech-separation-on-wsj0-2mix | MossFormer (L) + DM | SI-SDRi: 22.8; MACs (G): 86.1; Number of parameters (M): 42.1
speech-separation-on-wsj0-2mix | MossFormer (M) + DM | SI-SDRi: 22.5
speech-separation-on-wsj0-2mix-16k | MossFormer2 | SI-SDRi: 20.5
speech-separation-on-wsj0-3mix | MossFormer (M) + DM | SI-SDRi: 20.8
speech-separation-on-wsj0-3mix | MossFormer (L) + DM | SI-SDRi: 21.2
