Boosting Unknown-number Speaker Separation with Transformer Decoder-based Attractor

Younglo Lee, Shukjae Choi, Byeong-Yeol Kim, Zhong-Qiu Wang, Shinji Watanabe

Abstract

We propose a novel speech separation model designed to separate mixtures with an unknown number of speakers. The proposed model stacks 1) a dual-path processing block that can model spectro-temporal patterns, 2) a transformer decoder-based attractor (TDA) calculation module that can deal with an unknown number of speakers, and 3) triple-path processing blocks that can model inter-speaker relations. Given a fixed, small set of learned speaker queries and the mixture embedding produced by the dual-path blocks, TDA infers the relations among these queries and generates an attractor vector for each speaker. The estimated attractors are then combined with the mixture embedding through feature-wise linear modulation (FiLM) conditioning, which creates a speaker dimension. The mixture embedding, conditioned with the speaker information produced by TDA, is fed to the final triple-path blocks, which augment the dual-path blocks with an additional pathway dedicated to inter-speaker processing. The proposed approach outperforms the best results previously reported in the literature, achieving SI-SDR improvements (SI-SDRi) of 24.0 dB on WSJ0-2mix and 23.7 dB on WSJ0-3mix with a single model trained to separate both 2- and 3-speaker mixtures. The proposed model also exhibits strong performance and generalizability in counting sources and in separating mixtures with up to 5 speakers.
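The attractor calculation and FiLM conditioning described in the abstract can be sketched in NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation: one single-head self-attention step over the queries plus one cross-attention step stand in for the full transformer decoder, all weights are random rather than learned, the mixture embedding is treated as a flattened (T, D) sequence, and the triple-path blocks are not shown. The names `tda_attractors`, `film_condition` and `estimate_speaker_count`, the sigmoid existence-probability head and its 0.5 threshold are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_in, kv_in, w_q, w_k, w_v):
    # Single-head scaled dot-product attention: rows of q_in attend over rows of kv_in.
    q, k, v = q_in @ w_q, kv_in @ w_k, kv_in @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def tda_attractors(queries, mixture, params):
    """Simplified TDA: (C, D) speaker queries + (T, D) mixture embedding -> (C, D) attractors."""
    # Self-attention among the queries models their relations (with a residual connection).
    h = queries + attention(queries, queries, *params["self"])
    # Cross-attention lets each query gather speaker evidence from the mixture embedding.
    return h + attention(h, mixture, *params["cross"])

def film_condition(mixture, attractors, w_gamma, w_beta):
    """FiLM: per-attractor scale and shift of the mixture embedding -> (C, T, D)."""
    gamma = attractors @ w_gamma  # (C, D) feature-wise scales
    beta = attractors @ w_beta    # (C, D) feature-wise shifts
    # Broadcasting adds a leading speaker dimension of size C.
    return gamma[:, None, :] * mixture[None, :, :] + beta[:, None, :]

def estimate_speaker_count(attractors, w_exist, threshold=0.5):
    """Hypothetical counting head: sigmoid existence probability per attractor."""
    probs = 1.0 / (1.0 + np.exp(-(attractors @ w_exist)))  # (C,)
    return int((probs > threshold).sum())

# Toy run with random (untrained) weights; all sizes are arbitrary.
rng = np.random.default_rng(0)
C, T, D = 5, 100, 16  # number of queries, flattened positions, feature size

def init(*shape):
    return rng.standard_normal(shape) / np.sqrt(shape[0])

params = {"self": [init(D, D) for _ in range(3)], "cross": [init(D, D) for _ in range(3)]}
queries = init(C, D)
mixture = rng.standard_normal((T, D))
attractors = tda_attractors(queries, mixture, params)
conditioned = film_condition(mixture, attractors, init(D, D), init(D, D))
count = estimate_speaker_count(attractors, init(D))
print(attractors.shape, conditioned.shape)  # (5, 16) (5, 100, 16)
```

The key design point the sketch tries to show is that the number of speaker-conditioned copies of the mixture embedding is set by the attractors, not by a fixed output layer, which is what lets one model handle varying speaker counts.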

Benchmarks

Benchmark | Method | SI-SDRi (dB)
speech-separation-on-wsj0-2mix | SepTDA (L=12) | 24.0
speech-separation-on-wsj0-3mix | SepTDA | 23.7
speech-separation-on-wsj0-4mix | SepTDA | 22.0
speech-separation-on-wsj0-5mix | SepTDA | 21.0
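SI-SDRi measures how much the separated signals improve on the unprocessed mixture in scale-invariant SDR. A minimal NumPy sketch of how such a number is typically computed, assuming the common conventions of zero-mean signals, a permutation-invariant match between estimates and references, and averaging over speakers; the function names and the `eps` guard are illustrative, and the benchmarks' exact evaluation scripts may differ.

```python
import itertools
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    # Zero-mean both signals, then project the estimate onto the reference.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference   # scaled reference (scale invariance)
    noise = estimate - target    # everything not explained by the reference
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

def si_sdri(estimates, references, mixture):
    """Mean SI-SDR improvement under the best speaker permutation.

    estimates, references: (N, samples); mixture: (samples,)
    """
    n = len(references)
    # Output order is arbitrary, so score every assignment and keep the best one.
    best = max(
        np.mean([si_sdr(estimates[p], references[i]) for i, p in enumerate(perm)])
        for perm in itertools.permutations(range(n))
    )
    # Baseline: the mixture itself used as the estimate for every speaker.
    baseline = np.mean([si_sdr(mixture, ref) for ref in references])
    return best - baseline

# Toy example: two random "speakers", estimates in swapped order with slight noise.
rng = np.random.default_rng(0)
refs = rng.standard_normal((2, 16000))
mix = refs.sum(axis=0)
est = refs[::-1] + 0.01 * rng.standard_normal((2, 16000))
improvement = si_sdri(est, refs, mix)
print(f"SI-SDRi: {improvement:.1f} dB")
```

The brute-force permutation search grows factorially with the number of speakers, which is still cheap for the 2- to 5-speaker settings listed above (at most 120 permutations).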