HyperAI
Wavesplit: End-to-End Speech Separation by Speaker Clustering

Neil Zeghidour, David Grangier

Abstract

We introduce Wavesplit, an end-to-end source separation system. From a single mixture, the model infers a representation for each source and then estimates each source signal given the inferred representations. The model is trained to jointly perform both tasks from the raw waveform. Wavesplit infers a set of source representations via clustering, which addresses the fundamental permutation problem of separation. For speech separation, our sequence-wide speaker representations provide a more robust separation of long, challenging recordings compared to prior work. Wavesplit redefines the state-of-the-art on clean mixtures of 2 or 3 speakers (WSJ0-2/3mix), as well as in noisy and reverberated settings (WHAM/WHAMR). We also set a new benchmark on the recent LibriMix dataset. Finally, we show that Wavesplit is also applicable to other domains, by separating fetal and maternal heart rates from a single abdominal electrocardiogram.
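The abstract notes that clustering sequence-wide speaker representations addresses the fundamental permutation problem: a separation model emits sources in arbitrary order, so a naive loss can penalize a perfect estimate that is merely permuted. As a point of contrast, here is a minimal NumPy sketch of the classic brute-force fix (permutation-invariant training); this is an illustration of the problem, not Wavesplit's actual objective:

```python
import itertools
import numpy as np

def permutation_invariant_mse(estimates, references):
    """Return the smallest mean-squared error over all source orderings.

    This O(n!) search over permutations is the classic workaround for the
    permutation problem. Wavesplit instead resolves the ordering once per
    recording by clustering its inferred speaker vectors.
    """
    n = len(references)
    best = float("inf")
    for perm in itertools.permutations(range(n)):
        # Average per-source MSE under this assignment of estimates to references.
        loss = sum(
            np.mean((estimates[p] - references[i]) ** 2)
            for i, p in enumerate(perm)
        ) / n
        best = min(best, loss)
    return best
```

With two or three speakers the search is cheap, but per-frame permutation decisions can still flip speakers mid-recording — the failure mode on long recordings that Wavesplit's sequence-wide representations are designed to avoid.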

Benchmarks

Benchmark                        Methodology    Metrics
Speech Separation on WHAMR       Wavesplit      SI-SDRi: 13.2
Speech Separation on WSJ0-2mix   Wavesplit v2   SDRi: 22.3, SI-SDRi: 22.2
Speech Separation on WSJ0-2mix   Wavesplit v1   SI-SDRi: 19.0
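SI-SDRi in the table above is the improvement in scale-invariant signal-to-distortion ratio over the unprocessed mixture. A minimal NumPy sketch of the underlying SI-SDR metric (the standard definition, not code from the paper):

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB: project the estimate onto the reference,
    then compare the energy of that target component to the residual."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    alpha = estimate @ reference / (reference @ reference + eps)  # optimal scale
    target = alpha * reference
    residual = estimate - target
    return 10 * np.log10((target @ target + eps) / (residual @ residual + eps))
```

A model's SI-SDRi on one recording is then `si_sdr(estimate, reference) - si_sdr(mixture, reference)`, averaged over the test set.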
