Ante Jukić; Roman Korostik; Jagadeesh Balam; Boris Ginsburg

Abstract
This paper proposes a generative speech enhancement model based on a Schrödinger bridge (SB). The proposed model employs a tractable SB to formulate a data-to-data process between the clean speech distribution and the observed noisy speech distribution. The model is trained with a data prediction loss that aims to recover the complex-valued clean speech coefficients, and an auxiliary time-domain loss is used to improve training. The effectiveness of the proposed SB-based model is evaluated on two speech enhancement tasks: speech denoising and speech dereverberation. The experimental results demonstrate that the proposed SB-based model outperforms diffusion-based models in terms of speech quality metrics and ASR performance, e.g., yielding a relative word error rate reduction of 20% for denoising and 6% for dereverberation compared to the best baseline model. The proposed model also demonstrates improved efficiency, achieving better quality than the baselines for the same number of sampling steps and at a reduced computational cost.
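To make the notion of relative word error rate (WER) reduction concrete, the following minimal sketch computes it from a baseline WER and a proposed-model WER. The function name `relative_wer_reduction` and the numeric values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, proposed_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are WERs in the same unit (e.g., percent).
    A positive result means the proposed model has a lower WER.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - proposed_wer) / baseline_wer


# Hypothetical values (assumption, not from the paper):
# a baseline WER of 10.0% and a proposed-model WER of 8.0%.
reduction = relative_wer_reduction(10.0, 8.0)
print(f"relative WER reduction: {reduction:.1f}%")
```

Note that a relative reduction is measured against the baseline WER, so a 20% relative reduction is not the same as a drop of 20 absolute percentage points.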
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| speech-enhancement-on-ears-wham | Schrödinger Bridge | DNSMOS: 3.83, ESTOI: 0.73, PESQ-WB: 2.33, POLQA: 3.46, SI-SDR: 17.85 dB, SIGMOS: 3.44 |
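
Of the metrics listed above, SI-SDR (scale-invariant signal-to-distortion ratio, in dB) has a simple closed form. The sketch below follows the commonly used definition with mean removal: the estimate is projected onto the reference, and the energy of that projection is compared with the energy of the residual. It is a plain-Python illustration, not necessarily the evaluation code used to produce the table; the function name `si_sdr` and the toy signals are assumptions.

```python
import math
from typing import Sequence


def si_sdr(estimate: Sequence[float], reference: Sequence[float]) -> float:
    """Scale-invariant SDR in dB between an estimate and a reference signal."""
    if len(estimate) != len(reference) or len(reference) == 0:
        raise ValueError("signals must be non-empty and of equal length")

    # Remove the mean of each signal (zero-mean variant of the definition).
    n = len(reference)
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    ref_energy = sum(x * x for x in ref)
    if ref_energy == 0.0:
        raise ValueError("reference has zero energy after mean removal")

    # Optimal scaling of the reference that best explains the estimate.
    alpha = sum(e * r for e, r in zip(est, ref)) / ref_energy
    target = [alpha * r for r in ref]
    residual = [e - t for e, t in zip(est, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    if residual_energy == 0.0:
        return math.inf
    if target_energy == 0.0:
        return -math.inf
    return 10.0 * math.log10(target_energy / residual_energy)


# Toy signals (hypothetical): a reference and a slightly perturbed estimate.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [1.1, -0.9, 0.9, -1.1]
print(f"SI-SDR: {si_sdr(estimate, reference):.2f} dB")
```

Because the reference is rescaled by `alpha` before the comparison, multiplying the estimate by any nonzero constant leaves the result unchanged, which is what makes the metric scale-invariant.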