Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement
Lu Ye-Xin ; Ai Yang ; Ling Zhen-Hua

Abstract
Phase information has a significant impact on speech perceptual quality and intelligibility. However, existing speech enhancement methods encounter limitations in explicit phase estimation due to the non-structural nature and wrapping characteristics of the phase, leading to a bottleneck in enhanced speech quality. To overcome this issue, we propose MP-SENet, a novel Speech Enhancement Network that explicitly enhances Magnitude and Phase spectra in parallel. The proposed MP-SENet comprises a Transformer-embedded encoder-decoder architecture. The encoder encodes the input distorted magnitude and phase spectra into time-frequency representations, which are then fed into time-frequency Transformers that alternately capture time and frequency dependencies. The decoder comprises a magnitude mask decoder and a phase decoder, which directly enhance the magnitude and wrapped phase spectra by incorporating a magnitude masking architecture and a phase parallel estimation architecture, respectively. Multi-level loss functions explicitly defined on the magnitude spectra, wrapped phase spectra, and short-time complex spectra are adopted to jointly train the MP-SENet model. A metric discriminator is further employed to compensate for the incomplete correlation between these losses and human auditory perception. Experimental results demonstrate that the proposed MP-SENet achieves state-of-the-art performance across multiple speech enhancement tasks, including speech denoising, dereverberation, and bandwidth extension. Compared to existing phase-aware speech enhancement methods, it further mitigates the compensation effect between the magnitude and phase through explicit phase estimation, improving the perceptual quality of the enhanced speech.
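The abstract names phase wrapping, a magnitude mask, and a parallel phase estimation architecture without giving formulas. The following is a minimal, single-bin sketch of those ideas under stated assumptions, not the authors' implementation. It assumes the phase is recovered with a two-argument arctangent from two parallel real-valued outputs, and that phase error is measured on the wrapped difference. All function names (`wrap_phase`, `anti_wrapping_distance`, `phase_from_two_outputs`, `enhance_bin`) are hypothetical and chosen for illustration.

```python
import math


def wrap_phase(phi):
    """Map any angle to its principal value in [-pi, pi)."""
    two_pi = 2.0 * math.pi
    return phi - two_pi * math.floor((phi + math.pi) / two_pi)


def anti_wrapping_distance(phi_est, phi_ref):
    """Phase error that ignores whole multiples of 2*pi (assumed form)."""
    return abs(wrap_phase(phi_est - phi_ref))


def phase_from_two_outputs(pseudo_real, pseudo_imag):
    """Wrapped phase from two parallel real-valued outputs (assumption:
    the two outputs are combined with atan2, which yields a value in
    [-pi, pi] directly)."""
    return math.atan2(pseudo_imag, pseudo_real)


def enhance_bin(noisy_mag, mask, pseudo_real, pseudo_imag):
    """Combine a masked magnitude and an estimated phase into one
    complex time-frequency bin."""
    mag = noisy_mag * mask  # magnitude masking
    phase = phase_from_two_outputs(pseudo_real, pseudo_imag)
    return complex(mag * math.cos(phase), mag * math.sin(phase))


# Two phases that are close on the circle but far apart numerically.
naive_error = abs(3.1 - (-3.1))                   # 6.2
wrapped_error = anti_wrapping_distance(3.1, -3.1)  # about 0.083
```

The comparison at the end illustrates why wrapping makes direct phase regression awkward: phases of 3.1 and -3.1 rad sit almost at the same point on the unit circle, yet a naive absolute difference treats them as nearly a full turn apart.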
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| speech-enhancement-on-deep-noise-suppression | MP-SENet | PESQ-NB: 3.92; PESQ-WB: 3.62; SI-SDR-WB: 21.03 |
| speech-enhancement-on-demand | MP-SENet | CBAK: 3.99; COVL: 4.34; CSIG: 4.81; PESQ-WB: 3.60; Params (M): 2.26; STOI: 0.96 |