Abstract

Phase information has a significant impact on the perceptual quality and intelligibility of speech. However, existing speech enhancement methods are limited in explicit phase estimation by the non-structural nature and wrapping characteristics of the phase, which creates a bottleneck in enhanced speech quality. To overcome this issue, we propose MP-SENet, a novel Speech Enhancement Network that explicitly enhances Magnitude and Phase spectra in parallel. The proposed MP-SENet comprises a Transformer-embedded encoder-decoder architecture. The encoder encodes the input distorted magnitude and phase spectra into time-frequency representations, which are then fed into time-frequency Transformers that alternately capture time and frequency dependencies. The decoder comprises a magnitude mask decoder and a phase decoder, which directly enhance the magnitude and wrapped phase spectra by incorporating a magnitude masking architecture and a phase parallel estimation architecture, respectively. Multi-level loss functions explicitly defined on the magnitude spectra, wrapped phase spectra, and short-time complex spectra are adopted to jointly train the MP-SENet model. A metric discriminator is further employed to compensate for the incomplete correlation between these losses and human auditory perception. Experimental results demonstrate that the proposed MP-SENet achieves state-of-the-art performance across multiple speech enhancement tasks, including speech denoising, dereverberation, and bandwidth extension. Compared to existing phase-aware speech enhancement methods, it further mitigates the compensation effect between the magnitude and phase through explicit phase estimation, elevating the perceptual quality of enhanced speech.
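
To illustrate why the wrapping characteristic makes phase hard to optimise directly, the following is a minimal sketch and not the paper's implementation. It assumes that a phase value is formed from two parallel outputs with a two-argument arctangent, that the magnitude is enhanced by an element-wise mask, and that phase errors are measured modulo 2π. The names `anti_wrap`, `wrapped_phase`, and `enhance_bin` are hypothetical and are used here for illustration only.

```python
import math


def anti_wrap(delta: float) -> float:
    """Reduce a phase difference to its smallest equivalent size in [0, pi].

    Assumption for illustration: errors are compared modulo 2*pi.
    """
    return abs(delta - 2.0 * math.pi * round(delta / (2.0 * math.pi)))


def wrapped_phase(pseudo_real: float, pseudo_imag: float) -> float:
    """Combine two parallel outputs into a wrapped phase in [-pi, pi]."""
    return math.atan2(pseudo_imag, pseudo_real)


def enhance_bin(noisy_mag: float, mask: float,
                pseudo_real: float, pseudo_imag: float) -> complex:
    """Rebuild one complex time-frequency bin from a masked magnitude
    and an explicitly estimated wrapped phase (hypothetical helper)."""
    mag = noisy_mag * mask
    phase = wrapped_phase(pseudo_real, pseudo_imag)
    return complex(mag * math.cos(phase), mag * math.sin(phase))


# Two phases on either side of the wrap point describe almost the same angle.
target = math.pi - 0.01
estimate = -math.pi + 0.01

naive_error = abs(estimate - target)            # close to 2*pi: misleadingly large
wrapped_error = anti_wrap(estimate - target)    # close to 0.02: the true angular gap

print(f"naive error:   {naive_error:.4f}")
print(f"wrapped error: {wrapped_error:.4f}")
```

In this sketch a plain absolute difference penalises an almost-correct estimate heavily, whereas the modulo-2π comparison does not, which is one reason a loss defined on wrapped phase needs to account for wrapping.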
