Pham The Hieu ; Nguyen Phuong Thanh Tran ; Nguyen Xuan Tho ; Nguyen Tan Dat ; Nguyen Duc Dung

Abstract
Research on audio clue-based target speaker extraction (TSE) has mostly focused on modeling the mixture and reference speech, achieving high performance in English due to the availability of large datasets. However, less attention has been given to the consistent properties of human speech across languages. To bridge this gap, we introduce an alternative model which addresses the challenge of transferring TSE models from one language to another without fine-tuning. In this work, we propose a gating mechanism that is able to modify specific frequencies based on the speaker's acoustic features. The model achieves an SI-SDR of 17.3544 on clean English speech and 13.2032 on clean speech mixed with WHAM! noise, outperforming all other models in its ability to adapt to different languages.
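The abstract does not spell out how the gate is built, so the following is only a minimal sketch of the general idea of frequency-wise gating conditioned on a speaker embedding. The function `frequency_gate`, the linear projection (`weight`, `bias`), and the toy shapes are all assumptions made for illustration; they are not taken from the paper and should not be read as the WHYV architecture.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def frequency_gate(spectrogram, speaker_emb, weight, bias):
    """Scale each frequency bin by a gate derived from a speaker embedding.

    spectrogram: list of rows, one per frequency bin, each a list of frame values
    speaker_emb: list of floats describing the reference speaker (hypothetical)
    weight, bias: hypothetical projection from embedding to one logit per bin
    """
    # One gate value per frequency bin, computed from the speaker embedding.
    gate = [
        sigmoid(sum(w * e for w, e in zip(row, speaker_emb)) + b)
        for row, b in zip(weight, bias)
    ]
    # The same gate value is applied to every frame of its frequency bin.
    gated = [[g * value for value in frames] for g, frames in zip(gate, spectrogram)]
    return gated, gate


# Toy example: 3 frequency bins, 2 frames, 2-dimensional speaker embedding.
spectrogram = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
speaker_emb = [0.5, -1.0]
weight = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
bias = [0.0, 0.0, 0.0]

gated, gate = frequency_gate(spectrogram, speaker_emb, weight, bias)
```

In this toy setup the first bin is mostly passed through, the second is strongly attenuated, and the third (zero logit) is halved, which shows how a speaker-dependent gate can emphasize or suppress individual frequencies.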
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| speech-separation-on-libri2mix | WHYV | SDR: 17.2458; SI-SDRi: 17.5 |
| speech-separation-on-wham | WHYV | SI-SDRi: 12.964 |
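For readers unfamiliar with the metrics in the table, the sketch below computes SI-SDR using its standard definition (project the zero-mean estimate onto the zero-mean target, then compare the energy of the projection to the energy of the residual) and SI-SDRi as the gain over the unprocessed mixture. The helper names and the synthetic sine-wave signals are assumptions for illustration; this is not the evaluation code behind the reported numbers.

```python
import math


def _dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB."""
    # Remove the mean so a constant offset does not affect the score.
    est_mean = sum(estimate) / len(estimate)
    tgt_mean = sum(target) / len(target)
    est = [x - est_mean for x in estimate]
    tgt = [x - tgt_mean for x in target]
    # Optimal scaling of the target; this is what makes the metric scale-invariant.
    alpha = _dot(est, tgt) / (_dot(tgt, tgt) + eps)
    projection = [alpha * t for t in tgt]
    residual = [e - p for e, p in zip(est, projection)]
    ratio = (_dot(projection, projection) + eps) / (_dot(residual, residual) + eps)
    return 10.0 * math.log10(ratio)


def si_sdr_improvement(estimate, mixture, target):
    """SI-SDRi: SI-SDR of the estimate minus SI-SDR of the raw mixture."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)


# Synthetic stand-ins for speech: a target tone plus an interfering tone.
n = 8000
target = [math.sin(2 * math.pi * 220 * t / n) for t in range(n)]
interferer = [math.sin(2 * math.pi * 330 * t / n + 1.0) for t in range(n)]
mixture = [s + v for s, v in zip(target, interferer)]
# A hypothetical extractor output that leaves 10% of the interferer amplitude.
estimate = [s + 0.1 * v for s, v in zip(target, interferer)]

improvement = si_sdr_improvement(estimate, mixture, target)
```

Because the estimate is rescaled against the target before the comparison, multiplying the estimate by any nonzero constant leaves the score unchanged, which is why the metric is preferred over plain SDR when output gain is arbitrary.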