TF-Locoformer: Transformer with Local Modeling by Convolution for Speech Separation and Enhancement
Kohei Saijo; Gordon Wichern; François G. Germain; Zexu Pan; Jonathan Le Roux

Abstract
Time-frequency (TF) domain dual-path models achieve high-fidelity speech separation. While some previous state-of-the-art (SoTA) models rely on RNNs, this reliance means they lack the parallelizability, scalability, and versatility of Transformer blocks. Given the wide-ranging success of pure Transformer-based architectures in other fields, in this work we focus on removing the RNN from TF-domain dual-path models, while maintaining SoTA performance. This work presents TF-Locoformer, a Transformer-based model with LOcal-modeling by COnvolution. The model uses feed-forward networks (FFNs) with convolution layers, instead of linear layers, to capture local information, letting the self-attention focus on capturing global patterns. We place two such FFNs before and after self-attention to enhance the local-modeling capability. We also introduce a novel normalization for TF-domain dual-path models. Experiments on separation and enhancement datasets show that the proposed model meets or exceeds SoTA in multiple benchmarks with an RNN-free architecture.
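The block structure described in the abstract (a convolutional FFN, then self-attention, then a second convolutional FFN) can be sketched in NumPy as below. This is a minimal illustration of the data flow along one axis of a dual-path model, not the authors' implementation. The residual connections, ReLU activation, single-head attention, kernel size, and the names `conv_ffn`, `self_attention`, `locoformer_like_block`, and `init_params` are all assumptions made for the example. The paper's normalization and exact FFN design are not described in the abstract and are omitted here.

```python
import numpy as np

def conv1d_same(x, w, b):
    """1-D convolution along the sequence axis with zero 'same' padding.
    x: (T, C_in), w: (K, C_in, C_out), b: (C_out,) -> (T, C_out)."""
    K, T = w.shape[0], x.shape[0]
    pad_left = (K - 1) // 2
    xp = np.pad(x, ((pad_left, K - 1 - pad_left), (0, 0)))
    out = np.zeros((T, w.shape[2]))
    for k in range(K):
        # Each kernel tap mixes channels of a shifted copy of the sequence.
        out += xp[k:k + T] @ w[k]
    return out + b

def conv_ffn(x, p):
    """FFN whose two layers are convolutions instead of linear layers (assumed ReLU)."""
    h = np.maximum(conv1d_same(x, p["w1"], p["b1"]), 0.0)
    return conv1d_same(h, p["w2"], p["b2"])

def self_attention(x, p):
    """Single-head scaled dot-product self-attention over the whole sequence."""
    q, k, v = x @ p["wq"], x @ p["wk"], x @ p["wv"]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return (attn @ v) @ p["wo"]

def locoformer_like_block(x, params):
    """Conv-FFN -> self-attention -> Conv-FFN, each with an assumed residual."""
    x = x + conv_ffn(x, params["ffn_in"])      # local modeling
    x = x + self_attention(x, params["attn"])  # global modeling
    x = x + conv_ffn(x, params["ffn_out"])     # local modeling
    return x

def init_params(rng, dim, hidden, kernel):
    """Random weights for the sketch; sizes are arbitrary illustration values."""
    def ffn():
        return {"w1": rng.normal(0, 0.1, (kernel, dim, hidden)), "b1": np.zeros(hidden),
                "w2": rng.normal(0, 0.1, (kernel, hidden, dim)), "b2": np.zeros(dim)}
    attn = {name: rng.normal(0, 0.1, (dim, dim)) for name in ("wq", "wk", "wv", "wo")}
    return {"ffn_in": ffn(), "attn": attn, "ffn_out": ffn()}

rng = np.random.default_rng(0)
params = init_params(rng, dim=8, hidden=16, kernel=3)
x = rng.normal(size=(20, 8))          # 20 frames (or frequency bins), 8 channels
y = locoformer_like_block(x, params)  # same shape as the input
```

In a dual-path model, a block like this would typically be applied once along the frequency axis and once along the time axis. With two kernel-size-3 convolutions, each output of `conv_ffn` depends on only five neighboring positions, whereas the attention step mixes all positions.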
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| speech-enhancement-on-deep-noise-suppression | TF-Locoformer (M) | FLOPS (G): 497.24; Number of parameters (M): 15; PESQ-WB: 3.72; SI-SDR-WB: 23.3; STOI: 98.8 |
| speech-separation-on-libri2mix | TF-Locoformer (M) | Number of parameters (M): 15; SDRi: 22.2; SI-SDRi: 22.1 |
| speech-separation-on-whamr | TF-Locoformer (M) | Number of parameters (M): 15; SDRi: 16.9; SI-SDRi: 18.5 |
| speech-separation-on-whamr | TF-Locoformer (S) | Number of parameters (M): 5; SDRi: 15.9; SI-SDRi: 17.4 |
| speech-separation-on-wsj0-2mix | TF-Locoformer (S) + DM | Number of parameters (M): 5.0; SDRi: 23; SI-SDRi: 22.8 |
| speech-separation-on-wsj0-2mix | TF-Locoformer (M) | Number of parameters (M): 15.0; SDRi: 23.8; SI-SDRi: 23.6 |
| speech-separation-on-wsj0-2mix | TF-Locoformer (L) + DM | Number of parameters (M): 22.5; SDRi: 25.2; SI-SDRi: 25.1 |
| speech-separation-on-wsj0-2mix | TF-Locoformer (S) | Number of parameters (M): 5.0; SDRi: 22.1; SI-SDRi: 22 |
| speech-separation-on-wsj0-2mix | TF-Locoformer (M) + DM | Number of parameters (M): 15.0; SDRi: 24.7; SI-SDRi: 24.6 |
| speech-separation-on-wsj0-2mix | TF-Locoformer (L) | Number of parameters (M): 22.5; SDRi: 24.3; SI-SDRi: 24.2 |
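
For readers unfamiliar with the SI-SDRi metric reported above, the following is a minimal sketch of the standard scale-invariant SDR definition and its improvement over the unprocessed mixture, in decibels. It uses plain Python lists for clarity; the function names `si_sdr` and `si_sdr_improvement` and the toy signals are assumptions for illustration and are not taken from the paper's evaluation code, which may differ in details such as mean removal or permutation handling.

```python
import math

def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB.
    The reference is rescaled by the optimal gain before measuring the error."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    scale = dot / ref_energy
    target = [scale * r for r in reference]            # projection onto the reference
    noise = [e - t for e, t in zip(estimate, target)]  # everything else is distortion
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10(target_energy / noise_energy)

def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: how much the estimate improves on the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)

# Toy signals: the interference is exactly orthogonal to the reference.
reference = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
interference = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
mixture = [r + i for r, i in zip(reference, interference)]
estimate = [r + 0.1 * i for r, i in zip(reference, interference)]

# Interference amplitude drops by 10x, so the improvement is about 20 dB.
improvement = si_sdr_improvement(estimate, mixture, reference)
```

Because the reference is rescaled before the error is measured, multiplying the estimate by any nonzero gain leaves the score unchanged, which is what makes the metric scale-invariant.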