5 months ago

TF-Locoformer: Transformer with Local Modeling by Convolution for Speech Separation and Enhancement

Kohei Saijo; Gordon Wichern; François G. Germain; Zexu Pan; Jonathan Le Roux

Abstract

Time-frequency (TF) domain dual-path models achieve high-fidelity speech separation. While some previous state-of-the-art (SoTA) models rely on RNNs, this reliance means they lack the parallelizability, scalability, and versatility of Transformer blocks. Given the wide-ranging success of pure Transformer-based architectures in other fields, in this work we focus on removing the RNN from TF-domain dual-path models, while maintaining SoTA performance. This work presents TF-Locoformer, a Transformer-based model with LOcal-modeling by COnvolution. The model uses feed-forward networks (FFNs) with convolution layers, instead of linear layers, to capture local information, letting the self-attention focus on capturing global patterns. We place two such FFNs before and after self-attention to enhance the local-modeling capability. We also introduce a novel normalization for TF-domain dual-path models. Experiments on separation and enhancement datasets show that the proposed model meets or exceeds SoTA in multiple benchmarks with an RNN-free architecture.
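The block structure described in the abstract (a convolutional FFN, then self-attention, then a second convolutional FFN) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the ReLU activation, single-head attention, plain residual connections, the absence of normalization, and all names (`conv_ffn`, `locoformer_like_block`, `init_params`) are placeholders chosen here for brevity. The abstract does not specify these details.

```python
import numpy as np

def conv1d_same(x, w, b):
    """1-D convolution along the sequence axis with zero 'same' padding.
    x: (T, C_in), w: (K, C_in, C_out), b: (C_out,) -> (T, C_out)."""
    k = w.shape[0]
    left = (k - 1) // 2
    xp = np.pad(x, ((left, k - 1 - left), (0, 0)))
    out = np.zeros((x.shape[0], w.shape[2]))
    for i in range(k):
        # Each kernel tap mixes channels of a shifted copy of the sequence.
        out += xp[i:i + x.shape[0]] @ w[i]
    return out + b

def conv_ffn(x, p):
    """FFN whose two linear layers are replaced by convolutions (local modeling).
    ReLU is an assumption made for this sketch."""
    h = np.maximum(conv1d_same(x, p["w1"], p["b1"]), 0.0)
    return conv1d_same(h, p["w2"], p["b2"])

def self_attention(x, p):
    """Single-head scaled dot-product self-attention (global modeling)."""
    q, k, v = x @ p["wq"], x @ p["wk"], x @ p["wv"]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return (attn @ v) @ p["wo"]

def locoformer_like_block(x, p):
    """Conv-FFN -> self-attention -> Conv-FFN, each with a residual connection."""
    x = x + conv_ffn(x, p["ffn_pre"])
    x = x + self_attention(x, p["attn"])
    x = x + conv_ffn(x, p["ffn_post"])
    return x

def init_params(rng, dim, hidden, kernel):
    """Random parameters for the sketch (hypothetical sizes, not from the paper)."""
    def ffn():
        return {"w1": rng.normal(0, 0.1, (kernel, dim, hidden)), "b1": np.zeros(hidden),
                "w2": rng.normal(0, 0.1, (kernel, hidden, dim)), "b2": np.zeros(dim)}
    attn = {name: rng.normal(0, 0.1, (dim, dim)) for name in ("wq", "wk", "wv", "wo")}
    return {"ffn_pre": ffn(), "attn": attn, "ffn_post": ffn()}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = init_params(rng, dim=8, hidden=16, kernel=3)
    frames = rng.normal(size=(20, 8))  # 20 positions, 8 channels
    print(locoformer_like_block(frames, params).shape)
```

In a dual-path model, a block of this kind would be applied along one axis of the TF representation (for example frequency) and then along the other (time). With kernel size 3, each convolutional FFN in this sketch only sees positions within two steps of the current one, which leaves long-range dependencies to the attention layer.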

Benchmarks

Speech enhancement

| Benchmark | Model | Parameters (M) | FLOPS (G) | PESQ-WB | SI-SDR-WB | STOI |
|---|---|---|---|---|---|---|
| speech-enhancement-on-deep-noise-suppression | TF-Locoformer (M) | 15 | 497.24 | 3.72 | 23.3 | 98.8 |

Speech separation

| Benchmark | Model | Parameters (M) | SDRi (dB) | SI-SDRi (dB) |
|---|---|---|---|---|
| speech-separation-on-libri2mix | TF-Locoformer (M) | 15.0 | 22.2 | 22.1 |
| speech-separation-on-whamr | TF-Locoformer (S) | 5.0 | 15.9 | 17.4 |
| speech-separation-on-whamr | TF-Locoformer (M) | 15.0 | 16.9 | 18.5 |
| speech-separation-on-wsj0-2mix | TF-Locoformer (S) | 5.0 | 22.1 | 22.0 |
| speech-separation-on-wsj0-2mix | TF-Locoformer (S) + DM | 5.0 | 23.0 | 22.8 |
| speech-separation-on-wsj0-2mix | TF-Locoformer (M) | 15.0 | 23.8 | 23.6 |
| speech-separation-on-wsj0-2mix | TF-Locoformer (M) + DM | 15.0 | 24.7 | 24.6 |
| speech-separation-on-wsj0-2mix | TF-Locoformer (L) | 22.5 | 24.3 | 24.2 |
| speech-separation-on-wsj0-2mix | TF-Locoformer (L) + DM | 22.5 | 25.2 | 25.1 |
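For readers unfamiliar with the SI-SDRi metric reported in the separation benchmarks, the sketch below shows how it is commonly computed: the estimate is projected onto the reference to obtain a scaled target, the residual is treated as error, and the improvement is the gain over using the unprocessed mixture as the estimate. It assumes zero-mean signals of equal length given as plain Python lists. The function names are illustrative, and this is not the evaluation code behind the numbers on this page.

```python
import math

def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB (signals assumed zero-mean)."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    scale = dot / ref_energy                       # optimal scaling of the reference
    target = [scale * r for r in reference]        # part of the estimate explained by the reference
    error = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    error_energy = sum(n * n for n in error)
    return 10.0 * math.log10(target_energy / error_energy)

def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: SI-SDR of the separated estimate minus SI-SDR of the input mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)

if __name__ == "__main__":
    # Toy signals: the interference [1, 1, 1, 1] is orthogonal to the reference.
    reference = [1.0, -1.0, 1.0, -1.0]
    mixture = [2.0, 0.0, 2.0, 0.0]      # reference + full interference
    estimate = [1.1, -0.9, 1.1, -0.9]   # reference + 10% of the interference
    print(round(si_sdr_improvement(estimate, mixture, reference), 3))
```

Because the reference is rescaled before the error is measured, multiplying the estimate by any non-zero constant leaves the score unchanged, which is what makes the metric scale-invariant.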
