HyperAIHyperAI

Command Palette

Search for a command to run...

3 months ago

Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis

Hubert Siuzdak

Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis

Abstract

Recent advancements in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the time-domain. While effective, this approach neglects the inductive bias offered by time-frequency representations, resulting in reduntant and computionally-intensive upsampling operations. Fourier-based time-frequency representation is an appealing alternative, aligning more accurately with human auditory perception, and benefitting from well-established fast algorithms for its computation. Nevertheless, direct reconstruction of complex-valued spectrograms has been historically problematic, primarily due to phase recovery issues. This study seeks to close this gap by presenting Vocos, a new model that directly generates Fourier spectral coefficients. Vocos not only matches the state-of-the-art in audio quality, as demonstrated in our evaluations, but it also substantially improves computational efficiency, achieving an order of magnitude increase in speed compared to prevailing time-domain neural vocoding approaches. The source code and model weights have been open-sourced at https://github.com/gemelo-ai/vocos.

Code Repositories

whisperspeech/whisperspeech
pytorch
Mentioned in GitHub
gemelo-ai/vocos
Official
pytorch
Mentioned in GitHub
collabora/whisperspeech
pytorch
Mentioned in GitHub

Benchmarks

BenchmarkMethodologyMetrics
speech-synthesis-on-librittsVocos
PESQ: 3.70
Periodicity: 0.101
V/UV F1: 0.9582

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding
Ready-to-use GPUs
Best Pricing
Get Started

Hyper Newsletters

Subscribe to our latest updates
We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning
Powered by MailChimp
Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis | Papers | HyperAI