6 months ago

Xu Tan, Jiawei Chen, Haohe Liu, Jian Cong, Chen Zhang, Yanqing Liu, Xi Wang, Yichong Leng, Yuanhao Yi, Lei He, and four others

Abstract

Text to speech (TTS) has made rapid progress in both academia and industry in recent years. Some questions naturally arise: whether a TTS system can achieve human-level quality, how to define and judge that quality, and how to achieve it. In this paper, we answer these questions by first defining human-level quality based on the statistical significance of subjective measures and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset. Specifically, we leverage a variational autoencoder (VAE) for end-to-end text-to-waveform generation, with several key modules to enhance the capacity of the prior from text and reduce the complexity of the posterior from speech, including phoneme pre-training, differentiable duration modeling, bidirectional prior/posterior modeling, and a memory mechanism in the VAE. Experimental evaluations on the popular LJSpeech dataset show that NaturalSpeech achieves a CMOS (comparative mean opinion score) of -0.01 relative to human recordings at the sentence level, with a Wilcoxon signed-rank test at p ≫ 0.05, demonstrating for the first time on this dataset no statistically significant difference from human recordings.
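The judging criterion above (a CMOS close to zero plus a Wilcoxon signed-rank test that finds no significant difference) can be illustrated with a minimal pure-Python sketch. It assumes each rating is a paired preference score on a -3..+3 scale, where positive means the synthesized sample was preferred over the recording, and it uses the normal approximation with a tie correction rather than the exact null distribution. The function names and sample ratings are hypothetical, not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def cmos(scores: Sequence[float]) -> float:
    """Comparative MOS: the mean of per-judgment preference scores.

    Assumption: scores lie on a -3..+3 scale, positive favoring synthesis.
    """
    return sum(scores) / len(scores)


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_signed_rank(diffs: Sequence[float]) -> Tuple[float, float]:
    """Two-sided Wilcoxon signed-rank test of H0: median difference is 0.

    Zero differences are dropped (Wilcoxon's original convention).
    Returns (W_plus, p_value) using the normal approximation.
    """
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    if n == 0:
        return 0.0, 1.0
    ranks = average_ranks([abs(d) for d in nonzero])
    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    mean = n * (n + 1) / 4
    # Tie correction: subtract (t^3 - t) / 48 for each group of t tied |d|.
    counts = {}
    for d in nonzero:
        counts[abs(d)] = counts.get(abs(d), 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values()) / 48
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term
    if var <= 0:
        return w_plus, 1.0
    z = (w_plus - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return w_plus, p


if __name__ == "__main__":
    # Hypothetical per-judgment ratings (synthesized vs. human recording).
    ratings = [0, 1, -1, 0, 2, -2, 1, -1, 0, 0]
    w, p = wilcoxon_signed_rank(ratings)
    print(f"CMOS={cmos(ratings):.2f}, W+={w}, p={p:.3f}")
```

With these made-up ratings the script prints `CMOS=0.00, W+=10.5, p=1.000`, so no significant difference would be declared. For small real studies, `scipy.stats.wilcoxon` can compute exact p-values instead of relying on the normal approximation used here.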

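The abstract describes a VAE whose prior is predicted from text and whose posterior is inferred from speech, with training pulling the two together. The sketch below shows only the core of that objective under simplifying assumptions: both distributions are diagonal Gaussians parameterized by mean and log-variance, and the reconstruction error is a scalar supplied by a hypothetical waveform decoder. It deliberately omits phoneme pre-training, differentiable duration modeling, bidirectional prior/posterior modeling, and the memory mechanism named in the abstract, so it is not NaturalSpeech's actual loss.

```python
import math
import random
from typing import List, Sequence


def gaussian_kl(mu_q: Sequence[float], logvar_q: Sequence[float],
                mu_p: Sequence[float], logvar_p: Sequence[float]) -> float:
    """KL( N(mu_q, exp(logvar_q)) || N(mu_p, exp(logvar_p)) ), summed over dims.

    q plays the role of the posterior from speech, p the prior from text.
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def reparameterize(mu: Sequence[float], logvar: Sequence[float],
                   rng: random.Random) -> List[float]:
    """Sample z = mu + sigma * eps so gradients could flow through mu and sigma."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def negative_elbo(recon_error: float, mu_q: Sequence[float], logvar_q: Sequence[float],
                  mu_p: Sequence[float], logvar_p: Sequence[float]) -> float:
    """Training loss: reconstruction error plus KL(posterior || prior).

    recon_error is assumed to come from a hypothetical decoder applied to z.
    """
    return recon_error + gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical 4-dim posterior (from speech) and prior (from text) statistics.
    mu_q, logvar_q = [0.2, -0.1, 0.5, 0.0], [-1.0, -1.2, -0.8, -1.0]
    mu_p, logvar_p = [0.0, 0.0, 0.4, 0.1], [-0.5, -0.5, -0.5, -0.5]
    z = reparameterize(mu_q, logvar_q, rng)  # would be fed to the decoder
    print("z =", [round(v, 3) for v in z])
    print("KL =", round(gaussian_kl(mu_q, logvar_q, mu_p, logvar_p), 4))
```

A stronger prior (closer to the posterior) lowers the KL term, which is one way to read the abstract's goal of enhancing the prior from text while reducing the complexity of the posterior from speech.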

NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality
