6 months ago

Abstract

In this paper, we present a novel two-pass approach that unifies streaming and non-streaming end-to-end (E2E) speech recognition in a single model. Our model adopts the hybrid CTC/attention architecture, with modified conformer layers in the encoder. We propose a dynamic chunk-based attention strategy that allows an arbitrary right context length. At inference time, the CTC decoder generates n-best hypotheses in a streaming fashion, and the inference latency can be controlled simply by changing the chunk size. The attention decoder then rescores the CTC hypotheses to produce the final result. This efficient rescoring process adds very little sentence-level latency. Our experiments on the open 170-hour AISHELL-1 dataset show that the proposed method unifies the streaming and non-streaming models simply and efficiently. On the AISHELL-1 test set, our unified model achieves a 5.60% relative character error rate (CER) reduction in non-streaming ASR compared to a standard non-streaming transformer. The same model achieves 5.42% CER with 640 ms latency in a streaming ASR system.
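The two ideas in the abstract, chunk-based attention and second-pass rescoring, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `chunk_attention_mask` and `rescore_nbest`, the `ctc_weight` parameter, and the choice to give every frame full left context are assumptions made for illustration; the abstract only states that the chunk size controls the right context and that the attention decoder rescores the CTC n-best list.

```python
from typing import Callable, List, Tuple


def chunk_attention_mask(num_frames: int, chunk_size: int) -> List[List[bool]]:
    """Return mask[i][j], True when frame i may attend to frame j.

    Assumption: a frame sees all earlier frames plus the rest of its own
    chunk, so the look-ahead is bounded by chunk_size.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    mask = []
    for i in range(num_frames):
        # Exclusive index of the last visible frame: the end of frame i's chunk.
        visible_end = min((i // chunk_size + 1) * chunk_size, num_frames)
        mask.append([j < visible_end for j in range(num_frames)])
    return mask


def rescore_nbest(
    nbest: List[Tuple[str, float]],
    attention_score: Callable[[str], float],
    ctc_weight: float = 0.0,
) -> str:
    """Pick the best first-pass hypothesis after second-pass rescoring.

    nbest holds (hypothesis, ctc_log_score) pairs from the streaming pass.
    attention_score is a stand-in for the attention decoder's log score.
    ctc_weight is a hypothetical interpolation weight, not from the source.
    """
    best = max(nbest, key=lambda h: attention_score(h[0]) + ctc_weight * h[1])
    return best[0]
```

With `chunk_size` equal to `num_frames`, the mask is fully open and the encoder behaves like a non-streaming model; smaller chunks shrink the look-ahead and therefore the latency, which is the trade-off the abstract describes.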

Audio and Speech Processing

Binbin Zhang Di Wu Zhuoyuan Yao Xiong Wang Fan Yu Chao Yang Liyong Guo Yaguang Hu Lei Xie Xin Lei

Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition
