Awni Hannun; Carl Case; Jared Casper; Bryan Catanzaro; Greg Diamos; Erich Elsen; Ryan Prenger; Sanjeev Satheesh; Shubho Sengupta; Adam Coates; Andrew Y. Ng

Abstract
We present a state-of-the-art speech recognition system developed using end-to-end deep learning. Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, our system does not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learns a function that is robust to such effects. We do not need a phoneme dictionary, nor even the concept of a "phoneme." Key to our approach is a well-optimized RNN training system that uses multiple GPUs, as well as a set of novel data synthesis techniques that allow us to efficiently obtain a large amount of varied data for training. Our system, called Deep Speech, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set. Deep Speech also handles challenging noisy environments better than widely used, state-of-the-art commercial speech systems.
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| accented-speech-recognition-on-voxforge | Deep Speech | Percentage error: 45.35 |
| accented-speech-recognition-on-voxforge-1 | Deep Speech | Percentage error: 28.46 |
| accented-speech-recognition-on-voxforge-2 | Deep Speech | Percentage error: 31.20 |
| accented-speech-recognition-on-voxforge-3 | Deep Speech | Percentage error: 15.01 |
| noisy-speech-recognition-on-chime-clean | CNN + Bi-RNN + CTC (speech to letters) | Percentage error: 6.3 |
| noisy-speech-recognition-on-chime-real | CNN + Bi-RNN + CTC (speech to letters) | Percentage error: 67.94 |
| speech-recognition-on-swb_hub_500-wer | CNN + Bi-RNN + CTC (speech to letters), 25.9% WER if trained only on SWB | Percentage error: 16 |
| speech-recognition-on-switchboard-hub500 | Deep Speech + FSH | Percentage error: 12.6 |
| speech-recognition-on-switchboard-hub500 | CNN + Bi-RNN + CTC (speech to letters), 25.9% WER if trained only on SWB | Percentage error: 12.6 |
| speech-recognition-on-switchboard-hub500 | Deep Speech | Percentage error: 20 |
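
The "Percentage error" figures in the table are word error rates (WER), as the "25.9% WER" notes in the Methodology column indicate. As a minimal sketch of how such a figure is typically computed, assuming whitespace tokenization and equal cost for substitutions, insertions, and deletions, the function below divides the word-level edit distance between a reference transcript and a hypothesis by the number of reference words. The function name `word_error_rate` is hypothetical, and this is not the scoring tool used to produce the numbers above; official scoring may apply text normalization that is not modeled here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length.

    Assumption: words are split on whitespace with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 100% when the hypothesis is much longer than the reference.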