SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network
William Chan, Daniel Park, Chris Lee, Yu Zhang, Quoc Le, Mohammad Norouzi

Abstract
We present SpeechStew, a speech recognition model trained on a combination of publicly available speech recognition datasets: AMI, Broadcast News, Common Voice, LibriSpeech, Switchboard/Fisher, Tedlium, and Wall Street Journal. SpeechStew simply mixes all of these datasets together, without any special re-weighting or re-balancing. It achieves SoTA or near-SoTA results across a variety of tasks without an external language model. Our results include 9.0% WER on AMI-IHM, 4.7% WER on Switchboard, 8.3% WER on CallHome, and 1.3% WER on WSJ, which significantly outperform prior work that uses strong external language models. We also demonstrate that SpeechStew learns powerful transfer learning representations. We fine-tune SpeechStew on CHiME-6, a noisy low-resource speech dataset, and achieve 38.9% WER without a language model, compared with 38.6% WER for a strong HMM baseline with a language model.
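The mixing strategy described in the abstract can be illustrated with a minimal sketch. It assumes that "simply mixing" amounts to concatenating all utterances into one pool and shuffling, so each dataset is drawn in proportion to its size. The function `mix_datasets` and the toy corpora are hypothetical and are not taken from the paper's training pipeline.

```python
import random
from typing import Dict, List, Tuple

# (dataset name, utterance id); both are illustrative placeholders.
Example = Tuple[str, str]


def mix_datasets(datasets: Dict[str, List[str]], seed: int = 0) -> List[Example]:
    """Concatenate every dataset into one pool and shuffle it.

    No per-dataset weights are applied, so a dataset's share of the
    mixture equals its share of the total number of utterances.
    """
    pool = [(name, utt) for name, utts in datasets.items() for utt in utts]
    rng = random.Random(seed)  # fixed seed for a reproducible order
    rng.shuffle(pool)
    return pool


# Toy corpora with made-up sizes (assumption, for illustration only).
toy = {
    "small_corpus": [f"s{i}" for i in range(10)],
    "large_corpus": [f"l{i}" for i in range(90)],
}
mixed = mix_datasets(toy)
share_large = sum(1 for name, _ in mixed if name == "large_corpus") / len(mixed)
```

In this sketch the larger corpus makes up 90% of the mixture simply because it holds 90% of the utterances, which is the behavior one would expect from mixing without re-weighting or re-balancing.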
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| speech-recognition-on-ami-imh | SpeechStew (100M) | Word Error Rate (WER): 9.0 |
| speech-recognition-on-ami-sdm1 | SpeechStew (100M) | Word Error Rate (WER): 21.7 |
| speech-recognition-on-chime-6-dev-gss12 | SpeechStew (1B) | Word Error Rate (WER): 31.9 |
| speech-recognition-on-chime-6-eval | SpeechStew (1B) | Word Error Rate (WER): 38.9 |
| speech-recognition-on-common-voice-2 | SpeechStew (1B) | Word Error Rate (WER): 10.8 |
| speech-recognition-on-librispeech-test-clean | SpeechStew (1B) | Word Error Rate (WER): 1.7 |
| speech-recognition-on-librispeech-test-clean | SpeechStew (100M) | Word Error Rate (WER): 2.0 |
| speech-recognition-on-librispeech-test-other | SpeechStew (1B) | Word Error Rate (WER): 3.3 |
| speech-recognition-on-librispeech-test-other | SpeechStew (100M) | Word Error Rate (WER): 4.0 |
| speech-recognition-on-switchboard-callhome | SpeechStew (100M) | Word Error Rate (WER): 8.3 |
| speech-recognition-on-switchboard-swbd | SpeechStew (100M) | Word Error Rate (WER): 4.7 |
| speech-recognition-on-tedlium | SpeechStew (100M) | Word Error Rate (WER): 5.3 |
| speech-recognition-on-wsj-eval92 | SpeechStew (100M) | Word Error Rate (WER): 1.3 |
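
Every row above reports word error rate (WER) in percent. As a reminder of what the metric measures, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The helper names are hypothetical, and the sketch applies no text normalization; the scoring setup behind the reported numbers is not described here and may differ.

```python
from typing import List


def word_errors(reference: List[str], hypothesis: List[str]) -> int:
    """Word-level Levenshtein distance between reference and hypothesis."""
    # prev[j] is the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """WER as a fraction; multiply by 100 for a percentage."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_errors(ref_words, hypothesis.split()) / len(ref_words)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
example = wer("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 100% when the hypothesis is much longer than the reference.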