SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network

William Chan Daniel Park Chris Lee Yu Zhang Quoc Le Mohammad Norouzi

Abstract

We present SpeechStew, a speech recognition model trained on a combination of publicly available speech recognition datasets: AMI, Broadcast News, Common Voice, LibriSpeech, Switchboard/Fisher, Tedlium, and Wall Street Journal. SpeechStew simply mixes all of these datasets together, without any special re-weighting or re-balancing. SpeechStew achieves SoTA or near-SoTA results across a variety of tasks without the use of an external language model. Our results include 9.0% WER on AMI-IHM, 4.7% WER on Switchboard, 8.3% WER on CallHome, and 1.3% WER on WSJ, significantly outperforming prior work that uses strong external language models. We also demonstrate that SpeechStew learns powerful representations for transfer learning. We fine-tune SpeechStew on CHiME-6, a noisy, low-resource speech dataset, and achieve 38.9% WER without a language model, compared with 38.6% WER for a strong HMM baseline with a language model.
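The mixing strategy described in the abstract can be illustrated with a minimal sketch. This is not the authors' training pipeline: the function name `mix_datasets`, the toy corpus names, and the utterance counts are all hypothetical, and real corpora would be streamed rather than held in lists. The sketch only shows what "no re-weighting" implies, namely that pooling and shuffling leaves each corpus represented in proportion to its size.

```python
import random
from collections import Counter


def mix_datasets(datasets, seed=0):
    """Pool every example from every corpus and shuffle the result.

    `datasets` maps a corpus name to a list of examples. No per-corpus
    weights are applied, so a corpus contributes examples in proportion
    to its size.
    """
    pool = [(name, example)
            for name, examples in datasets.items()
            for example in examples]
    rng = random.Random(seed)  # fixed seed keeps the shuffle reproducible
    rng.shuffle(pool)
    return pool


# Hypothetical toy corpora; the sizes are illustrative only.
toy_corpora = {
    "librispeech": [f"ls_utt_{i}" for i in range(6)],
    "switchboard": [f"swbd_utt_{i}" for i in range(3)],
    "ami": ["ami_utt_0"],
}

mixed = mix_datasets(toy_corpora)
# Count how many pooled examples came from each corpus.
source_counts = Counter(name for name, _ in mixed)
print(source_counts)
```

In this toy setup the larger corpus dominates the pool simply because it has more examples, which is the behavior that explicit re-balancing schemes would change.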

Benchmarks

Benchmark | Methodology | Metrics
speech-recognition-on-ami-imh | SpeechStew (100M) | Word Error Rate (WER): 9
speech-recognition-on-ami-sdm1 | SpeechStew (100M) | Word Error Rate (WER): 21.7
speech-recognition-on-chime-6-dev-gss12 | SpeechStew (1B) | Word Error Rate (WER): 31.9
speech-recognition-on-chime-6-eval | SpeechStew (1B) | Word Error Rate (WER): 38.9
speech-recognition-on-common-voice-2 | SpeechStew (1B) | Test WER: 10.8%
speech-recognition-on-librispeech-test-clean | SpeechStew (1B) | Word Error Rate (WER): 1.7
speech-recognition-on-librispeech-test-clean | SpeechStew (100M) | Word Error Rate (WER): 2.0
speech-recognition-on-librispeech-test-other | SpeechStew (1B) | Word Error Rate (WER): 3.3
speech-recognition-on-librispeech-test-other | SpeechStew (100M) | Word Error Rate (WER): 4.0
speech-recognition-on-switchboard-callhome | SpeechStew (100M) | Word Error Rate (WER): 8.3
speech-recognition-on-switchboard-swbd | SpeechStew (100M) | Word Error Rate (WER): 4.7
speech-recognition-on-tedlium | SpeechStew (100M) | Word Error Rate (WER): 5.3
speech-recognition-on-wsj-eval92 | SpeechStew (100M) | Word Error Rate (WER): 1.3
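
Every benchmark above is reported as word error rate (WER). As a reminder of what that number measures, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) between hypothesis and reference, divided by the number of reference words. The function name `word_error_rate` is hypothetical, whitespace tokenization is an assumption, and the scoring scripts used for the reported results may apply text normalization not shown here.

```python
def word_error_rate(reference, hypothesis):
    """Return WER = word-level edit distance / number of reference words.

    Tokenization is plain whitespace splitting (an assumption for this
    sketch). The result is a fraction; multiply by 100 for a percentage.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref_words)


# One of six reference words is dropped, so the WER is 1/6.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"{100 * wer:.1f}% WER")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.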
