Scribosermo: Fast Speech-to-Text models for German and other Languages
Daniel Bermuth, Alexander Poeppel, Wolfgang Reif

Abstract
Recent Speech-to-Text models often require large amounts of hardware resources and are mostly trained on English. This paper presents Speech-to-Text models for German, as well as for Spanish and French, with the following features: (a) They are small and run in real time on single-board computers like a Raspberry Pi. (b) Starting from a pretrained English model, they can be trained on consumer-grade hardware with a relatively small dataset. (c) The models are competitive with other solutions and outperform them in German. The models thus combine the advantages of other approaches, each of which offers only a subset of these features. Furthermore, the paper provides a new library for handling datasets, designed for easy extension with additional datasets, and shows an optimized way of transfer-learning new languages using a pretrained model from another language with a similar alphabet.
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| speech-recognition-on-common-voice-french | QuartzNet15x5FR (CV-only) | Test WER: 12.1% |
| speech-recognition-on-common-voice-french | ConformerCTC-L (5-gram) | Test WER: 8.13% |
| speech-recognition-on-common-voice-french | ConformerCTC-L (no LM) | Test WER: 10.19% |
| speech-recognition-on-common-voice-french | QuartzNet15x5FR (D7) | Test WER: 11.0% |
| speech-recognition-on-common-voice-german | QuartzNet15x5DE (D37, 5-gram) | Test CER: 2.7%, Test WER: 6.6% |
| speech-recognition-on-common-voice-german | ConformerCTC-L (5-gram) | Test CER: 1.37%, Test WER: 4.05% |
| speech-recognition-on-common-voice-german | QuartzNet15x5DE (CV-only, 5-gram) | Test CER: 3.2%, Test WER: 7.7% |
| speech-recognition-on-common-voice-german | ConformerCTC-L (no LM) | Test CER: 2.05%, Test WER: 7.33% |
| speech-recognition-on-common-voice-italian | QuartzNet15x5IT (D5) | Test WER: 11.5% |
| speech-recognition-on-common-voice-spanish | QuartzNet15x5ES (CV-only) | Test WER: 10.5% |
| speech-recognition-on-common-voice-spanish | ConformerCTC-L (5-gram) | Test WER: 5.68% |
| speech-recognition-on-common-voice-spanish | ConformerCTC-L (no LM) | Test WER: 7.46% |
| speech-recognition-on-common-voice-spanish | QuartzNet15x5ES (D8) | Test WER: 10.0% |
| speech-recognition-on-tuda | QuartzNet15x5DE (D37) | Test WER: 10.2% |
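
The table reports word error rate (WER) and character error rate (CER). As a reading aid, here is a minimal sketch of how these metrics are commonly defined: the Levenshtein edit distance between reference and hypothesis, divided by the reference length, counted in words for WER and in characters for CER. The function names `edit_distance`, `wer`, and `cer` are illustrative only. This is not the evaluation code behind the numbers above, which may apply different text normalization and aggregate errors over the whole test set.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion cost 1)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, 1):
        curr = [i]  # deleting the first i reference items to match an empty hypothesis
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate of one utterance: word-level edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate of one utterance: character-level edits / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical transcripts, chosen only to illustrate the calculation.
    print(wer("das ist ein test", "das ist test"))  # one deleted word out of four
    print(cer("haus", "maus"))                      # one substituted character out of four
```

Under this definition a single dropped word in a four-word reference gives a WER of 0.25, and both metrics can exceed 1.0 when the hypothesis contains many insertions.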