
Abstract
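Recent speech recognition models often require large amounts of hardware resources and are mostly trained on English corpora. This paper presents speech recognition models for German, Spanish, and French with the following distinguishing features: (a) the models are small and run in real time on microcontrollers such as a Raspberry Pi; (b) starting from a pretrained English model, they can be trained on consumer-grade hardware with a relatively small training dataset; (c) they are competitive with existing solutions and outperform them on German. Existing approaches offer only a subset of these features, whereas the presented models combine all of them. In addition, the paper releases a new library for handling datasets, designed for extensibility so that new datasets can be integrated easily, and describes an optimized cross-lingual transfer-learning procedure: a pretrained model from another language with a similar alphabet is used as the starting point for training the new language.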
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| speech-recognition-on-common-voice-french | QuartzNet15x5FR (CV-only) | Test WER: 12.1% |
| speech-recognition-on-common-voice-french | ConformerCTC-L (5-gram) | Test WER: 8.13% |
| speech-recognition-on-common-voice-french | ConformerCTC-L (no-LM) | Test WER: 10.19% |
| speech-recognition-on-common-voice-french | QuartzNet15x5FR (D7) | Test WER: 11.0% |
| speech-recognition-on-common-voice-german | QuartzNet15x5DE (D37, 5-gram) | Test CER: 2.7%, Test WER: 6.6% |
| speech-recognition-on-common-voice-german | ConformerCTC-L (5-gram) | Test CER: 1.37%, Test WER: 4.05% |
| speech-recognition-on-common-voice-german | QuartzNet15x5DE (CV-only, 5-gram) | Test CER: 3.2%, Test WER: 7.7% |
| speech-recognition-on-common-voice-german | ConformerCTC-L (no-LM) | Test CER: 2.05%, Test WER: 7.33% |
| speech-recognition-on-common-voice-italian | QuartzNet15x5IT (D5) | Test WER: 11.5% |
| speech-recognition-on-common-voice-spanish | QuartzNet15x5ES (CV-only) | Test WER: 10.5% |
| speech-recognition-on-common-voice-spanish | ConformerCTC-L (5-gram) | Test WER: 5.68% |
| speech-recognition-on-common-voice-spanish | ConformerCTC-L (no-LM) | Test WER: 7.46% |
| speech-recognition-on-common-voice-spanish | QuartzNet15x5ES (D8) | Test WER: 10.0% |
| speech-recognition-on-tuda | QuartzNet15x5DE (D37) | Test WER: 10.2% |
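
The table reports word error rate (WER) and character error rate (CER). For readers unfamiliar with these metrics, the following is a minimal sketch of how they are commonly defined: the Levenshtein edit distance between reference and hypothesis, divided by the reference length. The function names `edit_distance`, `wer`, and `cer` are illustrative and are not taken from the paper or its code. Text normalization, whether spaces count as characters, and how scores are aggregated over a test set vary between evaluations and are assumptions here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate for one utterance; words are split on whitespace."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate for one utterance; spaces count as characters (an assumption)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    ref = "das ist ein kleiner test"
    hyp = "das ist ein kleine test"
    print(f"WER: {wer(ref, hyp):.3f}")
    print(f"CER: {cer(ref, hyp):.3f}")
```

In this example one of five reference words is wrong, while only one of 24 reference characters differs, which illustrates why CER values in the table are consistently lower than the corresponding WER values. Corpus-level scores are usually computed by summing edit distances and reference lengths over all utterances rather than averaging per-utterance rates.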