Zengwei Yao, Wei Kang, Xiaoyu Yang, Fangjun Kuang, Liyong Guo, Han Zhu, Zengrui Jin, Zhaoqing Li, Long Lin, Daniel Povey

Abstract
Connectionist Temporal Classification (CTC) is a widely used method in automatic speech recognition (ASR), valued for its simplicity and computational efficiency. However, it often falls short in recognition performance. In this work, we propose Consistency-Regularized CTC (CR-CTC), which enforces consistency between two CTC distributions obtained from different augmented views of the input mel-spectrogram. We provide in-depth insights into its essential behaviors from three perspectives: 1) it conducts self-distillation between pairs of random sub-models that process different augmented views; 2) it learns contextual representations through masked prediction at positions within time-masked regions, especially when the amount of time masking is increased; 3) it suppresses extremely peaky CTC distributions, thereby reducing overfitting and improving generalization. Extensive experiments on the LibriSpeech, Aishell-1, and GigaSpeech datasets demonstrate the effectiveness of CR-CTC: it significantly improves CTC performance, achieving state-of-the-art results comparable to those of transducer models or systems combining CTC with an attention-based encoder-decoder (CTC/AED). Our code is available at https://github.com/k2-fsa/icefall.
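The training objective described above can be sketched in PyTorch as two standard CTC losses (one per augmented view) plus a symmetric consistency term between the two frame-level output distributions, with gradients stopped on the "teacher" side to reflect the self-distillation view. This is a minimal illustration, not the authors' icefall implementation; the function name, weighting scheme, and `alpha` value are assumptions.

```python
import torch
import torch.nn.functional as F


def cr_ctc_loss(log_probs_a, log_probs_b, targets, input_lengths,
                target_lengths, alpha=0.2):
    """Sketch of a CR-CTC-style loss.

    log_probs_a, log_probs_b: (T, N, C) log-softmax outputs of the same
        model for two differently augmented views of the input.
    alpha: weight of the consistency term (illustrative value).
    """
    # Standard CTC loss on each augmented view.
    ctc_a = F.ctc_loss(log_probs_a, targets, input_lengths, target_lengths)
    ctc_b = F.ctc_loss(log_probs_b, targets, input_lengths, target_lengths)

    # Symmetric self-distillation: each branch matches the other's
    # frame-level distribution, with the target branch detached.
    kl_ab = F.kl_div(log_probs_a, log_probs_b.detach(),
                     log_target=True, reduction="batchmean")
    kl_ba = F.kl_div(log_probs_b, log_probs_a.detach(),
                     log_target=True, reduction="batchmean")

    return 0.5 * (ctc_a + ctc_b) + alpha * 0.5 * (kl_ab + kl_ba)
```

In practice the two views would come from applying independent SpecAugment masks (with an increased time-masking ratio, per the paper's analysis) to the same mel-spectrogram before two forward passes.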
Code Repository
k2-fsa/icefall
Official
pytorch
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| speech-recognition-on-aishell-1 | Zipformer+CR-CTC (no external language model) | Params (M): 66.2, Word Error Rate (WER): 4.02 |
| speech-recognition-on-gigaspeech-dev | Zipformer+pruned transducer w/ CR-CTC (no external language model) | Word Error Rate (WER): 9.95 |
| speech-recognition-on-gigaspeech-dev | Zipformer+CR-CTC (no external language model) | Word Error Rate (WER): 10.15 |
| speech-recognition-on-gigaspeech-dev | Zipformer+pruned transducer (no external language model) | Word Error Rate (WER): 10.09 |
| speech-recognition-on-gigaspeech-test | Zipformer+CR-CTC (no external language model) | Word Error Rate (WER): 10.28 |
| speech-recognition-on-gigaspeech-test | Zipformer+CR-CTC/AED (no external language model) | Word Error Rate (WER): 10.07 |
| speech-recognition-on-gigaspeech-test | Zipformer+pruned transducer w/ CR-CTC (no external language model) | Word Error Rate (WER): 10.03 |
| speech-recognition-on-gigaspeech-test | Zipformer+pruned transducer (no external language model) | Word Error Rate (WER): 10.2 |
| speech-recognition-on-librispeech-test-clean | Zipformer+CR-CTC (no external language model) | Word Error Rate (WER): 2.02 |
| speech-recognition-on-librispeech-test-clean | Zipformer+pruned transducer w/ CR-CTC (no external language model) | Word Error Rate (WER): 1.88 |
| speech-recognition-on-librispeech-test-other | Zipformer+pruned transducer w/ CR-CTC (no external language model) | Word Error Rate (WER): 3.95 |
| speech-recognition-on-librispeech-test-other | Zipformer+CR-CTC (no external language model) | Word Error Rate (WER): 4.35 |