Abstract
This paper investigates the use of automatically collected web audio data for the task of spoken language recognition. We generate semi-random search phrases from language-specific Wikipedia data in 107 languages and use them to retrieve videos from YouTube. Speech activity detection and speaker diarization are applied to extract speech segments from the videos. A post-filtering step then removes segments that most likely are not in the target language, raising the proportion of correctly labeled segments to 98%, as measured by crowd-sourced verification. The resulting training dataset (VoxLingua107) contains 6628 hours of speech in total, on average 62 hours per language, and is accompanied by an evaluation set of 1609 verified utterances. We use the dataset to build language recognition models for several spoken language identification tasks. Experiments show that models trained on the automatically retrieved data achieve performance competitive with that obtained using hand-labeled proprietary datasets. The dataset is publicly available.
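As a concrete illustration of the collection pipeline described above, the sketch below samples semi-random search phrases from a local plain-text Wikipedia dump and retrieves candidate video IDs through yt-dlp's search feature. This is a minimal sketch under stated assumptions: the file name `wiki_et.txt`, the phrase length, the result counts, and the use of yt-dlp are illustrative choices, not the authors' exact tooling or parameters.

```python
import random
import subprocess

def random_search_phrases(wiki_text_path: str, num_phrases: int = 5,
                          words_per_phrase: int = 3) -> list[str]:
    """Sample short word n-grams from a language-specific Wikipedia text dump."""
    with open(wiki_text_path, encoding="utf-8") as f:
        words = f.read().split()
    phrases = []
    for _ in range(num_phrases):
        start = random.randrange(len(words) - words_per_phrase)
        phrases.append(" ".join(words[start:start + words_per_phrase]))
    return phrases

def search_youtube_ids(phrase: str, max_results: int = 10) -> list[str]:
    """Retrieve candidate YouTube video IDs via yt-dlp's ytsearch prefix."""
    result = subprocess.run(
        ["yt-dlp", f"ytsearch{max_results}:{phrase}",
         "--skip-download", "--print", "id"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()

if __name__ == "__main__":
    # wiki_et.txt is a hypothetical plain-text dump of one language's Wikipedia.
    for phrase in random_search_phrases("wiki_et.txt"):
        print(phrase, search_youtube_ids(phrase)[:3])
```

In the pipeline described in the abstract, the retrieved videos would then pass through speech activity detection, speaker diarization, and the language-based post-filtering step before entering the dataset.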
Benchmarks
spoken-language-identification-on

| Method | 0..5 sec | 5..20 sec | Average |
|---|---|---|---|
| Cleaned | 13.4 | 6.6 | 7.6 |
| Noisy | 12.3 | 6.1 | 7.1 |
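The duration-bucketed columns above can be read as scores over test utterances grouped by length. A minimal sketch of how such duration-binned results might be computed is shown below; the `Utterance` fields, the bin boundaries, and the interpretation of the numbers as classification error rates (%) are assumptions for illustration, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    duration: float   # utterance length in seconds
    reference: str    # true language label
    hypothesis: str   # predicted language label

def binned_error_rates(utts: list[Utterance]) -> dict[str, float]:
    """Compute per-duration-bin and overall classification error rates (%)."""
    bins: dict[str, list[bool]] = {"0..5 sec": [], "5..20 sec": []}
    for u in utts:
        key = "0..5 sec" if u.duration < 5 else "5..20 sec"
        bins[key].append(u.reference != u.hypothesis)
    rates = {k: 100 * sum(v) / len(v) for k, v in bins.items() if v}
    errors = [u.reference != u.hypothesis for u in utts]
    rates["Average"] = 100 * sum(errors) / len(errors)
    return rates
```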
spoken-language-identification-on-kalaka-3

| Method | EC | EO | PC | PO |
|---|---|---|---|---|
| Model on the automatically filtered (cleaned) data | 0.022 | 0.058 | 0.041 | 0.056 |
| Model on the noisy data | 0.033 | 0.059 | 0.055 | 0.083 |
spoken-language-identification-on-lre07

| Method | 3 sec | 10 sec | 30 sec | Average |
|---|---|---|---|---|
| Fusion of models | 15.29 | 4.54 | 1.30 | 7.04 |
| CNN-SAP | 8.59 | 2.49 | 1.09 | 4.06 |
| GMM-MMI | 17.28 | 5.90 | 2.10 | 8.42 |
| Phonotactic | 18.59 | 6.28 | 1.34 | 8.73 |
| Kaldi i-vector | 26.04 | 11.93 | 4.52 | 14.17 |
| Kaldi i-vector DNN | 19.67 | 7.84 | 3.31 | 10.27 |
| CNN-LDE | 8.25 | 2.61 | 1.16 | 4.00 |
| Resnet34 (cleaned data) | 9.39 | 3.14 | 1.90 | 4.81 |
| Resnet34 (noisy data) | 10.58 | 3.33 | 1.72 | 5.21 |