
Abstract
Progress in speech processing has been driven by shared datasets and benchmarks. Historically, these have focused mainly on lower-level tasks such as automatic speech recognition (ASR) and speaker identification. Recently, interest has grown in higher-level spoken language understanding tasks, including end-to-end modeling, but annotated datasets for such tasks remain relatively scarce. At the same time, recent work has shown that pre-training generic representation models and then fine-tuning them on modest amounts of labeled data can yield strong performance on many tasks. We therefore propose a suite of benchmark tasks, Spoken Language Understanding Evaluation (SLUE), consisting of limited-size labeled training sets and corresponding evaluation sets. This resource allows the research community to track progress, assess the suitability of pre-trained representations for higher-level tasks, and study open questions such as the relative merits of pipeline versus end-to-end approaches. This paper presents the first phase of SLUE, covering named entity recognition, sentiment analysis, and ASR on the corresponding datasets. We focus on naturally produced (rather than read or synthesized) speech and on publicly available datasets. We provide new transcriptions and annotations for subsets of the VoxCeleb and VoxPopuli datasets, report evaluation metrics and baseline results, and release a toolkit for reproducing the baselines and evaluating new models.
Code Repository
asappresearch/slue-toolkit (official; PyTorch)
Benchmarks
Named Entity Recognition on SLUE

| Method | F1 (%) | label-F1 (%) | Text model |
|---|---|---|---|
| W2V2-L-LL60K (pipeline approach, uses LM) | 69.6 | 82.2 | DeBERTa-L |
| W2V2-B-LS960 (pipeline approach, uses LM) | 68.0 | 79.8 | DeBERTa-L |
| W2V2-L-LL60K (pipeline approach) | 57.8 | 78.8 | DeBERTa-L |
| W2V2-B-LS960 (pipeline approach) | 49.5 | 74.2 | DeBERTa-L |
| W2V2-L-LL60K (e2e approach, uses LM) | 64.8 | 73.3 | N/A |
| W2V2-B-LS960 (e2e approach, uses LM) | 63.4 | 71.7 | N/A |
| HuBERT-B-LS960 (e2e approach, uses LM) | 61.9 | 70.3 | N/A |
| W2V2-B-VP100K (e2e approach, uses LM) | 61.8 | 69.8 | N/A |
| W2V2-L-LL60K (e2e approach) | 50.9 | 64.7 | - |
| W2V2-B-LS960 (e2e approach) | 50.2 | 64.0 | - |
| HuBERT-B-LS960 (e2e approach) | 49.8 | 62.9 | - |
| W2V2-B-VP100K (e2e approach) | 47.9 | 60.8 | - |
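The table reports two metrics: F1, which scores a predicted entity only when both its label and its phrase match the ground truth, and label-F1, which scores the label alone. A minimal sketch of that distinction (this is an illustrative micro-F1 over multisets, not the official slue-toolkit scorer):

```python
from collections import Counter

def micro_f1(gold, pred):
    """Micro-averaged F1 over multisets of items."""
    g, p = Counter(gold), Counter(pred)
    tp = sum((g & p).values())  # items matched between gold and prediction
    if tp == 0:
        return 0.0
    precision = tp / sum(p.values())
    recall = tp / sum(g.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold and predicted named entities as (label, phrase) pairs.
gold = [("PLACE", "new york"), ("PERSON", "obama"), ("ORG", "un")]
pred = [("PLACE", "new york"), ("PERSON", "osama"), ("ORG", "un")]

# F1 requires both the label and the phrase text to match.
f1 = micro_f1(gold, pred)                                  # 2 of 3 -> ~0.667
# label-F1 ignores the phrase text and compares labels only.
label_f1 = micro_f1([l for l, _ in gold], [l for l, _ in pred])  # 3 of 3 -> 1.0
```

A misrecognized phrase ("osama" for "obama") hurts F1 but not label-F1, which is why the label-F1 column is uniformly higher.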
Sentiment Analysis on SLUE

| Method | Metrics |
|---|---|
| W2V2-L-LL60K (pipeline approach) | - |
| W2V2-L-LL60K (pipeline approach, uses LM) | - |
| W2V2-B-LS960 (pipeline approach, uses LM) | - |
| W2V2-B-LS960 (pipeline approach) | - |
| W2V2-L-LL60K (e2e approach) | - |
| W2V2-B-LS960 (e2e approach) | - |
| HuBERT-B-LS960 (e2e approach) | - |
| W2V2-B-VP100K (e2e approach) | - |
Speech Recognition on SLUE (WER %, lower is better)

| Method | VoxCeleb (Dev) | VoxCeleb (Test) | VoxPopuli (Dev) | VoxPopuli (Test) |
|---|---|---|---|---|
| W2V2-L-LL60K (+ TED-LIUM 3 LM) | 9.1 | 10.8 | 9.1 | 9.3 |
| W2V2-L-LL60K (+ in-domain LM) | 11.8 | 13.8 | 12.0 | 12.5 |
| W2V2-L-LL60K | 11.0 | 13.5 | 14.0 | 12.1 |
| W2V2-B-LS960 (+ TED-LIUM 3 LM) | 13.2 | 15.8 | 12.0 | 12.2 |
| W2V2-B-LS960 (+ in-domain LM) | 15.2 | 18.2 | 14.6 | 15.2 |
| W2V2-B-LS960 | 17.2 | 20.5 | 17.2 | 17.9 |
| HuBERT-B-LS960 | 19.6 | 21.2 | 18.6 | 19.1 |
| W2V2-B-VP100K | 29.9 | 33.4 | 21.6 | 22.4 |
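Word error rate is the word-level edit distance (substitutions, deletions, insertions) between hypothesis and reference, divided by the number of reference words. A short sketch of the standard dynamic-programming computation (illustrative only, not the slue-toolkit scorer, which additionally applies text normalization):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("brow" for "brown") and one deletion ("jumps")
# over 5 reference words: WER = 2/5 = 40%.
score = wer("the quick brown fox jumps", "the quick brow fox")
```

The table values above are this quantity expressed as a percentage, so 9.1 means roughly one word in eleven is wrong.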