Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat, Alexis Conneau, Felix Kreuk, Jade Copet, Alexandre Defossez, Gabriel Synnaeve, Emmanuel Dupoux, Roy Schwartz, Yossi Adi

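Abstract
Speech language models (SpeechLMs) process and generate acoustic data only, without textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using a warm-start from a pretrained textual language model. Both automatic and human evaluations show that TWIST significantly outperforms a cold-start SpeechLM across all metrics. We empirically analyze the effect of different model design choices, including the speech tokenizer, the pretrained textual model, and the size of the training data. We find that model scale and data scale both play a crucial role in building high-performing SpeechLMs. Based on these observations, we present what is, to the best of our knowledge, the largest SpeechLM in terms of both parameter count and training data. We additionally introduce two spoken versions of the StoryCloze textual benchmark to make model evaluation more reliable and to advance future research in the field. Speech samples, code, and models are publicly available at: https://pages.cs.huji.ac.il/adiyoss-lab/twist/ .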
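The warm-start idea can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the parameter names (`embedding`, `output_proj`, `block0.*`), the toy sizes, and the choice to re-initialize only the vocabulary-dependent layers while copying the rest are assumptions made for illustration; the abstract states only that the SpeechLM is initialized from a pretrained textual language model.

```python
import random

def init_matrix(rows, cols, rng):
    """Small random matrix as a list of lists (stand-in for a weight tensor)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

def warm_start(text_lm, speech_vocab_size, seed=0):
    """Build SpeechLM parameters from a (toy) pretrained text LM.

    Assumption: the transformer body is reused unchanged, while the
    vocabulary-dependent layers (input embedding, output projection)
    are re-initialized for the speech-token vocabulary.
    """
    rng = random.Random(seed)
    hidden = len(text_lm["embedding"][0])
    speech_lm = {}
    for name, weights in text_lm.items():
        if name in ("embedding", "output_proj"):
            # Text vocabulary does not apply to speech tokens: start fresh.
            speech_lm[name] = init_matrix(speech_vocab_size, hidden, rng)
        else:
            # Copy pretrained body weights so training starts "warm".
            speech_lm[name] = [row[:] for row in weights]
    return speech_lm

# Hypothetical toy "pretrained" text LM with a 1000-token vocabulary.
rng = random.Random(1)
text_lm = {
    "embedding": init_matrix(1000, 8, rng),
    "block0.attn": init_matrix(8, 8, rng),
    "block0.mlp": init_matrix(8, 8, rng),
    "output_proj": init_matrix(1000, 8, rng),
}

# Warm-started SpeechLM over a hypothetical 500-unit speech vocabulary.
speech_lm = warm_start(text_lm, speech_vocab_size=500)
```

A cold-start baseline, by contrast, would initialize every entry randomly instead of copying the body weights; the comparison in the abstract is between these two starting points.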
Benchmarks
Benchmark: language-modelling-on-salmon

| Metric | TWIST 350M | TWIST 1.3B | TWIST 7B |
|---|---|---|---|
| Background (Domain) Consistency | 54.0 | 55.5 | 55.0 |
| Background (Random) Consistency | 61.5 | 60.5 | 60.5 |
| Background Alignment | 56.5 | 56.5 | 54.5 |
| Gender Consistency | 68.0 | 69.5 | 70.0 |
| Room Consistency | 59.0 | 59.0 | 62.0 |
| Sentiment Alignment | 51.5 | 53.0 | 51.5 |
| Sentiment Consistency | 59.0 | 61.5 | 61.5 |
| Speaker Consistency | 69.5 | 69.0 | 71.0 |