
Lyrics Transcription for Humans: A Readability-Aware Benchmark


Abstract

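Writing down lyrics for human readers requires not only capturing the word sequence accurately, but also adding punctuation and formatting that improve clarity and convey contextual information, such as song structure, emotional emphasis, and the contrast between lead and backing vocals. Although automatic lyrics transcription (ALT) systems have moved beyond producing unstructured strings of words and can draw on wider context, ALT benchmarks have not kept pace and still focus on the words alone. To close this gap, we introduce Jam-ALT, a comprehensive lyrics transcription benchmark. It consists of a thorough revision of the JamendoLyrics dataset that follows industry standards for lyrics transcription and formatting, together with evaluation metrics designed to capture and assess lyrics-specific nuances, laying the groundwork for more readable lyrics transcripts. We apply the benchmark to recent transcription systems and present additional error analysis, as well as an experimental comparison with a classical music dataset.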
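To make the difference between WER and case-sensitive WER concrete, here is a minimal sketch. It is not the Jam-ALT evaluation code: the tokenizer (which simply drops punctuation) and the function names `tokenize`, `edit_distance` and `wer` are illustrative assumptions, and the benchmark's own normalization rules may differ.

```python
import re
from typing import List


def tokenize(text: str, case_sensitive: bool) -> List[str]:
    # Simplified normalization (an assumption, not Jam-ALT's actual rules):
    # keep runs of word characters and apostrophes, drop other punctuation.
    # In Python 3, \w is Unicode-aware, so accented letters are kept.
    words = re.findall(r"[\w']+", text)
    return words if case_sensitive else [w.lower() for w in words]


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    # Word-level Levenshtein distance, computed one DP row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cur[j] = min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution (cost 0 if words match)
            )
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str, case_sensitive: bool = False) -> float:
    # Errors (substitutions + deletions + insertions) per reference word.
    ref = tokenize(reference, case_sensitive)
    hyp = tokenize(hypothesis, case_sensitive)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


ref = "Oh, the River runs tonight"
hyp = "oh the river runs tonight"
print(wer(ref, hyp))                       # 0.0
print(wer(ref, hyp, case_sensitive=True))  # 0.4
```

Lowercasing before alignment can only turn substitutions into matches, so under the same tokenization case-sensitive WER is never lower than WER; here the two capitalization errors are invisible to plain WER but cost 2 of 5 words in the case-sensitive variant.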


Benchmarks

Results are grouped by benchmark and listed per method. All values are percentages; lower is better for WER and case-sensitive WER, higher is better for the F1 scores. "–" means the metric is not reported for that method.

Benchmark: automatic-lyrics-transcription-on-jam-alt

| Method | WER | Case-sensitive WER | Line break F1 | Punctuation F1 | Section break F1 | Parenthesis F1 |
|---|---|---|---|---|---|---|
| Whisper v3 +demucs | 48.0 | 51.6 | 65.7 | 33.0 | – | – |
| Whisper v2 +demucs | 44.5 | 49.8 | 61.2 | 41.6 | – | – |
| Whisper v3 | 35.5 | 39.7 | 73.5 | 43.0 | 1.0 | – |
| OWSM v3.1 +demucs +lang | 66.5 | 72.6 | 41.1 | 20.0 | – | 0.0 |
| Whisper v3 +demucs +lang | 46.6 | 50.4 | 65.8 | 33.7 | – | – |
| Whisper v2 | 37.8 | 42.1 | 69.3 | 44.2 | 3.3 | – |
| Whisper v3 +lang | 32.6 | 37.2 | 73.9 | 43.7 | 0.6 | – |
| AudioShake v3 | 16.1 | 20.1 | 84.4 | 57.0 | 73.9 | 29.4 |
| Whisper v2 +lang | 27.9 | 32.6 | 70.4 | 45.0 | 3.7 | – |
| OWSM v3.1 +lang | 69.3 | 75.0 | 37.8 | 22.5 | – | 0.6 |
| Whisper v2 +demucs +lang | 33.5 | 39.3 | 60.6 | 39.4 | – | – |

Benchmark: automatic-lyrics-transcription-on-jam-alt-1

| Method | WER | Case-sensitive WER | Line break F1 | Punctuation F1 | Section break F1 | Parenthesis F1 |
|---|---|---|---|---|---|---|
| Whisper v3 | 37.7 | 42.5 | 71.5 | 41.4 | 2.6 | – |
| Whisper v3 +demucs | 43.0 | 47.2 | 66.9 | 25.8 | – | – |
| Whisper v3 +lang | 36.4 | 41.4 | 72.5 | 41.8 | 2.6 | – |
| Whisper v2 +demucs +lang | 35.6 | 41.3 | 53.4 | 41.8 | – | – |
| Whisper v2 +lang | 39.7 | 43.7 | 65.5 | 34.9 | 11.6 | – |
| OWSM v3.1 +demucs +lang | 63.4 | 69.4 | 47.3 | 21.5 | – | 0.0 |
| Whisper v3 +demucs +lang | 43.0 | 47.2 | 66.9 | 25.8 | – | – |
| LyricWhiz | 24.6 | 28.0 | 74.0 | 34.0 | 1.4 | – |
| Whisper v2 | 43.8 | 47.5 | 63.0 | 31.5 | 11.2 | – |
| AudioShake v3 | 17.3 | 20.9 | 84.3 | 65.3 | 84.8 | 37.9 |
| Whisper v2 +demucs | 33.3 | 39.1 | 53.9 | 42.2 | – | – |
| OWSM v3.1 +lang | 68.6 | 74.0 | 42.7 | 22.3 | – | – |

Benchmark: automatic-lyrics-transcription-on-jam-alt-2

| Method | WER | Case-sensitive WER | Line break F1 | Punctuation F1 | Section break F1 | Parenthesis F1 |
|---|---|---|---|---|---|---|
| OWSM v3.1 +demucs +lang | 70.8 | 76.0 | 33.5 | 9.0 | – | – |
| AudioShake v3 | 12.6 | 17.7 | 81.5 | 56.7 | 66.4 | 4.2 |
| Whisper v2 +demucs +lang | 34.9 | 42.2 | 52.6 | 34.3 | – | – |
| Whisper v2 +lang | 21.9 | 27.7 | 71.5 | 52.5 | 3.1 | – |
| Whisper v3 +lang | 22.4 | 28.0 | 74.5 | 44.5 | 0.0 | – |
| Whisper v3 +demucs | 61.5 | 64.9 | 52.3 | 32.4 | – | – |
| Whisper v3 | 28.6 | 33.6 | 73.7 | 42.5 | – | – |
| Whisper v2 +demucs | 39.6 | 46.5 | 56.6 | 40.4 | – | – |
| Whisper v2 | 25.8 | 31.5 | 71.7 | 52.8 | 3.1 | – |
| OWSM v3.1 +lang | 73.3 | 78.5 | 30.2 | 8.8 | – | 0.0 |
| Whisper v3 +demucs +lang | 58.6 | 62.1 | 54.7 | 34.4 | – | – |

Benchmark: automatic-lyrics-transcription-on-jam-alt-3

| Method | WER | Case-sensitive WER | Line break F1 | Punctuation F1 | Section break F1 | Parenthesis F1 |
|---|---|---|---|---|---|---|
| Whisper v2 +lang | 19.9 | 26.0 | 71.7 | 48.4 | – | – |
| Whisper v2 +demucs | 65.2 | 70.4 | 67.3 | 49.1 | – | – |
| Whisper v3 +demucs | 43.5 | 47.4 | 71.9 | 45.4 | – | – |
| Whisper v3 | 40.7 | 44.6 | 71.1 | 47.3 | 1.2 | – |
| Whisper v2 +demucs +lang | 23.9 | 30.4 | 70.6 | 49.2 | – | – |
| OWSM v3.1 +lang | 63.3 | 71.8 | 40.7 | 28.6 | – | 0.0 |
| Whisper v3 +demucs +lang | 40.8 | 44.9 | 70.5 | 46.9 | – | – |
| Whisper v2 | 54.5 | 59.3 | 70.0 | 47.1 | – | – |
| Whisper v3 +lang | 35.9 | 40.4 | 71.1 | 47.4 | – | – |
| OWSM v3.1 +demucs +lang | 51.8 | 62.0 | 41.4 | 24.7 | – | – |
| AudioShake v3 | 12.6 | 17.5 | 83.7 | 57.1 | 74.5 | 76.6 |

Benchmark: automatic-lyrics-transcription-on-jam-alt-4

| Method | WER | Case-sensitive WER | Line break F1 | Punctuation F1 | Section break F1 | Parenthesis F1 |
|---|---|---|---|---|---|---|
| Whisper v2 | 27.7 | 31.1 | 73.4 | 45.9 | 1.4 | – |
| Whisper v2 +lang | 27.1 | 30.5 | 73.7 | 45.3 | – | – |
| Whisper v2 +demucs +lang | 38.2 | 42.1 | 65.6 | 36.1 | – | – |
| Whisper v3 +lang | 34.7 | 38.0 | 77.9 | 42.3 | – | – |
| Whisper v3 +demucs | 44.9 | 48.2 | 69.3 | 32.0 | – | – |
| Whisper v3 | 34.7 | 38.0 | 77.9 | 42.5 | – | – |
| OWSM v3.1 +lang | 71.6 | 75.7 | 36.0 | 30.6 | – | 1.9 |
| OWSM v3.1 +demucs +lang | 78.5 | 82.1 | 40.9 | 22.3 | – | 0.0 |
| Whisper v3 +demucs +lang | 44.9 | 48.3 | 69.3 | 32.0 | – | – |
| Whisper v2 +demucs | 43.3 | 46.9 | 66.0 | 38.0 | – | – |
| AudioShake v3 | 20.8 | 23.5 | 88.6 | 46.1 | 69.0 | 3.2 |
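The formatting metrics (line break, punctuation, section break and parenthesis F1) score whether a system places a given kind of token where the reference does. The sketch below illustrates the idea for line breaks only, under a strong simplifying assumption: the hypothesis contains exactly the same words as the reference, so word indices are directly comparable. The real benchmark has to align transcripts that differ in wording first, which this sketch omits, and the helper names `line_break_positions` and `f1` are hypothetical.

```python
from typing import Set


def line_break_positions(lyrics: str) -> Set[int]:
    # Indices of words that are immediately followed by a line break.
    # Hypothetical helper: assumes hypothesis and reference share the same
    # words; blank lines (section breaks) are simply skipped here.
    positions: Set[int] = set()
    word_count = 0
    lines = lyrics.strip().split("\n")
    for line in lines[:-1]:  # the final line is not followed by a break
        words = line.split()
        if words:
            word_count += len(words)
            positions.add(word_count - 1)
    return positions


def f1(reference: Set[int], hypothesis: Set[int]) -> float:
    # Harmonic mean of precision (correct predicted breaks / predicted breaks)
    # and recall (correct predicted breaks / reference breaks).
    tp = len(reference & hypothesis)
    precision = tp / len(hypothesis) if hypothesis else 0.0
    recall = tp / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


ref = "la la la\nhey hey\noh oh"
hyp = "la la la\nhey hey oh oh"
print(f1(line_break_positions(ref), line_break_positions(hyp)))  # about 0.667
```

Here the hypothesis finds one of the two reference line breaks and predicts no spurious ones, so precision is 1.0, recall is 0.5, and F1 is 2/3.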