
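Abstract
Writing down lyrics for human readers requires more than capturing the word sequence accurately: punctuation and formatting must also be added to improve clarity and convey contextual information. This includes song structure, emotional emphasis, and the contrast between lead and backing vocals. Although automatic lyrics transcription (ALT) systems have moved beyond producing unstructured strings of words and can draw on wider context, ALT benchmarks have not kept pace with this progress and still focus on the words alone. To close this gap, we introduce Jam-ALT, a comprehensive lyrics transcription benchmark. The benchmark features a thorough revision of the JamendoLyrics dataset that follows industry standards for lyrics transcription and formatting, along with evaluation metrics designed to capture and assess lyrics-specific nuances, laying the foundation for improving the readability of lyrics. We apply the benchmark to recent transcription systems and provide additional error analysis, as well as an experimental comparison with a classical music dataset.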
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| automatic-lyrics-transcription-on-jam-alt | Whisper v3 +demucs | Case-Sensitive Word Error Rate: 51.6 Line break F1: 65.7 Punctuation F1: 33.0 Word Error Rate (WER): 48.0 |
| automatic-lyrics-transcription-on-jam-alt | Whisper v2 +demucs | Case-Sensitive Word Error Rate: 49.8 Line break F1: 61.2 Punctuation F1: 41.6 Word Error Rate (WER): 44.5 |
| automatic-lyrics-transcription-on-jam-alt | Whisper v3 | Case-Sensitive Word Error Rate: 39.7 Line break F1: 73.5 Punctuation F1: 43.0 Section break F1: 1.0 Word Error Rate (WER): 35.5 |
| automatic-lyrics-transcription-on-jam-alt | OWSM v3.1 +demucs +lang | Case-Sensitive Word Error Rate: 72.6 Line break F1: 41.1 Parenthesis F-1: 0.0 Punctuation F1: 20.0 Word Error Rate (WER): 66.5 |
| automatic-lyrics-transcription-on-jam-alt | Whisper v3 +demucs +lang | Case-Sensitive Word Error Rate: 50.4 Line break F1: 65.8 Punctuation F1: 33.7 Word Error Rate (WER): 46.6 |
| automatic-lyrics-transcription-on-jam-alt | Whisper v2 | Case-Sensitive Word Error Rate: 42.1 Line break F1: 69.3 Punctuation F1: 44.2 Section break F1: 3.3 Word Error Rate (WER): 37.8 |
| automatic-lyrics-transcription-on-jam-alt | Whisper v3 +lang | Case-Sensitive Word Error Rate: 37.2 Line break F1: 73.9 Punctuation F1: 43.7 Section break F1: 0.6 Word Error Rate (WER): 32.6 |
| automatic-lyrics-transcription-on-jam-alt | AudioShake v3 | Case-Sensitive Word Error Rate: 20.1 Line break F1: 84.4 Parenthesis F-1: 29.4 Punctuation F1: 57.0 Section break F1: 73.9 Word Error Rate (WER): 16.1 |
| automatic-lyrics-transcription-on-jam-alt | Whisper v2 +lang | Case-Sensitive Word Error Rate: 32.6 Line break F1: 70.4 Punctuation F1: 45.0 Section break F1: 3.7 Word Error Rate (WER): 27.9 |
| automatic-lyrics-transcription-on-jam-alt | OWSM v3.1 +lang | Case-Sensitive Word Error Rate: 75.0 Line break F1: 37.8 Parenthesis F-1: 0.6 Punctuation F1: 22.5 Word Error Rate (WER): 69.3 |
| automatic-lyrics-transcription-on-jam-alt | Whisper v2 +demucs +lang | Case-Sensitive Word Error Rate: 39.3 Line break F1: 60.6 Punctuation F1: 39.4 Word Error Rate (WER): 33.5 |
| automatic-lyrics-transcription-on-jam-alt-1 | Whisper v3 | Case-Sensitive Word Error Rate: 42.5 Line break F-1: 71.5 Punctuation F-1: 41.4 Section break F-1: 2.6 Word Error Rate (WER): 37.7 |
| automatic-lyrics-transcription-on-jam-alt-1 | Whisper v3 +demucs | Case-Sensitive Word Error Rate: 47.2 Line break F-1: 66.9 Punctuation F-1: 25.8 Word Error Rate (WER): 43.0 |
| automatic-lyrics-transcription-on-jam-alt-1 | Whisper v3 +lang | Case-Sensitive Word Error Rate: 41.4 Line break F-1: 72.5 Punctuation F-1: 41.8 Section break F-1: 2.6 Word Error Rate (WER): 36.4 |
| automatic-lyrics-transcription-on-jam-alt-1 | Whisper v2 +demucs +lang | Case-Sensitive Word Error Rate: 41.3 Line break F-1: 53.4 Punctuation F-1: 41.8 Word Error Rate (WER): 35.6 |
| automatic-lyrics-transcription-on-jam-alt-1 | Whisper v2 +lang | Case-Sensitive Word Error Rate: 43.7 Line break F-1: 65.5 Punctuation F-1: 34.9 Section break F-1: 11.6 Word Error Rate (WER): 39.7 |
| automatic-lyrics-transcription-on-jam-alt-1 | OWSM v3.1 +demucs +lang | Case-Sensitive Word Error Rate: 69.4 Line break F-1: 47.3 Parenthesis F-1: 0.0 Punctuation F-1: 21.5 Word Error Rate (WER): 63.4 |
| automatic-lyrics-transcription-on-jam-alt-1 | Whisper v3 +demucs +lang | Case-Sensitive Word Error Rate: 47.2 Line break F-1: 66.9 Punctuation F-1: 25.8 Word Error Rate (WER): 43.0 |
| automatic-lyrics-transcription-on-jam-alt-1 | LyricWhiz | Case-Sensitive Word Error Rate: 28.0 Line break F-1: 74.0 Punctuation F-1: 34.0 Section break F-1: 1.4 Word Error Rate (WER): 24.6 |
| automatic-lyrics-transcription-on-jam-alt-1 | Whisper v2 | Case-Sensitive Word Error Rate: 47.5 Line break F-1: 63.0 Punctuation F-1: 31.5 Section break F-1: 11.2 Word Error Rate (WER): 43.8 |
| automatic-lyrics-transcription-on-jam-alt-1 | AudioShake v3 | Case-Sensitive Word Error Rate: 20.9 Line break F-1: 84.3 Parenthesis F-1: 37.9 Punctuation F-1: 65.3 Section break F-1: 84.8 Word Error Rate (WER): 17.3 |
| automatic-lyrics-transcription-on-jam-alt-1 | Whisper v2 +demucs | Case-Sensitive Word Error Rate: 39.1 Line break F-1: 53.9 Punctuation F-1: 42.2 Word Error Rate (WER): 33.3 |
| automatic-lyrics-transcription-on-jam-alt-1 | OWSM v3.1 +lang | Case-Sensitive Word Error Rate: 74.0 Line break F-1: 42.7 Punctuation F-1: 22.3 Word Error Rate (WER): 68.6 |
| automatic-lyrics-transcription-on-jam-alt-2 | OWSM v3.1 +demucs +lang | Case-Sensitive Word Error Rate: 76.0 Line break F-1: 33.5 Punctuation F-1: 9.0 Word Error Rate (WER): 70.8 |
| automatic-lyrics-transcription-on-jam-alt-2 | AudioShake v3 | Case-Sensitive Word Error Rate: 17.7 Line break F-1: 81.5 Parenthesis F-1: 4.2 Punctuation F-1: 56.7 Section break F-1: 66.4 Word Error Rate (WER): 12.6 |
| automatic-lyrics-transcription-on-jam-alt-2 | Whisper v2 +demucs +lang | Case-Sensitive Word Error Rate: 42.2 Line break F-1: 52.6 Punctuation F-1: 34.3 Word Error Rate (WER): 34.9 |
| automatic-lyrics-transcription-on-jam-alt-2 | Whisper v2 +lang | Case-Sensitive Word Error Rate: 27.7 Line break F-1: 71.5 Punctuation F-1: 52.5 Section break F-1: 3.1 Word Error Rate (WER): 21.9 |
| automatic-lyrics-transcription-on-jam-alt-2 | Whisper v3 +lang | Case-Sensitive Word Error Rate: 28.0 Line break F-1: 74.5 Punctuation F-1: 44.5 Section break F-1: 0.0 Word Error Rate (WER): 22.4 |
| automatic-lyrics-transcription-on-jam-alt-2 | Whisper v3 +demucs | Case-Sensitive Word Error Rate: 64.9 Line break F-1: 52.3 Punctuation F-1: 32.4 Word Error Rate (WER): 61.5 |
| automatic-lyrics-transcription-on-jam-alt-2 | Whisper v3 | Case-Sensitive Word Error Rate: 33.6 Line break F-1: 73.7 Punctuation F-1: 42.5 Word Error Rate (WER): 28.6 |
| automatic-lyrics-transcription-on-jam-alt-2 | Whisper v2 +demucs | Case-Sensitive Word Error Rate: 46.5 Line break F-1: 56.6 Punctuation F-1: 40.4 Word Error Rate (WER): 39.6 |
| automatic-lyrics-transcription-on-jam-alt-2 | Whisper v2 | Case-Sensitive Word Error Rate: 31.5 Line break F-1: 71.7 Punctuation F-1: 52.8 Section break F-1: 3.1 Word Error Rate (WER): 25.8 |
| automatic-lyrics-transcription-on-jam-alt-2 | OWSM v3.1 +lang | Case-Sensitive Word Error Rate: 78.5 Line break F-1: 30.2 Parenthesis F-1: 0.0 Punctuation F-1: 8.8 Word Error Rate (WER): 73.3 |
| automatic-lyrics-transcription-on-jam-alt-2 | Whisper v3 +demucs +lang | Case-Sensitive Word Error Rate: 62.1 Line break F-1: 54.7 Punctuation F-1: 34.4 Word Error Rate (WER): 58.6 |
| automatic-lyrics-transcription-on-jam-alt-3 | Whisper v2 +lang | Case-Sensitive Word Error Rate: 26.0 Line break F-1: 71.7 Punctuation F-1: 48.4 Word Error Rate (WER): 19.9 |
| automatic-lyrics-transcription-on-jam-alt-3 | Whisper v2 +demucs | Case-Sensitive Word Error Rate: 70.4 Line break F-1: 67.3 Punctuation F-1: 49.1 Word Error Rate (WER): 65.2 |
| automatic-lyrics-transcription-on-jam-alt-3 | Whisper v3 +demucs | Case-Sensitive Word Error Rate: 47.4 Line break F-1: 71.9 Punctuation F-1: 45.4 Word Error Rate (WER): 43.5 |
| automatic-lyrics-transcription-on-jam-alt-3 | Whisper v3 | Case-Sensitive Word Error Rate: 44.6 Line break F-1: 71.1 Punctuation F-1: 47.3 Section break F-1: 1.2 Word Error Rate (WER): 40.7 |
| automatic-lyrics-transcription-on-jam-alt-3 | Whisper v2 +demucs +lang | Case-Sensitive Word Error Rate: 30.4 Line break F-1: 70.6 Punctuation F-1: 49.2 Word Error Rate (WER): 23.9 |
| automatic-lyrics-transcription-on-jam-alt-3 | OWSM v3.1 +lang | Case-Sensitive Word Error Rate: 71.8 Line break F-1: 40.7 Parenthesis F-1: 0.0 Punctuation F-1: 28.6 Word Error Rate (WER): 63.3 |
| automatic-lyrics-transcription-on-jam-alt-3 | Whisper v3 +demucs +lang | Case-Sensitive Word Error Rate: 44.9 Line break F-1: 70.5 Punctuation F-1: 46.9 Word Error Rate (WER): 40.8 |
| automatic-lyrics-transcription-on-jam-alt-3 | Whisper v2 | Case-Sensitive Word Error Rate: 59.3 Line break F-1: 70.0 Punctuation F-1: 47.1 Word Error Rate (WER): 54.5 |
| automatic-lyrics-transcription-on-jam-alt-3 | Whisper v3 +lang | Case-Sensitive Word Error Rate: 40.4 Line break F-1: 71.1 Punctuation F-1: 47.4 Word Error Rate (WER): 35.9 |
| automatic-lyrics-transcription-on-jam-alt-3 | OWSM v3.1 +demucs +lang | Case-Sensitive Word Error Rate: 62.0 Line break F-1: 41.4 Punctuation F-1: 24.7 Word Error Rate (WER): 51.8 |
| automatic-lyrics-transcription-on-jam-alt-3 | AudioShake v3 | Case-Sensitive Word Error Rate: 17.5 Line break F-1: 83.7 Parenthesis F-1: 76.6 Punctuation F-1: 57.1 Section break F-1: 74.5 Word Error Rate (WER): 12.6 |
| automatic-lyrics-transcription-on-jam-alt-4 | Whisper v2 | Case-Sensitive Word Error Rate: 31.1 Line break F-1: 73.4 Punctuation F-1: 45.9 Section break F-1: 1.4 Word Error Rate (WER): 27.7 |
| automatic-lyrics-transcription-on-jam-alt-4 | Whisper v2 +lang | Case-Sensitive Word Error Rate: 30.5 Line break F-1: 73.7 Punctuation F-1: 45.3 Word Error Rate (WER): 27.1 |
| automatic-lyrics-transcription-on-jam-alt-4 | Whisper v2 +demucs +lang | Case-Sensitive Word Error Rate: 42.1 Line break F-1: 65.6 Punctuation F-1: 36.1 Word Error Rate (WER): 38.2 |
| automatic-lyrics-transcription-on-jam-alt-4 | Whisper v3 +lang | Case-Sensitive Word Error Rate: 38.0 Line break F-1: 77.9 Punctuation F-1: 42.3 Word Error Rate (WER): 34.7 |
| automatic-lyrics-transcription-on-jam-alt-4 | Whisper v3 +demucs | Case-Sensitive Word Error Rate: 48.2 Line break F-1: 69.3 Punctuation F-1: 32.0 Word Error Rate (WER): 44.9 |
| automatic-lyrics-transcription-on-jam-alt-4 | Whisper v3 | Case-Sensitive Word Error Rate: 38.0 Line break F-1: 77.9 Punctuation F-1: 42.5 Word Error Rate (WER): 34.7 |
| automatic-lyrics-transcription-on-jam-alt-4 | OWSM v3.1 +lang | Case-Sensitive Word Error Rate: 75.7 Line break F-1: 36.0 Parenthesis F-1: 1.9 Punctuation F-1: 30.6 Word Error Rate (WER): 71.6 |
| automatic-lyrics-transcription-on-jam-alt-4 | OWSM v3.1 +demucs +lang | Case-Sensitive Word Error Rate: 82.1 Line break F-1: 40.9 Parenthesis F-1: 0.0 Punctuation F-1: 22.3 Word Error Rate (WER): 78.5 |
| automatic-lyrics-transcription-on-jam-alt-4 | Whisper v3 +demucs +lang | Case-Sensitive Word Error Rate: 48.3 Line break F-1: 69.3 Punctuation F-1: 32.0 Word Error Rate (WER): 44.9 |
| automatic-lyrics-transcription-on-jam-alt-4 | Whisper v2 +demucs | Case-Sensitive Word Error Rate: 46.9 Line break F-1: 66.0 Punctuation F-1: 38.0 Word Error Rate (WER): 43.3 |
| automatic-lyrics-transcription-on-jam-alt-4 | AudioShake v3 | Case-Sensitive Word Error Rate: 23.5 Line break F-1: 88.6 Parenthesis F-1: 3.2 Punctuation F-1: 46.1 Section break F-1: 69.0 Word Error Rate (WER): 20.8 |
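
To make the metric names in the table concrete, the following is a minimal sketch of how a word error rate and an F1 score are conventionally computed. It is not the benchmark's own implementation: the whitespace tokenization, the use of lowercasing to obtain the case-insensitive variant, and the function names `word_error_rate` and `f1_score` are all assumptions made for illustration. How Jam-ALT tokenizes lyrics and counts punctuation, line-break, or section-break matches is not described in this section. The functions return fractions, whereas the table appears to report percentages.

```python
from typing import Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Word-level edit distance divided by the number of reference words."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far (initially empty) and hypothesis[:j].
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]  # distance between reference[:i] and an empty hypothesis
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(reference)


def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall, written in terms of raw counts."""
    denominator = 2 * true_positives + false_positives + false_negatives
    if denominator == 0:
        return 0.0  # assumption: score an empty comparison as 0
    return 2 * true_positives / denominator


# Hypothetical example lines, tokenized on whitespace (an assumption).
reference = "I walk the line".split()
hypothesis = "i walk a line".split()

case_sensitive_wer = word_error_rate(reference, hypothesis)
case_insensitive_wer = word_error_rate(
    [w.lower() for w in reference],
    [w.lower() for w in hypothesis],
)
```

In this example the case-sensitive score counts the `I`/`i` mismatch as an error while the lowercased score does not, which matches the pattern in the table, where the case-sensitive word error rate is never lower than the plain WER for the same system.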