
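Abstract
Current automatic lyrics transcription (ALT) benchmarks focus exclusively on word content and ignore the finer nuances of written lyrics, including formatting and punctuation. This can create a mismatch with the creative work of musicians and songwriters, and with listeners' experience. Line breaks, for example, play an important role in conveying rhythm, emotional emphasis, rhyme, and high-level structure. To address this issue, we introduce Jam-ALT, a new lyrics transcription benchmark based on the JamendoLyrics dataset. Our contribution is twofold. First, we provide a complete revision of the transcripts, geared specifically towards ALT evaluation, following a newly created annotation guide that unifies music industry guidelines and covers punctuation, line breaks, spelling, background vocals, and non-word sounds. Second, we design a suite of evaluation metrics that, unlike the traditional word error rate, capture these phenomena. We hope the proposed benchmark will advance the ALT task, enable more precise and reliable evaluation of transcription systems, and improve the user experience in lyrics applications such as subtitle rendering for live captioning or karaoke.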
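To make the difference between word error rate and a formatting-aware metric concrete, here is a minimal sketch in Python. It is not the benchmark's official implementation: the tokenizer, the `<br>` marker, and the use of `difflib.SequenceMatcher` as the aligner are simplifying assumptions chosen for illustration, and the function names `tokenize`, `word_error_rate`, and `line_break_f1` are hypothetical.

```python
import re
from difflib import SequenceMatcher

LINE_BREAK = "<br>"  # hypothetical marker token for a line break


def tokenize(lyrics: str) -> list[str]:
    """Lowercase words (punctuation dropped) with a marker between lines."""
    lines = [ln for ln in lyrics.strip().splitlines() if ln.strip()]
    tokens: list[str] = []
    for i, line in enumerate(lines):
        tokens.extend(re.findall(r"[\w']+", line.lower()))
        if i < len(lines) - 1:
            tokens.append(LINE_BREAK)
    return tokens


def word_error_rate(ref: str, hyp: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    r = [t for t in tokenize(ref) if t != LINE_BREAK]
    h = [t for t in tokenize(hyp) if t != LINE_BREAK]
    if not r:
        raise ValueError("reference contains no words")
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (rw != hw)))  # substitution or match
        prev = cur
    return prev[-1] / len(r)


def line_break_f1(ref: str, hyp: str) -> float:
    """F1 over line-break tokens that the alignment matches (simplified)."""
    r, h = tokenize(ref), tokenize(hyp)
    matcher = SequenceMatcher(a=r, b=h, autojunk=False)
    # A line break is a true positive if it falls inside a matched block.
    tp = sum(
        1
        for block in matcher.get_matching_blocks()
        for tok in r[block.a:block.a + block.size]
        if tok == LINE_BREAK
    )
    if tp == 0:
        return 0.0
    precision = tp / h.count(LINE_BREAK)
    recall = tp / r.count(LINE_BREAK)
    return 2 * precision * recall / (precision + recall)


reference = "Hello darkness, my old friend\nI've come to talk with you again"
hypothesis = "hello darkness my old friend I've come\nto talk with you again"
print(word_error_rate(reference, hypothesis))  # words are all correct
print(line_break_f1(reference, hypothesis))    # the line break is misplaced
```

In this example the hypothesis gets every word right, so its word error rate is zero, yet its only line break is in the wrong place, which the line-break score penalizes. That is the kind of discrepancy a word-only metric cannot see.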
Benchmarks
| Benchmark | Method | Case Error Rate | Line break F1 | Parenthesis F1 | Punctuation F1 | Section break F1 | Word Error Rate (WER) |
|---|---|---|---|---|---|---|---|
| automatic-lyrics-transcription-on-jam-alt | Whisper v2 | 4.5 | 69.3 | - | 41.7 | 3.3 | 35.7 |
| automatic-lyrics-transcription-on-jam-alt | Whisper v2 +demucs | 5.3 | 61.2 | - | 28.0 | - | 44.0 |
| automatic-lyrics-transcription-on-jam-alt | Whisper v3 | 4.3 | 73.5 | - | 41.6 | 1.0 | 35.5 |
| automatic-lyrics-transcription-on-jam-alt | Whisper v3 +demucs | 3.8 | 65.7 | - | 29.0 | - | 47.9 |
| automatic-lyrics-transcription-on-jam-alt | AudioShake v1 | 3.4 | 82.3 | 29.4 | 50.5 | 72.1 | 26.0 |
| automatic-lyrics-transcription-on-jam-alt-1 | Whisper v3 +demucs | 4.1 | 66.8 | - | 23.3 | - | 43.0 |
| automatic-lyrics-transcription-on-jam-alt-1 | AudioShake v1 | 3.4 | 80.7 | 32.4 | 59.0 | 77.4 | 22.1 |
| automatic-lyrics-transcription-on-jam-alt-1 | LyricWhiz | 3.5 | 74.0 | - | 34.0 | 1.4 | 24.6 |
| automatic-lyrics-transcription-on-jam-alt-1 | Whisper v2 | 3.5 | 63.0 | - | 31.3 | 11.2 | 43.8 |
| automatic-lyrics-transcription-on-jam-alt-1 | Whisper v2 +demucs | 5.3 | 53.8 | - | 39.2 | - | 32.3 |
| automatic-lyrics-transcription-on-jam-alt-1 | Whisper v3 | 4.8 | 71.5 | - | 40.9 | 2.6 | 37.7 |
| automatic-lyrics-transcription-on-jam-alt-2 | Whisper v3 +demucs | 3.6 | 52.4 | - | 28.7 | - | 61.5 |
| automatic-lyrics-transcription-on-jam-alt-2 | AudioShake v1 | 4.1 | 82.7 | 38.0 | 47.8 | 69.6 | 22.5 |
| automatic-lyrics-transcription-on-jam-alt-2 | Whisper v2 +demucs | 7.1 | 56.4 | - | 17.2 | - | 38.8 |
| automatic-lyrics-transcription-on-jam-alt-2 | Whisper v2 | 6.5 | 71.7 | - | 50.0 | 3.1 | 25.7 |
| automatic-lyrics-transcription-on-jam-alt-2 | Whisper v3 | 5.0 | 73.7 | - | 41.9 | - | 28.6 |
| automatic-lyrics-transcription-on-jam-alt-3 | Whisper v2 | 5.3 | 69.9 | - | 38.7 | - | 45.4 |
| automatic-lyrics-transcription-on-jam-alt-3 | Whisper v2 +demucs | 5.9 | 67.5 | - | 30.2 | - | 65.2 |
| automatic-lyrics-transcription-on-jam-alt-3 | AudioShake v1 | 4.1 | 81.2 | 8.1 | 48.5 | 69.2 | 24.4 |
| automatic-lyrics-transcription-on-jam-alt-3 | Whisper v3 +demucs | 4.4 | 72.0 | - | 34.0 | - | 43.5 |
| automatic-lyrics-transcription-on-jam-alt-3 | Whisper v3 | 4.0 | 71.2 | - | 41.2 | 1.2 | 40.7 |
| automatic-lyrics-transcription-on-jam-alt-4 | Whisper v2 +demucs | 3.2 | 66.1 | - | 34.9 | - | 43.3 |
| automatic-lyrics-transcription-on-jam-alt-4 | AudioShake v1 | 2.0 | 84.9 | 41.3 | 45.8 | 72.5 | 34.9 |
| automatic-lyrics-transcription-on-jam-alt-4 | Whisper v3 +demucs | 3.2 | 69.4 | - | 30.9 | - | 44.9 |
| automatic-lyrics-transcription-on-jam-alt-4 | Whisper v2 | 3.2 | 73.4 | - | 45.8 | 1.4 | 27.7 |
| automatic-lyrics-transcription-on-jam-alt-4 | Whisper v3 | 3.3 | 77.8 | - | 42.4 | - | 34.7 |