
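Abstract
Contrastive self-supervised learning has gained attention for its ability to produce high-quality representations from large unlabelled datasets. A key reason these powerful features enable data-efficient learning of downstream tasks is that they provide augmentation invariance, which is often a useful inductive bias. However, the amount and type of invariance preferred is not known a priori and varies across downstream tasks. We therefore propose a multi-task self-supervised framework (MT-SLVR) that learns both variant and invariant features in a parameter-efficient manner. Our multi-task representation provides a strong and flexible feature set that benefits diverse downstream tasks. We evaluate our approach on few-shot classification tasks drawn from a range of audio domains and demonstrate improved classification performance on all of them.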
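
To make the idea of learning invariant and variant features together more concrete, here is a minimal sketch of a combined objective. It pairs a SimCLR-style contrastive term (which pulls two augmented views of the same clip together, encouraging invariance) with a multi-label augmentation prediction term (which asks the model to recover which augmentations were applied, encouraging sensitivity to them). This is an illustration under stated assumptions, not the authors' implementation: the function names `nt_xent`, `multilabel_bce`, and `multitask_loss`, the temperature of 0.5, and the equal weighting of the two terms are all hypothetical, and the parallel adapters listed in the benchmark table are omitted entirely.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nt_xent(z1, z2, temperature=0.5):
    """SimCLR-style contrastive loss; z1[i] and z2[i] are two views of clip i."""
    z = z1 + z2  # 2N embeddings; index i and index i + N form a positive pair
    n = len(z1)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)
        # Similarities to every other embedding (the positive is included).
        logits = [cosine(z[i], z[j]) / temperature
                  for j in range(2 * n) if j != i]
        pos_logit = cosine(z[i], z[pos]) / temperature
        total += math.log(sum(math.exp(l) for l in logits)) - pos_logit
    return total / (2 * n)


def multilabel_bce(logits, targets):
    """Numerically stable binary cross-entropy on raw logits, one label per augmentation."""
    total, count = 0.0, 0
    for row_logits, row_targets in zip(logits, targets):
        for x, t in zip(row_logits, row_targets):
            total += max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
            count += 1
    return total / count


def multitask_loss(z1, z2, aug_logits, aug_targets, weight=1.0):
    """Assumed combination: contrastive term plus weighted augmentation-prediction term."""
    return nt_xent(z1, z2) + weight * multilabel_bce(aug_logits, aug_targets)
```

With toy two-dimensional embeddings, correctly matched view pairs yield a lower contrastive term than mismatched pairs, while the augmentation prediction term depends only on how well the applied augmentations are recovered. How the two terms are actually balanced, and how parameters are shared between them, is not specified in the abstract.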
Code repository
cheggan/mt-slvr (official, PyTorch; mentioned in GitHub)
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| few-shot-audio-classification-on | MT-SLVR (SimCLR + MLAP) w/ Parallel Adapters (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 39.11±0.41 |
| few-shot-audio-classification-on | SimCLR (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 37.64±0.40 |
| few-shot-audio-classification-on | Multi-Label Augmentation Prediction (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 21.72±0.34 |
| few-shot-audio-classification-on-birdclef | MT-SLVR (SimCLR + MLAP) w/ Parallel Adapters (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 29.49±0.38 |
| few-shot-audio-classification-on-birdclef | SimCLR (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 30.93±0.38 |
| few-shot-audio-classification-on-birdclef | Multi-Label Augmentation Prediction (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 21.04±0.35 |
| few-shot-audio-classification-on-common-voice | Multi-Label Augmentation Prediction (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 23.00±0.42 |
| few-shot-audio-classification-on-common-voice | MT-SLVR (SimCLR + MLAP) w/ Parallel Adapters (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 35.22±0.40 |
| few-shot-audio-classification-on-common-voice | SimCLR (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 33.33±0.38 |
| few-shot-audio-classification-on-crema-d | Multi-Label Augmentation Prediction (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 21.68±0.33 |
| few-shot-audio-classification-on-crema-d | MT-SLVR (SimCLR + MLAP) w/ Parallel Adapters (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 29.61±0.38 |
| few-shot-audio-classification-on-crema-d | SimCLR (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 29.10±0.36 |
| few-shot-audio-classification-on-esc-50 | SimCLR (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 63.40±0.39 |
| few-shot-audio-classification-on-esc-50 | MT-SLVR (SimCLR + MLAP) w/ Parallel Adapters (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 69.53±0.39 |
| few-shot-audio-classification-on-esc-50 | Multi-Label Augmentation Prediction (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 37.76±0.34 |
| few-shot-audio-classification-on-nsynth | MT-SLVR (SimCLR + MLAP) w/ Parallel Adapters (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 71.81±0.39 |
| few-shot-audio-classification-on-nsynth | SimCLR (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 66.44±0.40 |
| few-shot-audio-classification-on-nsynth | Multi-Label Augmentation Prediction (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 62.52±0.36 |
| few-shot-audio-classification-on-speech | Multi-Label Augmentation Prediction (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 20.08±0.37 |
| few-shot-audio-classification-on-speech | SimCLR (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 25.68±0.35 |
| few-shot-audio-classification-on-speech | MT-SLVR (SimCLR + MLAP) w/ Parallel Adapters (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 23.65±0.34 |
| few-shot-audio-classification-on-speech-1 | MT-SLVR (SimCLR + MLAP) w/ Parallel Adapters (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 28.92±0.37 |
| few-shot-audio-classification-on-speech-1 | Multi-Label Augmentation Prediction (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 23.08±0.34 |
| few-shot-audio-classification-on-speech-1 | SimCLR (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 26.16±0.34 |
| few-shot-audio-classification-on-voxceleb1 | SimCLR (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 31.18±0.37 |
| few-shot-audio-classification-on-voxceleb1 | MT-SLVR (SimCLR + MLAP) w/ Parallel Adapters (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 33.58±0.39 |
| few-shot-audio-classification-on-voxceleb1 | Multi-Label Augmentation Prediction (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 21.68±0.40 |
| few-shot-audio-classification-on-watkins | Multi-Label Augmentation Prediction (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 28.88±0.39 |
| few-shot-audio-classification-on-watkins | MT-SLVR (SimCLR + MLAP) w/ Parallel Adapters (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 59.49±0.42 |
| few-shot-audio-classification-on-watkins | SimCLR (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 52.91±0.41 |
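
Because the rows for each benchmark appear in no fixed order, a side-by-side comparison is easier to read when computed. The snippet below copies the 5-way 1-shot Top-1 accuracies for the MT-SLVR and SimCLR rows from the table and reports the difference per benchmark. The variable names are illustrative, the benchmark identifiers are kept exactly as they appear in the table (including the first one, which carries no dataset suffix), and the ± terms are ignored, so this is simple arithmetic and not a significance test.

```python
# (MT-SLVR, SimCLR) Top-1 accuracy, 5-way 1-shot, copied from the table above.
results = {
    "few-shot-audio-classification-on": (39.11, 37.64),
    "few-shot-audio-classification-on-birdclef": (29.49, 30.93),
    "few-shot-audio-classification-on-common-voice": (35.22, 33.33),
    "few-shot-audio-classification-on-crema-d": (29.61, 29.10),
    "few-shot-audio-classification-on-esc-50": (69.53, 63.40),
    "few-shot-audio-classification-on-nsynth": (71.81, 66.44),
    "few-shot-audio-classification-on-speech": (23.65, 25.68),
    "few-shot-audio-classification-on-speech-1": (28.92, 26.16),
    "few-shot-audio-classification-on-voxceleb1": (33.58, 31.18),
    "few-shot-audio-classification-on-watkins": (59.49, 52.91),
}

# Difference in percentage points: positive means the MT-SLVR row is higher.
deltas = {name: round(mt - sim, 2) for name, (mt, sim) in results.items()}

higher = [name for name, d in deltas.items() if d > 0]
lower = [name for name, d in deltas.items() if d < 0]
mean_delta = round(sum(deltas.values()) / len(deltas), 2)

# Print benchmarks from largest gain to largest drop.
for name, d in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:48s} {d:+.2f}")
print(f"higher on {len(higher)} of {len(deltas)} benchmarks, mean difference {mean_delta:+.2f}")
```

For the configuration listed here (SimCLR + MLAP with parallel adapters), the MT-SLVR row is higher than the SimCLR row on eight of the ten benchmarks and lower on the `birdclef` and `speech` benchmarks.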