
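Abstract
Currently available benchmarks for few-shot learning (machine learning with few training examples) are limited in the domains they cover, focusing mainly on image classification. This work aims to reduce that reliance on image-based benchmarks by offering the first comprehensive, public, and fully reproducible audio-based alternative, covering a variety of sound domains and experimental settings. We compare the few-shot classification performance of a variety of techniques on seven audio datasets, spanning environmental sounds to human speech. Building on this, we conduct an in-depth analysis of joint training (where all datasets are used during training) and cross-dataset adaptation protocols, establishing the possibility of a generalised audio few-shot classification algorithm. Our experiments show that gradient-based meta-learning methods such as MAML and Meta-Curvature consistently outperform both metric-based and baseline methods. We also demonstrate that the joint training routine helps overall generalisation for the environmental sound databases included, and is a somewhat effective way of tackling the cross-dataset/domain setting.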
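To make the evaluation setting concrete, here is a minimal sketch of how a 5-way 1-shot accuracy figure can be produced for a metric-style method. It is not the benchmark's own code. It assumes fixed feature vectors (toy Gaussian blobs stand in for encoder outputs), applies a SimpleShot-style CL2N transform (centre on the base-class mean, then L2-normalise), classifies queries by nearest centroid, and reports the mean accuracy with a 95% confidence interval over episodes. The interval convention, the episode count, the query count, and every function name below are illustrative assumptions.

```python
import math
import random
import statistics


def make_toy_features(n_classes, per_class, dim, spread, seed):
    """Hypothetical stand-in for encoder outputs: one Gaussian blob per class."""
    rng = random.Random(seed)
    data = {}
    for cls in range(n_classes):
        center = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        data[cls] = [[c + rng.gauss(0.0, spread) for c in center]
                     for _ in range(per_class)]
    return data


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def cl2n(vec, base_mean):
    """Centre on the base-class mean, then L2-normalise (CL2N)."""
    centered = [v - m for v, m in zip(vec, base_mean)]
    norm = math.sqrt(sum(c * c for c in centered)) or 1.0
    return [c / norm for c in centered]


def sample_episode(features_by_class, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot task as (support, query) lists of (vector, label)."""
    classes = rng.sample(sorted(features_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        picks = rng.sample(features_by_class[cls], k_shot + n_query)
        support += [(v, label) for v in picks[:k_shot]]
        query += [(v, label) for v in picks[k_shot:]]
    return support, query


def nearest_centroid_accuracy(support, query, n_way):
    """Classify each query by squared Euclidean distance to class centroids."""
    centroids = [mean_vector([v for v, y in support if y == label])
                 for label in range(n_way)]
    correct = 0
    for vec, label in query:
        dists = [sum((a - b) ** 2 for a, b in zip(vec, c)) for c in centroids]
        correct += int(dists.index(min(dists)) == label)
    return correct / len(query)


def evaluate(features_by_class, base_mean, n_way=5, k_shot=1, n_query=5,
             episodes=200, seed=0):
    """Return (mean accuracy %, 95% CI half-width %) over sampled episodes."""
    rng = random.Random(seed)
    normed = {c: [cl2n(v, base_mean) for v in vs]
              for c, vs in features_by_class.items()}
    accs = []
    for _ in range(episodes):
        support, query = sample_episode(normed, n_way, k_shot, n_query, rng)
        accs.append(nearest_centroid_accuracy(support, query, n_way))
    mean = statistics.mean(accs)
    ci95 = 1.96 * statistics.stdev(accs) / math.sqrt(len(accs))
    return 100 * mean, 100 * ci95


# Toy usage: classes 0-9 play the role of base (training) classes,
# classes 10-19 the role of novel (test) classes.
toy = make_toy_features(n_classes=20, per_class=10, dim=16, spread=0.1, seed=1)
base = {c: vs for c, vs in toy.items() if c < 10}
novel = {c: vs for c, vs in toy.items() if c >= 10}
base_mean = mean_vector([v for vs in base.values() for v in vs])
acc, ci = evaluate(novel, base_mean)
print(f"5-way 1-shot accuracy: {acc:.2f} +- {ci:.2f}")
```

Gradient-based methods such as MAML differ in that they adapt the network's weights on the support set of each episode instead of comparing fixed embeddings, but the episode sampling and the reporting of a mean with an interval would look much the same.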
Code Repositories
cheggan/metaaudio-a-few-shot-audio-classification-benchmark
Official
pytorch
Mentioned in GitHub
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| few-shot-audio-classification-on | Meta-Baseline (CRNN) | Top-1 Accuracy(5-Way-1-Shot): 40.27 +- 0.44 |
| few-shot-audio-classification-on | MAML (CRNN) | Top-1 Accuracy(5-Way-1-Shot): 43.45 +- 0.46 |
| few-shot-audio-classification-on | SimpleShot CL2N (AST ImageNet & AudioSet- No fine-tune) | Top-1 Accuracy(5-Way-1-Shot): 38.78 +- 0.41 |
| few-shot-audio-classification-on | Prototypical Networks (CRNN) | Top-1 Accuracy(5-Way-1-Shot): 39.44 +- 0.44 |
| few-shot-audio-classification-on | Meta-Curvature (CRNN) | Top-1 Accuracy(5-Way-1-Shot): 43.18 +- 0.45 |
| few-shot-audio-classification-on | SimpleShot CL2N (AST ImageNet - No fine-tune) | Top-1 Accuracy(5-Way-1-Shot): 33.52 +- 0.39 |
| few-shot-audio-classification-on | SimpleShot CL2N (CRNN) | Top-1 Accuracy(5-Way-1-Shot): 42.05 +- 0.42 |
| few-shot-audio-classification-on-birdclef | Prototypical Networks (CRNN) | Top-1 Accuracy(5-Way-1-Shot): 56.11 +- 0.46 |
| few-shot-audio-classification-on-birdclef | SimpleShot CL2N (AST ImageNet & AudioSet- No fine-tune) | Top-1 Accuracy(5-Way-1-Shot): 36.41 +- 0.42 |
| few-shot-audio-classification-on-birdclef | SimpleShot CL2N (CRNN) | Top-1 Accuracy(5-Way-1-Shot): 57.66 +- 0.43 |
| few-shot-audio-classification-on-birdclef | SimpleShot CL2N (AST ImageNet - No fine-tune) | Top-1 Accuracy(5-Way-1-Shot): 33.04 +- 0.41 |
| few-shot-audio-classification-on-birdclef | MAML (CRNN) | Top-1 Accuracy(5-Way-1-Shot): 56.26 +- 0.45 |
| few-shot-audio-classification-on-birdclef | Meta-Curvature (CRNN) | Top-1 Accuracy(5-Way-1-Shot): 61.34 +- 0.46 |
| few-shot-audio-classification-on-birdclef | Meta-Baseline (CRNN) | Top-1 Accuracy(5-Way-1-Shot): 57.28 +- 0.41 |
| few-shot-audio-classification-on-esc-50 | SimpleShot CL2N (AST ImageNet - No fine-tune) | Top-1 Accuracy(5-Way-1-Shot): 60.41 +- 0.41 |
| few-shot-audio-classification-on-esc-50 | Prototypical Networks (CRNN) | Top-1 Accuracy(5-Way-1-Shot): 68.83 +- 0.38 |
| few-shot-audio-classification-on-esc-50 | Meta-Curvature (CRNN) | Top-1 Accuracy(5-Way-1-Shot): 76.17 +- 0.41 |
| few-shot-audio-classification-on-esc-50 | SimpleShot CL2N (AST ImageNet & AudioSet- No fine-tune) | Top-1 Accuracy(5-Way-1-Shot): 64.48 +- 0.41 |
| few-shot-audio-classification-on-esc-50 | MAML (CRNN) | Top-1 Accuracy(5-Way-1-Shot): 74.66 +- 0.42 |
| few-shot-audio-classification-on-esc-50 | SimpleShot CL2N (CRNN) | Top-1 Accuracy(5-Way-1-Shot): 68.82 +- 0.39 |
| few-shot-audio-classification-on-esc-50 | Meta-Baseline (CRNN) | Top-1 Accuracy(5-Way-1-Shot): 71.72 +- 0.38 |
| few-shot-audio-classification-on-nsynth | MAML (CRNN) | Top-1 Accuracy(5-Way-1-Shot): 93.85 +- 0.24 |
| few-shot-audio-classification-on-nsynth | SimpleShot CL2N Classifier (AST pre-trained w/ ImageNet - No fine-tune) | Top-1 Accuracy(5-Way-1-Shot): 66.68 +- 0.41 |
| few-shot-audio-classification-on-nsynth | Meta-Baseline (CRNN) | Top-1 Accuracy(5-Way-1-Shot): 90.74 +- 0.25 |
| few-shot-audio-classification-on-nsynth | SimpleShot CL2N Classifier (AST ImageNet & AudioSet - No fine-tune) | Top-1 Accuracy(5-Way-1-Shot): 63.78 +- 0.42 |
| few-shot-audio-classification-on-nsynth | SimpleShot CL2N (CRNN) | Top-1 Accuracy(5-Way-1-Shot): 90.04 +- 0.27 |
| few-shot-audio-classification-on-nsynth | Meta-Curvature (CRNN) | Top-1 Accuracy(5-Way-1-Shot): 96.47 +- 0.19 |
| few-shot-audio-classification-on-nsynth | Prototypical Networks (CRNN) | Top-1 Accuracy(5-Way-1-Shot): 95.23 +- 0.19 |
| few-shot-audio-classification-on-voxceleb1 | Meta-Curvature (CRNN) | Top-1 Accuracy(5-Way-1-Shot): 63.85 +- 0.44 |
| few-shot-audio-classification-on-voxceleb1 | SimpleShot CL2N (AST ImageNet - No fine-tune) | Top-1 Accuracy(5-Way-1-Shot): 28.09 +- 0.37 |
| few-shot-audio-classification-on-voxceleb1 | Prototypical Networks (CRNN) | Top-1 Accuracy(5-Way-1-Shot): 59.64 +- 0.44 |
| few-shot-audio-classification-on-voxceleb1 | Meta-Baseline (CRNN) | Top-1 Accuracy(5-Way-1-Shot): 55.54 +- 0.42 |
| few-shot-audio-classification-on-voxceleb1 | MAML (CRNN) | Top-1 Accuracy(5-Way-1-Shot): 60.89 +- 0.45 |
| few-shot-audio-classification-on-voxceleb1 | SimpleShot CL2N (CRNN) | Top-1 Accuracy(5-Way-1-Shot): 48.50 +- 0.42 |
| few-shot-audio-classification-on-voxceleb1 | SimpleShot CL2N (AST ImageNet & AudioSet- No fine-tune) | Top-1 Accuracy(5-Way-1-Shot): 28.79 +- 0.38 |
| few-shot-audio-classification-on-watkins | SimpleShot CL2N (AST ImageNet & AudioSet- No fine-tune) | Top-1 Accuracy(5-Way-1-Shot): 51.81 +- 0.42 |
| few-shot-audio-classification-on-watkins | SimpleShot CL2N (AST ImageNet - No fine-tune) | Top-1 Accuracy(5-Way-1-Shot): 55.40 +- 0.42 |