
Abstract
In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities. We then propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), which combines contrastive learning and masked data modeling, two major self-supervised learning frameworks, to learn a joint and coordinated audio-visual representation. Our experiments show that the contrastive audio-visual correspondence learning objective not only enables the model to perform audio-visual retrieval tasks, but also helps the model learn a better joint representation. As a result, our fully self-supervised pretrained CAV-MAE achieves a new state-of-the-art (SOTA) accuracy of 65.9% on VGGSound, and is comparable with the previous best supervised pretrained model on audio-visual event classification on AudioSet. Code and pretrained models are available at https://github.com/yuangongnd/cav-mae.
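To make the combined objective concrete, below is a minimal PyTorch sketch of a contrastive-plus-masked-reconstruction loss of the kind the abstract describes. The symmetric InfoNCE formulation, the tensor shapes, and the weighting factor `lam` are illustrative assumptions, not the authors' exact implementation; see the official repository for the real training code.

```python
# Minimal sketch of a CAV-MAE-style joint objective (assumed form, not the
# official implementation): a contrastive term that pairs audio and visual
# clip embeddings, plus an MAE reconstruction term on masked patches.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, visual_emb, temperature=0.05):
    """Symmetric InfoNCE over a batch of paired audio/visual embeddings.

    audio_emb, visual_emb: (B, D) pooled clip-level embeddings.
    """
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature                      # (B, B) similarities
    targets = torch.arange(a.size(0), device=a.device)    # diagonal = true pairs
    # Each audio clip should match its own video frames, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def mae_loss(pred_patches, target_patches, mask):
    """MSE reconstruction, averaged only over the masked patches.

    pred_patches, target_patches: (B, N, P) patch pixels/spectrogram bins.
    mask: (B, N) float tensor, 1 where a patch was masked.
    """
    per_patch = ((pred_patches - target_patches) ** 2).mean(dim=-1)  # (B, N)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)

def cav_mae_objective(audio_emb, visual_emb, pred, target, mask, lam=1.0):
    """Joint loss: contrastive correspondence + masked data modeling.

    `lam` is a hypothetical weighting hyperparameter between the two terms.
    """
    return contrastive_loss(audio_emb, visual_emb) + lam * mae_loss(pred, target, mask)
```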
Code Repositories
yuangongnd/cav-mae
Official
pytorch
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| audio-classification-on-audioset | CAV-MAE (Audio-Visual) | Test mAP: 0.512 |
| audio-classification-on-audioset | CAV-MAE (Audio-Only) | Test mAP: 0.466 |
| audio-classification-on-audioset | CAV-MAE (Visual-Only) | Test mAP: 0.262 |
| audio-classification-on-vggsound | CAV-MAE (Audio-Visual) | Top-1 Accuracy: 65.9 |
| audio-classification-on-vggsound | CAV-MAE (Audio-Only) | Top-1 Accuracy: 59.5 |
| audio-tagging-on-audioset | CAV-MAE (Audio-Visual) | mean average precision: 0.512 |
| audio-tagging-on-audioset | CAV-MAE (Audio-Only) | mean average precision: 0.466 |
| multi-modal-classification-on-audioset | CAV-MAE | Average mAP: 0.512 |
| multi-modal-classification-on-vgg-sound | CAV-MAE (Audio-Visual) | Top-1 Accuracy: 65.9 |
| sound-prompted-semantic-segmentation-on | CAVMAE | mAP: 26.0 mIoU: 17.0 |
| speech-prompted-semantic-segmentation-on | CAVMAE | mAP: 27.2 mIoU: 19.9 |