
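Abstract
We present a self-supervised approach for learning audio-visual representations from video and audio. The method uses contrastive learning for cross-modal discrimination: identifying a video from its audio, and vice versa. We show that optimizing for cross-modal discrimination, rather than within-modal discrimination, is essential for learning high-quality representations from video and audio. Building on this simple but powerful insight, our method achieves highly competitive performance when fine-tuned on action recognition tasks. Furthermore, whereas recent work on contrastive learning typically defines positive and negative samples as individual instances, we generalize this definition by exploring cross-modal agreement. We group multiple instances together as positives by measuring their similarity in both the video and audio feature spaces. Cross-modal agreement produces better positive and negative sets, which lets us calibrate visual similarities by seeking within-modal discrimination of positive instances, and yields significant gains on downstream tasks.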
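The two ideas in the abstract, cross-modal discrimination and cross-modal agreement, can be illustrated with a small pure-Python sketch. This is not the authors' implementation (the official code lives in facebookresearch/AVID-CMA). It assumes an InfoNCE-style softmax loss over in-batch negatives, a temperature of 0.07, and it defines "agreement" as the minimum of the video-space and audio-space cosine similarities; the abstract states none of these details. The function names `cross_modal_info_nce` and `agreement_positives` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_modal_info_nce(video, audio, temperature=0.07):
    """Cross-modal contrastive loss (assumed InfoNCE form, in-batch negatives).

    video[i] and audio[i] are embeddings of the same clip. Each video embedding
    must pick out its own audio among all audios, and vice versa.
    """
    v = [l2_normalize(x) for x in video]
    a = [l2_normalize(x) for x in audio]
    n = len(v)

    def one_direction(queries, keys):
        total = 0.0
        for i in range(n):
            logits = [dot(queries[i], key) / temperature for key in keys]
            peak = max(logits)  # subtract the max for a stable log-sum-exp
            log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
            total += log_denominator - logits[i]  # -log softmax of the true pair
        return total / n

    return 0.5 * (one_direction(v, a) + one_direction(a, v))


def agreement_positives(video, audio, anchor, k):
    """Return the k instances most similar to `anchor` in BOTH feature spaces.

    Assumption: agreement = min(video similarity, audio similarity), so an
    instance only ranks highly if it is close in both modalities.
    """
    v = [l2_normalize(x) for x in video]
    a = [l2_normalize(x) for x in audio]
    scored = []
    for j in range(len(v)):
        if j == anchor:
            continue
        agreement = min(dot(v[anchor], v[j]), dot(a[anchor], a[j]))
        scored.append((agreement, j))
    scored.sort(reverse=True)
    return [j for _, j in scored[:k]]


# Toy data: instance 1 resembles instance 0 in both modalities,
# instance 2 resembles it only visually, instance 3 in neither.
toy_video = [[1.0, 0.0], [0.9, 0.1], [0.9, 0.1], [0.0, 1.0]]
toy_audio = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.0, 1.0]]
print(agreement_positives(toy_video, toy_audio, anchor=0, k=1))  # [1]
```

In the toy data, instance 2 looks like the anchor but sounds different, so requiring agreement in both spaces excludes it from the positive set. A video-only similarity measure would have admitted it.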
Code Repository
facebookresearch/AVID-CMA (official, PyTorch, mentioned in GitHub)
Benchmarks
| Benchmark | Method | Pre-training dataset | Frozen | Metric | Score |
|---|---|---|---|---|---|
| audio-classification-on-esc-50 | AVID | n/a | n/a | Top-1 Accuracy | 89.2 |
| self-supervised-action-recognition-on-hmdb51 | AVID (Modified R2+1D-18 on Kinetics) | Kinetics400 (Video+Audio) | false | Top-1 Accuracy | 59.9 |
| self-supervised-action-recognition-on-hmdb51 | AVID+CMA (Modified R2+1D-18 on Kinetics) | Kinetics400 (Video+Audio) | false | Top-1 Accuracy | 60.8 |
| self-supervised-action-recognition-on-hmdb51 | AVID+CMA (Modified R2+1D-18 on Audioset) | Audioset (Video+Audio) | false | Top-1 Accuracy | 64.7 |
| self-supervised-action-recognition-on-hmdb51 | AVID (Modified R2+1D-18 on Audioset) | Audioset (Video+Audio) | false | Top-1 Accuracy | 64.1 |
| self-supervised-action-recognition-on-hmdb51-1 | AVID | n/a | n/a | Top-1 Accuracy | 64.7 |
| self-supervised-action-recognition-on-ucf101 | AVID (Modified R2+1D-18 on Audioset) | Audioset (Audio+Video) | false | 3-fold Accuracy | 91.0 |
| self-supervised-action-recognition-on-ucf101 | AVID+CMA (Modified R2+1D-18 on Kinetics) | Kinetics400 (Audio+Video) | false | 3-fold Accuracy | 87.5 |
| self-supervised-action-recognition-on-ucf101 | AVID+CMA (Modified R2+1D-18 on Audioset) | Audioset (Audio+Video) | false | 3-fold Accuracy | 91.5 |
| self-supervised-action-recognition-on-ucf101 | AVID (Modified R2+1D-18 on Kinetics) | Kinetics400 (Audio+Video) | false | 3-fold Accuracy | 86.9 |
| self-supervised-action-recognition-on-ucf101-1 | AVID | n/a | n/a | 3-fold Accuracy | 91.5 |
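
As a quick reading aid, the gain from adding CMA can be computed directly from the paired AVID and AVID+CMA rows in the table above. The snippet below only restates those scores; the dictionary layout and the helper name `cma_gains` are illustrative assumptions, not part of any released tooling.

```python
# Fine-tuned (Frozen: false) scores copied from the benchmark table.
scores = {
    ("hmdb51", "Kinetics400"): {"AVID": 59.9, "AVID+CMA": 60.8},
    ("hmdb51", "Audioset"): {"AVID": 64.1, "AVID+CMA": 64.7},
    ("ucf101", "Kinetics400"): {"AVID": 86.9, "AVID+CMA": 87.5},
    ("ucf101", "Audioset"): {"AVID": 91.0, "AVID+CMA": 91.5},
}


def cma_gains(table):
    """Accuracy difference (AVID+CMA minus AVID), rounded to one decimal place."""
    return {
        key: round(row["AVID+CMA"] - row["AVID"], 1)
        for key, row in table.items()
    }


gains = cma_gains(scores)
for (benchmark, dataset), gain in gains.items():
    print(f"{benchmark} / {dataset}: +{gain:.1f} points")
```

With these numbers the gain ranges from 0.5 to 0.9 accuracy points, and in each pair the Audioset pre-trained model scores higher than the Kinetics400 one.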