
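Abstract
We present a self-supervised Contrastive Video Representation Learning (CVRL) method for learning spatiotemporal visual representations from unlabeled videos. The representations are learned with a contrastive loss that pulls two augmented clips from the same short video together in the embedding space and pushes clips from different videos apart. We study which data augmentations are effective for video self-supervised learning and find that both spatial and temporal information are crucial. We therefore carefully design data augmentations that involve both spatial and temporal cues. Specifically, we propose a temporally consistent spatial augmentation that applies strong spatial augmentation to every frame of a video while maintaining temporal consistency across frames. We also propose a sampling-based temporal augmentation that avoids over-enforcing invariance on clips that are far apart in time. On Kinetics-600, a linear classifier trained on the representations learned by CVRL achieves 70.4% top-1 accuracy with a 3D-ResNet-50 (R3D-50) backbone, outperforming ImageNet supervised pre-training by 15.7% and SimCLR unsupervised pre-training by 18.8% when using the same inflated R3D-50. With a larger R3D-152 backbone (2x filters), the performance of CVRL improves further to 72.9%, significantly narrowing the gap between unsupervised and supervised video representation learning. Our code and models will be available at https://github.com/tensorflow/models/tree/master/official/.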
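The three ingredients named in the abstract (a contrastive loss over clip pairs, a spatial augmentation shared by all frames of a clip, and a sampled temporal gap between the two clips) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the `temperature` value, the use of a random crop as the spatial augmentation, and the linearly decreasing gap distribution are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def info_nce_loss(z_a, z_b, temperature=0.1):
    """InfoNCE-style contrastive loss over N paired clip embeddings.

    z_a[i] and z_b[i] are embeddings of two clips from the same video
    (a positive pair); all other rows in the batch act as negatives.
    The temperature value is an assumed hyperparameter.
    """
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0).astype(float)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)      # cosine similarity
    logits = z @ z.T / temperature
    np.fill_diagonal(logits, -np.inf)                      # drop self-similarity
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Row i (view a) is positive with row i + n (view b), and vice versa.
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    return -log_prob[np.arange(2 * n), positives].mean()


def temporally_consistent_crop(clip, crop_size, rng):
    """Random crop with offsets drawn once per clip and reused for every frame.

    clip has shape (frames, height, width, channels). A crop stands in here
    for the stronger spatial augmentations mentioned in the abstract.
    """
    _, height, width, _ = clip.shape
    top = rng.integers(0, height - crop_size + 1)
    left = rng.integers(0, width - crop_size + 1)
    return clip[:, top:top + crop_size, left:left + crop_size, :]


def sample_clip_starts(num_frames, clip_len, rng):
    """Sample start frames of two clips, favoring small temporal gaps.

    Assumption: the probability of a gap decreases linearly with its size,
    so distant clip pairs are drawn less often than nearby ones.
    """
    max_gap = num_frames - clip_len
    gaps = np.arange(max_gap + 1)
    weights = (max_gap + 1 - gaps).astype(float)
    gap = int(rng.choice(gaps, p=weights / weights.sum()))
    first = int(rng.integers(0, max_gap - gap + 1))
    return first, first + gap


rng = np.random.default_rng(0)
video = rng.random((64, 32, 32, 3))                 # hypothetical 64-frame video
start_a, start_b = sample_clip_starts(64, 16, rng)
clip_a = temporally_consistent_crop(video[start_a:start_a + 16], 24, rng)
clip_b = temporally_consistent_crop(video[start_b:start_b + 16], 24, rng)
```

Drawing the crop offsets once per clip is what keeps the augmentation consistent across frames; drawing them per frame would add artificial motion that is not in the video. In a full pipeline the two clips would pass through a video encoder, and `info_nce_loss` would be applied to the resulting batch of embeddings.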
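Code Repositories
| Repository | Framework | Source |
|---|---|---|
| applecrumble123/CVLR_pytorch | pytorch | Mentioned in GitHub |
| ed-fish/spatio-temporal-contrastive-video | pytorch | Mentioned in GitHub |
| ed-fish/spatio-temporal-contrastive-film | pytorch | Mentioned in GitHub |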
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| self-supervised-action-recognition-on | CVRL (R3D-50) | Top-1 Accuracy: 70.4 |
| self-supervised-action-recognition-on | CVRL (R3D-101) | Top-1 Accuracy: 71.6 |
| self-supervised-action-recognition-on | CVRL (R3D-152 2x) | Top-1 Accuracy: 72.9 |
| self-supervised-action-recognition-on-1 | CVRL (R3D-101) | Top-1 accuracy %: 67.6 |
| self-supervised-action-recognition-on-1 | CVRL (R3D-152 2x; K600 pretrain) | Top-1 accuracy %: 71.6 |
| self-supervised-action-recognition-on-1 | CVRL (R3D-50) | Top-1 accuracy %: 66.1 |
| self-supervised-action-recognition-on-hmdb51 | CVRL (R3D-152 2x; K600) | Frozen: false; Pre-Training Dataset: Kinetics600; Top-1 Accuracy: 69.9 |
| self-supervised-action-recognition-on-hmdb51 | CVRL (R3D-50; K400) | Frozen: false; Pre-Training Dataset: Kinetics400; Top-1 Accuracy: 66.7 |
| self-supervised-action-recognition-on-hmdb51 | CVRL (R3D-50; K600) | Frozen: false; Pre-Training Dataset: Kinetics600; Top-1 Accuracy: 68.0 |
| self-supervised-action-recognition-on-hmdb51-1 | CVRL (R3D-152 2x; K600) | Pretraining Dataset: K600; Top-1 Accuracy: 69.9 |
| self-supervised-action-recognition-on-hmdb51-1 | CVRL (R3D-50; K600) | Pretraining Dataset: K600; Top-1 Accuracy: 68.0 |
| self-supervised-action-recognition-on-hmdb51-1 | CVRL (R3D-50; K400) | Pretraining Dataset: K400; Top-1 Accuracy: 66.7 |
| self-supervised-action-recognition-on-ucf101 | CVRL (R3D-50; K400) | 3-fold Accuracy: 92.2; Frozen: false; Pre-Training Dataset: Kinetics400 |
| self-supervised-action-recognition-on-ucf101 | CVRL (R3D-50; K600) | 3-fold Accuracy: 93.4; Frozen: false; Pre-Training Dataset: Kinetics600 |
| self-supervised-action-recognition-on-ucf101 | CVRL (R3D-152 2x; K600) | 3-fold Accuracy: 93.9; Frozen: false; Pre-Training Dataset: Kinetics600 |
| self-supervised-action-recognition-on-ucf101-1 | CVRL (R3D-50; K400) | 3-fold Accuracy: 92.2; Pretrain: K400 |
| self-supervised-action-recognition-on-ucf101-1 | CVRL (R3D-50; K600) | 3-fold Accuracy: 93.4; Pretrain: K600 |
| self-supervised-action-recognition-on-ucf101-1 | CVRL (R3D-152 2x; K600) | 3-fold Accuracy: 93.9; Pretrain: K600 |