
Abstract
In action recognition, models are typically pretrained on an image classification dataset such as ImageNet and then fine-tuned on video data for the target action recognition task. This recipe has worked well empirically for recent Transformer-based video architectures. While much recent work has focused on designing more advanced Transformer architectures for action recognition, comparatively little attention has been paid to how video Transformers should be trained. This paper systematically explores several training paradigms and reports two key findings. First, video Transformers benefit substantially from joint training on diverse video datasets and label spaces (e.g., Kinetics is appearance-focused, while SomethingSomething is motion-focused). Second, by further co-training with images (treated as single-frame videos), video Transformers learn even better video representations. We call this approach Co-training Videos and Images for Action Recognition (CoVeR). Concretely, with a TimeSformer architecture pretrained on ImageNet-21K, CoVeR improves Top-1 accuracy by 2.4% on Kinetics-400, 2.3% on Kinetics-600, and 2.3% on SomethingSomething-v2. When pretrained on larger-scale image datasets, following prior state-of-the-art methods, CoVeR achieves state-of-the-art results on Kinetics-400 (87.2%), Kinetics-600 (87.9%), Kinetics-700 (79.8%), SomethingSomething-v2 (70.9%), and Moments-in-Time (46.1%), using only a simple spatiotemporal video Transformer.
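The two ingredients of the recipe described above — a shared backbone trained jointly across datasets with different label spaces, and images entering the same pipeline as single-frame videos — can be sketched in a few lines. The snippet below is a minimal, illustrative numpy sketch only: the toy pooling-plus-linear "backbone", the dataset names, and the head sizes are hypothetical stand-ins (CoVeR actually uses a TimeSformer backbone), but the routing of samples through one shared model with per-dataset classification heads mirrors the co-training setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def as_video(image):
    """Treat an image of shape (H, W, C) as a single-frame video (T=1, H, W, C)."""
    return image[None, ...]

class CoTrainModel:
    """Toy shared backbone with one classification head per dataset.

    Illustrative simplification: the real backbone is a video
    Transformer; here it is just pooling plus a linear projection.
    """

    def __init__(self, channels, feat_dim, head_sizes):
        self.proj = rng.normal(size=(channels, feat_dim)) * 0.01   # shared weights
        self.heads = {name: rng.normal(size=(feat_dim, n)) * 0.01  # per-dataset heads
                      for name, n in head_sizes.items()}

    def forward(self, video, dataset):
        # Average-pool over time, height, and width, then classify with
        # the head matching the sample's source dataset.
        pooled = video.mean(axis=(0, 1, 2))       # (C,)
        feat = pooled @ self.proj                 # (feat_dim,)
        return feat @ self.heads[dataset]         # (num_classes,)

# Joint training mixes videos and images in each step; an image goes
# through the exact same forward pass as a single-frame video.
model = CoTrainModel(channels=3, feat_dim=64,
                     head_sizes={"kinetics400": 400, "ssv2": 174, "imagenet": 1000})

video_clip = rng.normal(size=(8, 32, 32, 3))      # (T, H, W, C) video sample
image = rng.normal(size=(32, 32, 3))              # (H, W, C) image sample

video_logits = model.forward(video_clip, "kinetics400")
image_logits = model.forward(as_video(image), "imagenet")
```

Because every dataset shares the backbone parameters and differs only in its final head, gradients from appearance-focused data (Kinetics, ImageNet) and motion-focused data (SomethingSomething) all update the same representation.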
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| action-classification-on-kinetics-400 | CoVeR (JFT-3B) | Top-1 Accuracy: 87.2, Top-5 Accuracy: 97.5 |
| action-classification-on-kinetics-400 | CoVeR (JFT-300M) | Top-1 Accuracy: 86.3, Top-5 Accuracy: 97.2 |
| action-classification-on-kinetics-600 | CoVeR (JFT-300M) | Top-1 Accuracy: 86.8, Top-5 Accuracy: 97.3 |
| action-classification-on-kinetics-600 | CoVeR (JFT-3B) | Top-1 Accuracy: 87.9, Top-5 Accuracy: 97.8 |
| action-classification-on-kinetics-700 | CoVeR (JFT-3B) | Top-1 Accuracy: 79.8, Top-5 Accuracy: 94.9 |
| action-classification-on-kinetics-700 | CoVeR (JFT-300M) | Top-1 Accuracy: 78.5, Top-5 Accuracy: 94.2 |
| action-classification-on-moments-in-time | CoVeR (JFT-3B) | Top-1 Accuracy: 46.1, Top-5 Accuracy: 75.4 |
| action-classification-on-moments-in-time | CoVeR (JFT-300M) | Top-1 Accuracy: 45.0, Top-5 Accuracy: 73.9 |
| action-recognition-in-videos-on-something | CoVeR (JFT-3B) | Top-1 Accuracy: 70.9, Top-5 Accuracy: 92.5 |
| action-recognition-in-videos-on-something | CoVeR (JFT-300M) | Top-1 Accuracy: 69.8, Top-5 Accuracy: 91.9 |