
Abstract
Motion plays a crucial role in video understanding. Most state-of-the-art neural network models for video classification introduce motion information through optical flow extracted by an external, pre-trained method. Because computing optical flow frame by frame is expensive, however, incorporating motion information efficiently remains the main computational bottleneck in video understanding. In this work, we propose to replace external, computation-heavy optical-flow extraction with an internal, lightweight motion feature learning mechanism. To this end, we design MotionSqueeze, a trainable neural module for efficient motion feature extraction. It can be flexibly inserted into the intermediate layers of any neural network, where it learns to establish correspondences across frames and converts them into motion features that are fed directly into the following layers to improve prediction. Experiments show that the proposed method yields significant gains on four standard action-recognition benchmarks at only a small additional computational cost, and even surpasses the state of the art on Something-Something-V1 and V2.
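The sketch below is a minimal, illustrative PyTorch version of such a module, not the authors' implementation (see arunos728/MotionSqueeze for that). It computes a local correlation between the features of consecutive frames, estimates a per-pixel displacement with a soft-argmax over the correlation scores, and converts the displacement map into motion features with a small convolutional head, added back to the input as a residual. The class name `MotionSqueezeSketch`, the search radius `max_disp`, and the design of the conv head are assumptions made for illustration only.

```python
# A minimal sketch of a MotionSqueeze-style module, based on the description
# above. The local correlation, soft-argmax displacement estimation, and the
# small conv head are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotionSqueezeSketch(nn.Module):
    def __init__(self, channels, max_disp=3):
        super().__init__()
        self.max_disp = max_disp          # search radius for the local correlation
        k = 2 * max_disp + 1              # correlation window size (k x k)
        # Small conv head that turns the 2-channel displacement map into
        # motion features with the same channel count as the input.
        self.motion_conv = nn.Sequential(
            nn.Conv2d(2, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Precompute the (dx, dy) offsets of every position in the window.
        dy, dx = torch.meshgrid(
            torch.arange(-max_disp, max_disp + 1, dtype=torch.float32),
            torch.arange(-max_disp, max_disp + 1, dtype=torch.float32),
            indexing="ij",
        )
        self.register_buffer("offsets", torch.stack([dx, dy]).reshape(2, -1))

    def local_correlation(self, f1, f2):
        # f1, f2: (N, C, H, W) features of two consecutive frames.
        # Returns (N, k*k, H, W) correlation scores over the local window.
        n, c, h, w = f1.shape
        pad = self.max_disp
        f2p = F.pad(f2, (pad, pad, pad, pad))
        corrs = []
        for dy in range(2 * pad + 1):
            for dx in range(2 * pad + 1):
                shifted = f2p[:, :, dy:dy + h, dx:dx + w]
                corrs.append((f1 * shifted).sum(dim=1, keepdim=True))
        return torch.cat(corrs, dim=1) / (c ** 0.5)

    def forward(self, x):
        # x: (N, T, C, H, W) frame features from an intermediate layer.
        n, t, c, h, w = x.shape
        f1 = x[:, :-1].reshape(-1, c, h, w)   # frames 0 .. T-2
        f2 = x[:, 1:].reshape(-1, c, h, w)    # frames 1 .. T-1
        corr = self.local_correlation(f1, f2)              # (N*(T-1), k*k, H, W)
        weights = F.softmax(corr, dim=1)                   # soft-argmax weights
        disp = torch.einsum("nkhw,ck->nchw", weights, self.offsets)  # (., 2, H, W)
        motion = self.motion_conv(disp).view(n, t - 1, c, h, w)
        # Repeat the last time step so the output matches the input shape,
        # then add the motion features back as a residual.
        motion = torch.cat([motion, motion[:, -1:]], dim=1)
        return x + motion


if __name__ == "__main__":
    feats = torch.randn(2, 8, 64, 28, 28)     # (batch, frames, channels, H, W)
    module = MotionSqueezeSketch(channels=64)
    print(module(feats).shape)                # torch.Size([2, 8, 64, 28, 28])
```

Because the module only adds a residual with the same shape as its input, it can be dropped between existing blocks of a 2D-CNN video backbone without changing the rest of the network.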
Code Repositories
- arunos728/arunos728.github.io (mentioned on GitHub)
- arunos728/MotionSqueeze (PyTorch, mentioned on GitHub)
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| action-classification-on-kinetics-400 | MSNet-R50 (16 frames, ImageNet pretrained) | Acc@1: 76.4 |
| action-recognition-in-videos-on-hmdb-51 | MSNet-R50 (16 frames, ImageNet pretrained) | Average accuracy of 3 splits: 77.4 |
| action-recognition-in-videos-on-something | MSNet-R50En (8+16 ensemble, ImageNet pretrained) | Top-1 Accuracy: 66.6; Top-5 Accuracy: 90.6 |
| action-recognition-in-videos-on-something | MSNet-R50 (16 frames, ImageNet pretrained) | Top-1 Accuracy: 64.7; Top-5 Accuracy: 89.4 |
| action-recognition-in-videos-on-something | MSNet-R50 (8 frames, ImageNet pretrained) | Top-1 Accuracy: 63; Top-5 Accuracy: 88.4 |
| action-recognition-in-videos-on-something-1 | MSNet-R50 (16 frames, ImageNet pretrained) | Top-1 Accuracy: 52.1; Top-5 Accuracy: 82.3 |
| action-recognition-in-videos-on-something-1 | MSNet-R50En (ensemble) | Top-1 Accuracy: 55.1 |
| action-recognition-in-videos-on-something-1 | MSNet-R50 (8 frames, ImageNet pretrained) | Top-1 Accuracy: 50.9; Top-5 Accuracy: 80.3 |
| action-recognition-in-videos-on-something-1 | MSNet-R50En (8+16 ensemble, ImageNet pretrained) | Top-1 Accuracy: 54.4; Top-5 Accuracy: 83.8 |
| video-classification-on-something-something | MSNet-R50En (ours) | Top-5 Accuracy: 84 |
| video-classification-on-something-something-1 | MSNet-R50En (ours) | Top-5 Accuracy: 91 |