
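Abstract
Generalizing over temporal variations is a prerequisite for effective action recognition in videos. Despite significant advances in deep neural networks, it remains a challenge to focus on short-term discriminative motions in relation to the overall performance of an action. We address this challenge by allowing some flexibility in discovering relevant spatio-temporal features. We introduce Squeeze and Recursion Temporal Gates (SRTG), an approach that favors inputs with similar activation patterns under potential temporal variations. We implement this idea with a novel convolutional neural network (CNN) block that uses a long short-term memory network (LSTM) to capture feature dynamics, together with a temporal gate that evaluates the consistency between the discovered dynamics and the modeled features. Experiments show consistent performance improvements when SRTG blocks are used, with only a minimal increase in computational cost (GFLOPs). On Kinetics-700, our method performs on par with current state-of-the-art models; on HACS, Moments in Time, UCF-101 and HMDB-51, it outperforms them.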
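The block described in the abstract (squeeze the features, run an LSTM over time, then gate on consistency) can be illustrated with a small, dependency-free sketch. This is not the authors' implementation. The abstract does not state the gating criterion or how the recurrent output is fused, so the nearest-neighbour round-trip check in `is_consistent` and the additive fusion in `srtg_block` are assumptions made for illustration, as are all names (`squeeze`, `TinyLSTM`, `srtg_block`) and the random, untrained weights.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def squeeze(features):
    """Global average pool. features[t][c] is a flat list of spatial activations."""
    return [[sum(ch) / len(ch) for ch in frame] for frame in features]


class TinyLSTM:
    """Single-layer LSTM with hidden size equal to input size (random weights)."""

    def __init__(self, size, seed=0):
        rng = random.Random(seed)
        self.size = size
        # One weight matrix per gate (input, forget, cell, output), applied to [x; h].
        self.w = [[[rng.uniform(-0.5, 0.5) for _ in range(2 * size)]
                   for _ in range(size)] for _ in range(4)]
        self.b = [[0.0] * size for _ in range(4)]

    def _affine(self, gate, xh):
        return [sum(w * v for w, v in zip(row, xh)) + b
                for row, b in zip(self.w[gate], self.b[gate])]

    def run(self, seq):
        """Return the hidden state after every time step."""
        h = [0.0] * self.size
        c = [0.0] * self.size
        outputs = []
        for x in seq:
            xh = list(x) + h
            i = [sigmoid(v) for v in self._affine(0, xh)]
            f = [sigmoid(v) for v in self._affine(1, xh)]
            g = [math.tanh(v) for v in self._affine(2, xh)]
            o = [sigmoid(v) for v in self._affine(3, xh)]
            c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
            h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
            outputs.append(h)
        return outputs


def nearest(vec, bank):
    """Index of the vector in `bank` closest to `vec` (squared Euclidean distance)."""
    return min(range(len(bank)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(vec, bank[j])))


def is_consistent(pooled, dynamics):
    """ASSUMED gate criterion: the round trip pooled[t] -> dynamics -> pooled
    must land back on index t for every time step."""
    return all(nearest(dynamics[nearest(p, dynamics)], pooled) == t
               for t, p in enumerate(pooled))


def srtg_block(features, lstm):
    """Return (features_out, gate_open) for features shaped [T][C][H*W]."""
    pooled = squeeze(features)      # squeeze: [T][C]
    dynamics = lstm.run(pooled)     # recursion: [T][C]
    if not is_consistent(pooled, dynamics):
        return features, False      # gate closed: pass the input through unchanged
    # Gate open. ASSUMED fusion: add each channel's recurrent value to every
    # spatial position of that channel.
    fused = [[[v + d for v in ch] for ch, d in zip(frame, dyn)]
             for frame, dyn in zip(features, dynamics)]
    return fused, True


# Toy input: T=2 frames, C=2 channels, H*W=2 spatial positions.
clip = [[[1.0, 3.0], [0.5, 0.5]], [[5.0, 7.0], [2.0, 4.0]]]
out, gate_open = srtg_block(clip, TinyLSTM(size=2))
```

When the gate is closed the block reduces to an identity mapping, so under these assumptions it can be dropped into an existing network without changing activations for inputs whose dynamics it judges inconsistent.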
Benchmarks
| Benchmark | Method | Top-1 accuracy (%) | Top-5 accuracy (%) |
|---|---|---|---|
| action-classification-on-kinetics-700 | SRTG r(2+1)d-34 | 49.43 | 73.23 |
| action-classification-on-kinetics-700 | SRTG r3d-50 | 53.52 | 74.17 |
| action-classification-on-kinetics-700 | SRTG r3d-101 | 56.46 | 76.82 |
| action-classification-on-kinetics-700 | SRTG r3d-34 | 49.15 | 72.68 |
| action-classification-on-kinetics-700 | SRTG r(2+1)d-50 | 54.17 | 74.62 |
| action-classification-on-moments-in-time | SRTG r3d-34 | 28.55 | 52.35 |
| action-classification-on-moments-in-time | SRTG r3d-101 | 33.56 | 58.49 |
| action-classification-on-moments-in-time | SRTG r3d-50 | 30.72 | 55.65 |
| action-classification-on-moments-in-time | SRTG r(2+1)d-50 | 31.60 | 56.80 |
| action-classification-on-moments-in-time | SRTG r(2+1)d-34 | 28.97 | 54.18 |
| action-recognition-on-hacs | SRTG r(2+1)d-101 | 84.33 | 96.85 |
| action-recognition-on-hacs | SRTG r3d-34 | 78.60 | 93.57 |
| action-recognition-on-hacs | SRTG r3d-101 | 81.66 | 96.33 |
| action-recognition-on-hacs | SRTG r(2+1)d-50 | 83.77 | 96.56 |
| action-recognition-on-hacs | SRTG r(2+1)d-34 | 80.39 | 94.27 |
| action-recognition-on-hacs | SRTG r3d-50 | 80.36 | 95.55 |