
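Abstract
Graph-based reasoning over skeleton data has emerged as a promising approach to human action recognition. However, most existing graph-based methods take a whole temporal sequence as input, which leads to considerable computational redundancy when they are applied to online inference. To address this, the paper reformulates the Spatio-Temporal Graph Convolutional Neural Network (ST-GCN) as a Continual Inference Network, which makes step-by-step predictions in time without reprocessing frames. To evaluate the method, the authors build a continual version of ST-GCN, CoST-GCN, along with two derived methods that use different self-attention mechanisms, CoAGCN and CoS-TR. They investigate weight transfer strategies and architectural modifications for inference acceleration, and run experiments on the NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton 400 datasets. While retaining similar predictive accuracy, the methods achieve up to a 109x reduction in time complexity, on-hardware speedups of up to 26x, and a 52% reduction in maximum allocated memory during online inference.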
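The idea of predicting step by step without reprocessing frames can be illustrated with a single temporal convolution. The following is a minimal NumPy sketch, not the authors' implementation: the class `ContinualTemporalConv`, the helper `full_temporal_conv`, and all shapes are hypothetical names and values chosen for illustration. The layer caches the most recent frames and emits one output per incoming frame, and the result is compared against a convolution over the whole clip.

```python
from collections import deque

import numpy as np


class ContinualTemporalConv:
    """Temporal convolution evaluated one frame at a time (illustrative only)."""

    def __init__(self, weight):
        # weight has shape (kernel_size, c_in, c_out).
        self.weight = weight
        self.kernel_size = weight.shape[0]
        # Cache only the most recent `kernel_size` frames; older ones are dropped.
        self.buffer = deque(maxlen=self.kernel_size)

    def step(self, frame):
        """Consume one frame of shape (c_in,); return (c_out,) or None during warm-up."""
        self.buffer.append(frame)
        if len(self.buffer) < self.kernel_size:
            return None
        window = np.stack(list(self.buffer))  # (kernel_size, c_in)
        return np.einsum("kc,kco->o", window, self.weight)


def full_temporal_conv(sequence, weight):
    """Reference: 'valid' temporal convolution over the whole clip at once."""
    k = weight.shape[0]
    return np.stack([
        np.einsum("kc,kco->o", sequence[t:t + k], weight)
        for t in range(sequence.shape[0] - k + 1)
    ])


rng = np.random.default_rng(0)
weight = rng.normal(size=(3, 4, 2))   # kernel_size=3, c_in=4, c_out=2
clip = rng.normal(size=(10, 4))       # 10 frames, 4 features per frame

layer = ContinualTemporalConv(weight)
stepwise = np.stack([out for out in map(layer.step, clip) if out is not None])
reference = full_temporal_conv(clip, weight)
outputs_match = np.allclose(stepwise, reference)
print("step-by-step output matches full-clip output:", outputs_match)
```

In an online setting, a clip-based model would rerun the full convolution on a sliding window for every new frame, whereas the step-wise layer does one kernel-sized product per frame. In a deeper network each layer would presumably keep its own cache of intermediate features; that part is not shown here.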
Code repository
lukashedegaard/continual-skeletons (official; mentioned in GitHub)
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| skeleton-based-action-recognition-on-kinetics | ST-GCN (2-stream) | Accuracy: 34.4 GFLOPS per prediction: 24.09 |
| skeleton-based-action-recognition-on-kinetics | CoST-GCN* (1-stream) | Accuracy: 30.2 GFLOPS per prediction: 0.11 |
| skeleton-based-action-recognition-on-kinetics | CoAGCN* (1-stream) | Accuracy: 23.3 GFLOPS per prediction: 0.12 |
| skeleton-based-action-recognition-on-kinetics | CoST-GCN* (2-stream) | Accuracy: 32.2 GFLOPS per prediction: 0.22 |
| skeleton-based-action-recognition-on-kinetics | CoS-TR* (1-stream) | Accuracy: 27.4 GFLOPS per prediction: 0.11 |
| skeleton-based-action-recognition-on-kinetics | CoAGCN (1-stream) | Accuracy: 33 GFLOPS per prediction: 0.18 |
| skeleton-based-action-recognition-on-kinetics | CoST-GCN (1-stream) | Accuracy: 31.8 GFLOPS per prediction: 0.16 |
| skeleton-based-action-recognition-on-kinetics | CoAGCN (2-stream) | GFLOPS per prediction: 0.36 |
| skeleton-based-action-recognition-on-kinetics | CoS-TR* (2-stream) | Accuracy: 29.9 GFLOPS per prediction: 0.22 |
| skeleton-based-action-recognition-on-kinetics | CoST-GCN (2-stream) | Accuracy: 33.1 GFLOPS per prediction: 0.32 |
| skeleton-based-action-recognition-on-kinetics | CoS-TR (2-stream) | Accuracy: 32.7 GFLOPS per prediction: 0.31 |
| skeleton-based-action-recognition-on-kinetics | S-TR (1-stream) | Accuracy: 32 GFLOPS per prediction: 11.62 |
| skeleton-based-action-recognition-on-kinetics | CoS-TR (1-stream) | Accuracy: 29.7 |
| skeleton-based-action-recognition-on-kinetics | ST-GCN (1-stream) | Accuracy: 33.4 GFLOPS per prediction: 12.04 |
| skeleton-based-action-recognition-on-kinetics | AGCN (2-stream) | Accuracy: 36.9 GFLOPS per prediction: 26.91 |
| skeleton-based-action-recognition-on-kinetics | CoAGCN* (2-stream) | Accuracy: 27.5 GFLOPS per prediction: 0.25 |
| skeleton-based-action-recognition-on-kinetics | AGCN (1-stream) | Accuracy: 35 GFLOPS per prediction: 13.45 |
| skeleton-based-action-recognition-on-kinetics | S-TR (2-stream) | Accuracy: 34.7 GFLOPS per prediction: 23.24 |
| skeleton-based-action-recognition-on-ntu-rgbd | CoAGCN* (2-stream) | Accuracy (CS): 86.0 Accuracy (CV): 93.1 GFLOPs per pred: 0.44 |
| skeleton-based-action-recognition-on-ntu-rgbd | CoST-GCN* (2-stream) | Accuracy (CS): 88.3 Accuracy (CV): 95 GFLOPs per pred: 0.32 |
| skeleton-based-action-recognition-on-ntu-rgbd | CoS-TR* | Accuracy (CS): 86.3 Accuracy (CV): 92.4 GFLOPs per pred: 0.15 |
| skeleton-based-action-recognition-on-ntu-rgbd | CoS-TR* (2-stream) | Accuracy (CS): 88.9 Accuracy (CV): 94.8 GFLOPs per pred: 0.3 |
| skeleton-based-action-recognition-on-ntu-rgbd | ST-GCN | Accuracy (CS): 86 Accuracy (CV): 93.4 GFLOPs per pred: 16.73 |
| skeleton-based-action-recognition-on-ntu-rgbd | CoAGCN* | Accuracy (CS): 84.1 Accuracy (CV): 92.6 |
| skeleton-based-action-recognition-on-ntu-rgbd | CoST-GCN* | Accuracy (CS): 86.3 Accuracy (CV): 93.8 GFLOPs per pred: 0.16 |
| skeleton-based-action-recognition-on-ntu-rgbd-1 | S-TR (1-stream) | Accuracy (Cross-Setup): 81.8 Accuracy (Cross-Subject): 80.2 GFLOPS per prediction: 16.2 |
| skeleton-based-action-recognition-on-ntu-rgbd-1 | ST-GCN (1-stream) | Accuracy (Cross-Subject): 79 GFLOPS per prediction: 16.73 |
| skeleton-based-action-recognition-on-ntu-rgbd-1 | AGCN (1-stream) | Accuracy (Cross-Setup): 80.7 Accuracy (Cross-Subject): 79.7 GFLOPS per prediction: 18.69 |
| skeleton-based-action-recognition-on-ntu-rgbd-1 | CoS-TR* (2-stream) | Accuracy (Cross-Setup): 86.1 Accuracy (Cross-Subject): 84.8 GFLOPS per prediction: 0.3 |
| skeleton-based-action-recognition-on-ntu-rgbd-1 | CoST-GCN* (1-stream) | Accuracy (Cross-Setup): 81.6 Accuracy (Cross-Subject): 79.4 GFLOPS per prediction: 0.16 |
| skeleton-based-action-recognition-on-ntu-rgbd-1 | CoST-GCN* (2-stream) | Accuracy (Cross-Setup): 85.5 Accuracy (Cross-Subject): 84.0 GFLOPS per prediction: 0.32 |
| skeleton-based-action-recognition-on-ntu-rgbd-1 | CoAGCN* (2-stream) | Accuracy (Cross-Setup): 82 Accuracy (Cross-Subject): 80.4 GFLOPS per prediction: 0.44 |
| skeleton-based-action-recognition-on-ntu-rgbd-1 | CoS-TR* (1-stream) | Accuracy (Cross-Setup): 81.7 Accuracy (Cross-Subject): 79.7 GFLOPS per prediction: 0.15 |
| skeleton-based-action-recognition-on-ntu-rgbd-1 | AGCN (2-stream) | Accuracy (Cross-Setup): 85.4 Accuracy (Cross-Subject): 84 GFLOPS per prediction: 37.38 |
| skeleton-based-action-recognition-on-ntu-rgbd-1 | CoAGCN* (1-stream) | Accuracy (Cross-Setup): 79.1 Accuracy (Cross-Subject): 77.3 GFLOPS per prediction: 0.22 |
| skeleton-based-action-recognition-on-ntu-rgbd-1 | ST-GCN (2-stream) | Accuracy (Cross-Setup): 85.1 Accuracy (Cross-Subject): 83.7 GFLOPS per prediction: 33.46 |
| skeleton-based-action-recognition-on-ntu-rgbd-1 | S-TR (2-stream) | Accuracy (Cross-Setup): 86.2 Accuracy (Cross-Subject): 84.8 GFLOPS per prediction: 32.4 |
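
As a worked example of reading the table, the per-prediction FLOPs reduction can be computed as the ratio between a non-continual model and its continual counterpart. The short sketch below copies the Kinetics GFLOPs values for the ST-GCN family from the rows above. Pairing each continual model with the ST-GCN baseline of the same stream count is an assumption made here for illustration, and the table values are rounded, so the ratios are approximate.

```python
# GFLOPs per prediction, copied from the Kinetics rows of the table above.
gflops = {
    "ST-GCN (1-stream)": 12.04,
    "CoST-GCN (1-stream)": 0.16,
    "CoST-GCN* (1-stream)": 0.11,
    "ST-GCN (2-stream)": 24.09,
    "CoST-GCN (2-stream)": 0.32,
    "CoST-GCN* (2-stream)": 0.22,
}

# Assumed pairing: each continual model against the non-continual ST-GCN
# with the same number of streams.
pairs = [
    ("ST-GCN (1-stream)", "CoST-GCN (1-stream)"),
    ("ST-GCN (1-stream)", "CoST-GCN* (1-stream)"),
    ("ST-GCN (2-stream)", "CoST-GCN (2-stream)"),
    ("ST-GCN (2-stream)", "CoST-GCN* (2-stream)"),
]

# Reduction factor = baseline FLOPs / continual FLOPs.
reduction = {cont: gflops[base] / gflops[cont] for base, cont in pairs}

for name, factor in reduction.items():
    print(f"{name}: roughly {factor:.1f}x fewer FLOPs per prediction")
```

Under this pairing, the 1-stream CoST-GCN* row gives a ratio of about 109, which appears consistent with the "up to 109x" reduction in time complexity quoted in the abstract.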