
Abstract
Recent transformer-based approaches have demonstrated excellent performance in 3D human pose estimation. However, they typically adopt a holistic view and, by encoding global relationships among all joints, fail to capture local dependencies precisely. In this paper, we present a novel Attention-GCNFormer (AGFormer) block that divides the number of channels between two parallel streams, a Transformer stream and a GCNFormer stream. The proposed GCNFormer module exploits the local relationships between adjacent joints and outputs a new representation that is complementary to the Transformer output. By fusing these two representations adaptively, AGFormer exhibits a stronger ability to learn the underlying 3D structure. By stacking multiple AGFormer blocks, we propose MotionAGFormer in four different variants, which can be chosen based on the speed-accuracy trade-off. We evaluate our model on two popular benchmark datasets: Human3.6M and MPI-INF-3DHP. MotionAGFormer-B achieves state-of-the-art results, with P1 errors of 38.4 mm and 16.2 mm on these two datasets, respectively. Remarkably, it uses a quarter of the parameters of the previous leading model and is three times more computationally efficient on the Human3.6M dataset. Code and models are available at https://github.com/TaatiTeam/MotionAGFormer.
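The core idea of the AGFormer block described above can be sketched in a few lines: split the channels between a global attention stream and a local graph-convolution stream, then fuse the two outputs with adaptive softmax weights. The sketch below is illustrative only, not the authors' implementation; all function names are invented, the streams are reduced to their bare operations (a single attention step and a neighbour-averaging graph convolution), and the fusion omits the learned projection the paper would use.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_stream(x):
    """Global stream: single-head self-attention over all joints
    (projections omitted for brevity)."""
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    return softmax(scores, axis=-1) @ x

def gcn_stream(x, adj):
    """Local stream: average each joint with its skeleton neighbours,
    a degree-normalized graph convolution without learned weights."""
    deg = adj.sum(axis=-1, keepdims=True)
    return (adj @ x) / deg

def agformer_block(x, adj):
    """x: (batch, joints, channels). The channels are divided between
    the two parallel streams; the outputs are fused adaptively."""
    half = x.shape[-1] // 2
    t_out = attention_stream(x[..., :half])   # Transformer-like stream
    g_out = gcn_stream(x[..., half:], adj)    # GCNFormer-like stream
    # Adaptive fusion: element-wise softmax weights decide how much
    # each stream contributes at every joint and channel.
    alpha = softmax(np.stack([t_out, g_out]), axis=0)
    return alpha[0] * t_out + alpha[1] * g_out
```

A toy call with 3 joints in a chain (adjacency with self-loops) and 8 input channels returns a fused representation of 4 channels; in the actual model the streams would keep the full width via learned projections and the blocks would be stacked, as the abstract describes.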
Code Repository
taatiteam/motionagformer (official, PyTorch)
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| 3d-human-pose-estimation-on-human36m | MotionAGFormer-B (T=243) | #Frames: 243; Average MPJPE (mm): 19.4; Multi-View or Monocular: Monocular |
| 3d-human-pose-estimation-on-human36m | MotionAGFormer-XS (T=27) | #Frames: 27; Average MPJPE (mm): 28.1; Multi-View or Monocular: Monocular |
| 3d-human-pose-estimation-on-human36m | MotionAGFormer-S (T=81) | #Frames: 81; Average MPJPE (mm): 26.5; Multi-View or Monocular: Monocular |
| 3d-human-pose-estimation-on-human36m | MotionAGFormer-L (T=243) | #Frames: 243; Average MPJPE (mm): 17.3; Multi-View or Monocular: Monocular |
| 3d-human-pose-estimation-on-mpi-inf-3dhp | MotionAGFormer-L (T=81) | AUC: 85.3; MPJPE: 16.2; PCK: 98.2 |
| 3d-human-pose-estimation-on-mpi-inf-3dhp | MotionAGFormer-XS (T=27) | AUC: 83.5; MPJPE: 19.2; PCK: 98.2 |
| 3d-human-pose-estimation-on-mpi-inf-3dhp | MotionAGFormer-B (T=81) | AUC: 84.2; MPJPE: 18.2; PCK: 98.3 |
| 3d-human-pose-estimation-on-mpi-inf-3dhp | MotionAGFormer-S (T=81) | AUC: 84.5; MPJPE: 17.1; PCK: 98.3 |
| classification-on-full-body-parkinsons | MotionAGFormer | F1-score (weighted): 0.42 |
| monocular-3d-human-pose-estimation-on-human3 | MotionAGFormer-B | 2D detector: SH; Average MPJPE (mm): 38.4; Frames Needed: 243; Need Ground Truth 2D Pose: No; Use Video Sequence: Yes |
| monocular-3d-human-pose-estimation-on-human3 | MotionAGFormer-S | 2D detector: SH; Average MPJPE (mm): 42.5; Frames Needed: 81; Need Ground Truth 2D Pose: No; Use Video Sequence: Yes |
| monocular-3d-human-pose-estimation-on-human3 | MotionAGFormer-XS | 2D detector: SH; Average MPJPE (mm): 45.1; Frames Needed: 27; Need Ground Truth 2D Pose: No; Use Video Sequence: Yes |
| monocular-3d-human-pose-estimation-on-human3 | MotionAGFormer-L | 2D detector: SH; Average MPJPE (mm): 38.4; Frames Needed: 243; Need Ground Truth 2D Pose: No; Use Video Sequence: Yes |