
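Abstract
Recently, Transformer-based solutions have been introduced to estimate 3D human pose from 2D keypoint sequences. These methods learn spatio-temporal correlations by considering the body joints in all frames globally. We observe that the motions of different joints differ significantly. However, previous methods cannot efficiently model the inherent inter-frame correspondence of each joint, which leads to insufficient learning of spatio-temporal correlations. To address this, we propose the Mixed Spatio-Temporal Encoder (MixSTE), which comprises a temporal Transformer block that models the temporal motion of each joint separately and a spatial Transformer block that learns the spatial correlations among joints. The two blocks are applied alternately to obtain a better spatio-temporal feature encoding. In addition, the network output is extended from the central frame to all frames of the input video, which improves the coherence between the input and output sequences. We conduct extensive experiments on three benchmarks (Human3.6M, MPI-INF-3DHP, and HumanEva). The results show that our model outperforms the previous state-of-the-art approach by 10.9% in P-MPJPE and 7.6% in MPJPE. The code is available at https://github.com/JinluZhang1126/MixSTE.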
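The alternating design described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the attention here has no learned projections, heads, or MLP layers, and the function names (`self_attention`, `spatial_block`, `temporal_block`, `mixed_encoder`), the spatial-then-temporal ordering, the depth, and the tensor sizes are all assumptions chosen for illustration. The point is only how the token axis changes: the spatial block attends across joints within a frame, while the temporal block attends across frames for each joint separately.

```python
import numpy as np

def self_attention(x):
    """Parameter-free scaled dot-product self-attention over axis -2.

    x has shape (..., tokens, channels). Learned projections, multiple
    heads, and MLP layers are omitted to keep the sketch short.
    """
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def spatial_block(x):
    # x: (frames, joints, channels); tokens are the joints of each frame.
    return x + self_attention(x)

def temporal_block(x):
    # Tokens are the frames of each joint's own trajectory, so every
    # joint is processed independently of the others.
    xt = np.swapaxes(x, 0, 1)  # (joints, frames, channels)
    return x + np.swapaxes(self_attention(xt), 0, 1)

def mixed_encoder(x, depth=2):
    # Alternate the two blocks (ordering and depth are assumptions).
    for _ in range(depth):
        x = spatial_block(x)
        x = temporal_block(x)
    return x

rng = np.random.default_rng(0)
features = rng.standard_normal((9, 17, 8))  # 9 frames, 17 joints, 8 channels
encoded = mixed_encoder(features)
print(encoded.shape)  # one feature vector per joint for every input frame
```

Because the encoded tensor keeps the frame axis, a regression head applied to it could produce a 3D pose for every input frame rather than only the central one, which is the sequence-to-sequence output the abstract refers to.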
Code repository
JinluZhang1126/MixSTE (official, PyTorch; mentioned on GitHub)
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| 3d-human-pose-estimation-on-human36m | MixSTE (T=81 GT) | Average MPJPE (mm): 25.9; Multi-View or Monocular: Monocular; Using 2D ground-truth joints: Yes |
| 3d-human-pose-estimation-on-human36m | MixSTE (T=243 GT) | Average MPJPE (mm): 21.6; Multi-View or Monocular: Monocular; Using 2D ground-truth joints: Yes |
| 3d-human-pose-estimation-on-human36m | MixSTE (CPN, T=81) | Average MPJPE (mm): 42.4; Multi-View or Monocular: Monocular; Using 2D ground-truth joints: No |
| 3d-human-pose-estimation-on-human36m | MixSTE (CPN, T=243) | Average MPJPE (mm): 40.9; Multi-View or Monocular: Monocular; Using 2D ground-truth joints: No |
| 3d-human-pose-estimation-on-human36m | MixSTE (HRNet, T=243) | Average MPJPE (mm): 39.8; Multi-View or Monocular: Monocular; Using 2D ground-truth joints: No |
| 3d-human-pose-estimation-on-humaneva-i | MixSTE (T=43, FT) | Mean Reconstruction Error (mm): 16.1 |
| 3d-human-pose-estimation-on-mpi-inf-3dhp | MixSTE (T=27) | AUC: 66.5; MPJPE: 54.9; PCK: 94.4 |
| 3d-human-pose-estimation-on-mpi-inf-3dhp | MixSTE (T=1) | AUC: 63.8; MPJPE: 57.9; PCK: 94.2 |
| classification-on-full-body-parkinsons | MixSTE | F1-score (weighted): 0.41 |
| monocular-3d-human-pose-estimation-on-human3 | MixSTE (HRNet, T=243) | 2D detector: HRNet; Average MPJPE (mm): 39.8; Frames Needed: 243; Need Ground Truth 2D Pose: No; Use Video Sequence: Yes |
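
For readers unfamiliar with the main metric in the table, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth joint positions; P-MPJPE is the same quantity computed after rigidly aligning the prediction to the ground truth. The following is a minimal sketch of plain MPJPE for a single pose. The function name `mpjpe` and the three-joint poses are hypothetical values made up for illustration and are not taken from any of the benchmarks.

```python
import math

def mpjpe(predicted, target):
    """Mean Euclidean distance between matching joints, in the input units."""
    if len(predicted) != len(target):
        raise ValueError("pose lengths differ")
    distances = [math.dist(p, t) for p, t in zip(predicted, target)]
    return sum(distances) / len(distances)

# Hypothetical three-joint poses in millimetres (illustrative values only).
target = [(0.0, 0.0, 0.0), (0.0, 100.0, 0.0), (50.0, 100.0, 0.0)]
predicted = [(3.0, 4.0, 0.0), (0.0, 100.0, 0.0), (50.0, 100.0, 10.0)]

error_mm = mpjpe(predicted, target)
print(error_mm)  # per-joint errors are 5, 0, and 10 mm, so the mean is 5.0
```

Benchmark figures such as those above are typically obtained by averaging this per-pose error over all frames of the test set.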