
Abstract
We present a unified approach to tackling various human-centric video tasks by learning human motion representations from large-scale, heterogeneous data resources. Specifically, we design a pretraining stage in which a motion encoder is trained to recover the underlying 3D motion from noisy, partial 2D observations. The motion representations acquired this way incorporate geometric, kinematic, and physical knowledge, and can be readily transferred to multiple downstream tasks. We implement the motion encoder with a Dual-stream Spatio-temporal Transformer (DSTformer) neural network, which captures long-range spatio-temporal relationships among skeletal joints comprehensively and adaptively, achieving the lowest 3D pose estimation error to date when trained from scratch. Furthermore, by finetuning the pretrained motion encoder with only a simple regression head (1-2 layers), our framework achieves state-of-the-art performance on all three downstream tasks, demonstrating the versatility of the learned motion representations. Code and models are available at https://motionbert.github.io/.
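The pretraining objective described above (recovering 3D motion from noisy, partial 2D observations) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the corruption parameters (`mask_ratio`, `noise_std`) and the per-joint linear "encoder" are assumptions for brevity — the actual model is the DSTformer.

```python
import numpy as np

rng = np.random.default_rng(0)
T, J = 243, 17  # clip length and joint count; 243 frames matches the benchmark setting below

def corrupt(joints_2d, mask_ratio=0.15, noise_std=0.02):
    """Simulate noisy, partial 2D observations: add Gaussian noise,
    then zero out a random subset of joints (assumed corruption scheme)."""
    noisy = joints_2d + rng.normal(0.0, noise_std, joints_2d.shape)
    mask = rng.random(joints_2d.shape[:-1]) < mask_ratio
    noisy[mask] = 0.0
    return noisy

# Stand-in motion encoder: one shared linear map per joint lifting 2D -> 3D.
# The paper replaces this with the DSTformer over the whole clip.
W = rng.normal(0.0, 0.1, (2, 3))

def encode(obs_2d):
    return obs_2d @ W  # (T, J, 2) -> (T, J, 3)

gt_3d = rng.random((T, J, 3))            # toy ground-truth 3D motion
pred_3d = encode(corrupt(rng.random((T, J, 2))))
loss = float(np.mean((pred_3d - gt_3d) ** 2))  # 3D reconstruction (pretraining) loss
print(pred_3d.shape, loss > 0)
```

After pretraining, the encoder's output representation would be reused for downstream tasks by attaching a small regression head, as the abstract notes.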
Code Repository
Walter0807/MotionBERT
Official
pytorch
Mentioned in GitHub
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| 3d-human-pose-estimation-on-3dpw | MotionBERT-HybrIK | MPJPE: 68.8 MPVPE: 79.4 PA-MPJPE: 40.6 |
| 3d-human-pose-estimation-on-3dpw | MotionBERT (Finetune) | MPJPE: 76.9 MPVPE: 88.1 PA-MPJPE: 47.2 |
| 3d-human-pose-estimation-on-human36m | MotionBERT (Finetune) | #Frames: 243 Average MPJPE (mm): 16.9 Multi-View or Monocular: Monocular Using 2D ground-truth joints: Yes |
| classification-on-full-body-parkinsons | MotionBERT | F1-score (weighted): 0.47 |
| classification-on-full-body-parkinsons | MotionBERT-LITE | F1-score (weighted): 0.43 |
| monocular-3d-human-pose-estimation-on-human3 | MotionBERT (Scratch) | 2D detector: SH Average MPJPE (mm): 39.2 Frames Needed: 243 Need Ground Truth 2D Pose: No Use Video Sequence: Yes |
| monocular-3d-human-pose-estimation-on-human3 | MotionBERT (Finetune) | 2D detector: SH Average MPJPE (mm): 37.5 Frames Needed: 243 Need Ground Truth 2D Pose: No Use Video Sequence: Yes |
| one-shot-3d-action-recognition-on-ntu-rgbd | MotionBERT (Finetune) | Accuracy: 67.4% |
| skeleton-based-action-recognition-on-ntu-rgbd | MotionBERT (Finetune) | Accuracy (CS): 93.0 Accuracy (CV): 97.2 |