MotionBERT: A Unified Perspective on Learning Human Motion Representations
Zhu Wentao; Ma Xiaoxuan; Liu Zhaoyang; Liu Libin; Wu Wayne; Wang Yizhou

Abstract
We present a unified perspective on tackling various human-centric video tasks by learning human motion representations from large-scale and heterogeneous data resources. Specifically, we propose a pretraining stage in which a motion encoder is trained to recover the underlying 3D motion from noisy partial 2D observations. The motion representations acquired in this way incorporate geometric, kinematic, and physical knowledge about human motion, which can be easily transferred to multiple downstream tasks. We implement the motion encoder with a Dual-stream Spatio-temporal Transformer (DSTformer) neural network. It could capture long-range spatio-temporal relationships among the skeletal joints comprehensively and adaptively, exemplified by the lowest 3D pose estimation error so far when trained from scratch. Furthermore, our proposed framework achieves state-of-the-art performance on all three downstream tasks by simply finetuning the pretrained motion encoder with a simple regression head (1-2 layers), which demonstrates the versatility of the learned motion representations. Code and models are available at https://motionbert.github.io/
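The pretraining input described in the abstract, noisy partial 2D observations derived from 3D motion, can be illustrated with a small sketch. This is not the authors' pipeline: the function name `corrupt_2d`, the orthographic projection (dropping depth), the Gaussian noise, and the values of `mask_prob` and `noise_std` are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import random

def corrupt_2d(motion_3d, mask_prob=0.15, noise_std=0.01, seed=0):
    """Turn a 3D motion into a noisy, partially masked 2D observation.

    motion_3d: list of frames, each a list of (x, y, z) joint positions.
    Returns (noisy_2d, visible), where masked joints are set to (0.0, 0.0)
    and visible[frame][joint] is False for them.
    All parameters here are hypothetical, not taken from the paper.
    """
    rng = random.Random(seed)
    noisy_2d, visible = [], []
    for frame in motion_3d:
        frame_2d, frame_vis = [], []
        for (x, y, _z) in frame:
            if rng.random() < mask_prob:
                # "Partial": this joint is hidden from the encoder.
                frame_2d.append((0.0, 0.0))
                frame_vis.append(False)
            else:
                # "Noisy": drop depth and jitter the 2D position.
                frame_2d.append((x + rng.gauss(0.0, noise_std),
                                 y + rng.gauss(0.0, noise_std)))
                frame_vis.append(True)
        noisy_2d.append(frame_2d)
        visible.append(frame_vis)
    return noisy_2d, visible

# Two frames with two joints each (made-up coordinates).
motion = [[(0.1, 0.2, 0.3), (0.4, 0.5, 0.6)],
          [(0.7, 0.8, 0.9), (1.0, 1.1, 1.2)]]
observed_2d, visible = corrupt_2d(motion)
```

Under this setup, a motion encoder would be trained to map `observed_2d` back to the original `motion`, so the corruption only affects the input and the clean 3D motion serves as the target.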
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| 3d-human-pose-estimation-on-3dpw | MotionBERT-HybrIK | MPJPE: 68.8; MPVPE: 79.4; PA-MPJPE: 40.6 |
| 3d-human-pose-estimation-on-3dpw | MotionBERT (Finetune) | MPJPE: 76.9; MPVPE: 88.1; PA-MPJPE: 47.2 |
| 3d-human-pose-estimation-on-human36m | MotionBERT (Finetune) | #Frames: 243; Average MPJPE (mm): 16.9; Multi-View or Monocular: Monocular; Using 2D ground-truth joints: Yes |
| classification-on-full-body-parkinsons | MotionBERT | F1-score (weighted): 0.47 |
| classification-on-full-body-parkinsons | MotionBERT-LITE | F1-score (weighted): 0.43 |
| monocular-3d-human-pose-estimation-on-human3 | MotionBERT (Scratch) | 2D detector: SH; Average MPJPE (mm): 39.2; Frames Needed: 243; Need Ground Truth 2D Pose: No; Use Video Sequence: Yes |
| monocular-3d-human-pose-estimation-on-human3 | MotionBERT (Finetune) | 2D detector: SH; Average MPJPE (mm): 37.5; Frames Needed: 243; Need Ground Truth 2D Pose: No; Use Video Sequence: Yes |
| one-shot-3d-action-recognition-on-ntu-rgbd | MotionBERT (Finetune) | Accuracy: 67.4% |
| skeleton-based-action-recognition-on-ntu-rgbd | MotionBERT (Finetune) | Accuracy (CS): 93.0; Accuracy (CV): 97.2 |
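For readers unfamiliar with the MPJPE figures in the table, the metric (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth joint positions. The sketch below is a minimal illustration with made-up coordinates; the function name `mpjpe` is chosen here and it is not the benchmark's evaluation code. PA-MPJPE applies a rigid Procrustes alignment before this same measurement, and that step is not shown.

```python
import math

def mpjpe(pred, target):
    """Mean per-joint position error over all frames and joints.

    pred, target: lists of frames, each a list of (x, y, z) joints,
    in the same unit (the table above reports millimetres).
    """
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            total += math.dist(p, t)  # Euclidean distance for one joint
            count += 1
    return total / count

# One frame, two joints: per-joint errors are 3 and 4, so the mean is 3.5.
pred = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
target = [[(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]]
print(mpjpe(pred, target))  # 3.5
```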