5 months ago

MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video

Zhang Jinlu ; Tu Zhigang ; Yang Jianyu ; Chen Yujin ; Yuan Junsong

Abstract

Recent transformer-based solutions have been introduced to estimate 3D human pose from 2D keypoint sequence by considering body joints among all frames globally to learn spatio-temporal correlation. We observe that the motions of different joints differ significantly. However, the previous methods cannot efficiently model the solid inter-frame correspondence of each joint, leading to insufficient learning of spatial-temporal correlation. We propose MixSTE (Mixed Spatio-Temporal Encoder), which has a temporal transformer block to separately model the temporal motion of each joint and a spatial transformer block to learn inter-joint spatial correlation. These two blocks are utilized alternately to obtain better spatio-temporal feature encoding. In addition, the network output is extended from the central frame to entire frames of the input video, thereby improving the coherence between the input and output sequences. Extensive experiments are conducted on three benchmarks (Human3.6M, MPI-INF-3DHP, and HumanEva). The results show that our model outperforms the state-of-the-art approach by 10.9% P-MPJPE and 7.6% MPJPE. The code is available at https://github.com/JinluZhang1126/MixSTE.
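
The alternating design described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses parameter-free self-attention (no learned projections, MLPs, normalization, or residual connections), the function names are hypothetical, and applying the spatial block before the temporal block is an assumption. It only shows how the same features are grouped into tokens two different ways, and that one output is produced per input frame (seq2seq).

```python
import math

def self_attention(tokens):
    """Parameter-free scaled dot-product self-attention over a list of vectors.
    (Simplification: queries, keys and values are the tokens themselves.)"""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w / z * v[i] for w, v in zip(weights, tokens))
                    for i in range(d)])
    return out

def spatial_block(x):
    """x[t][j] is the feature vector of joint j in frame t.
    Tokens are the joints of one frame: attention runs across joints."""
    return [self_attention(frame) for frame in x]

def temporal_block(x):
    """Tokens are the frames of one joint: each joint's trajectory is
    attended to separately from the other joints."""
    T, J = len(x), len(x[0])
    out = [[None] * J for _ in range(T)]
    for j in range(J):
        trajectory = [x[t][j] for t in range(T)]
        attended = self_attention(trajectory)
        for t in range(T):
            out[t][j] = attended[t]
    return out

def mixed_encoder(x, depth=2):
    """Alternate the two blocks `depth` times (order is an assumption).
    The output keeps the (T, J, C) layout, so every frame gets an output."""
    for _ in range(depth):
        x = spatial_block(x)
        x = temporal_block(x)
    return x
```

Calling `mixed_encoder` on a nested list of shape (frames, joints, channels) returns a nested list of the same shape, which is what allows a regression head to predict a 3D pose for every frame instead of only the central one.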

Code Repositories

JinluZhang1126/MixSTE (official, PyTorch)

Benchmarks

Benchmark: 3d-human-pose-estimation-on-human36m
Methodology: MixSTE (T=81 GT)
Average MPJPE (mm): 25.9
Multi-View or Monocular: Monocular
Using 2D ground-truth joints: Yes

Benchmark: 3d-human-pose-estimation-on-human36m
Methodology: MixSTE (T=243 GT)
Average MPJPE (mm): 21.6
Multi-View or Monocular: Monocular
Using 2D ground-truth joints: Yes

Benchmark: 3d-human-pose-estimation-on-human36m
Methodology: MixSTE (CPN, T=81)
Average MPJPE (mm): 42.4
Multi-View or Monocular: Monocular
Using 2D ground-truth joints: No

Benchmark: 3d-human-pose-estimation-on-human36m
Methodology: MixSTE (CPN, T=243)
Average MPJPE (mm): 40.9
Multi-View or Monocular: Monocular
Using 2D ground-truth joints: No

Benchmark: 3d-human-pose-estimation-on-human36m
Methodology: MixSTE (HRNet, T=243)
Average MPJPE (mm): 39.8
Multi-View or Monocular: Monocular
Using 2D ground-truth joints: No

Benchmark: 3d-human-pose-estimation-on-humaneva-i
Methodology: MixSTE (T=43, FT)
Mean Reconstruction Error (mm): 16.1

Benchmark: 3d-human-pose-estimation-on-mpi-inf-3dhp
Methodology: MixSTE (T=27)
AUC: 66.5
MPJPE: 54.9
PCK: 94.4

Benchmark: 3d-human-pose-estimation-on-mpi-inf-3dhp
Methodology: MixSTE (T=1)
AUC: 63.8
MPJPE: 57.9
PCK: 94.2

Benchmark: classification-on-full-body-parkinsons
Methodology: Mixste
F1-score (weighted): 0.41

Benchmark: monocular-3d-human-pose-estimation-on-human3
Methodology: MixSTE (HRNet, T=243)
2D detector: HRNet
Average MPJPE (mm): 39.8
Frames Needed: 243
Need Ground Truth 2D Pose: No
Use Video Sequence: Yes
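
For readers unfamiliar with the metrics listed above, the following is a minimal sketch of how MPJPE and PCK are commonly computed. It is not taken from the MixSTE repository. It assumes predictions and ground truth are already root-aligned and expressed in millimetres, and the 150 mm PCK threshold and the inclusive comparison are assumptions based on common practice for MPI-INF-3DHP. P-MPJPE additionally applies a rigid (Procrustes) alignment before measuring the error, which is not shown here.

```python
import math

def joint_errors(pred, target):
    """Euclidean distance for every (frame, joint) pair.
    pred and target are nested sequences indexed as [frame][joint] -> (x, y, z)."""
    return [math.dist(p, g)
            for pred_frame, target_frame in zip(pred, target)
            for p, g in zip(pred_frame, target_frame)]

def mpjpe(pred, target):
    """Mean per-joint position error: average distance over all frames and joints."""
    errors = joint_errors(pred, target)
    return sum(errors) / len(errors)

def pck(pred, target, threshold=150.0):
    """Percentage of correct keypoints: share of joints whose error is
    within `threshold` (150 mm here is an assumed default)."""
    errors = joint_errors(pred, target)
    return 100.0 * sum(e <= threshold for e in errors) / len(errors)
```

As a quick check, a single frame with two joints whose errors are 0 mm and 5 mm gives an MPJPE of 2.5 mm, and a PCK of 50 at a 4 mm threshold. Lower is better for MPJPE, and higher is better for PCK and AUC.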
