8 months ago

Abstract

The 3D pose estimation from a single image is a challenging problem due todepth ambiguity. One type of the previous methods lifts 2D joints, obtained byresorting to external 2D pose detectors, to the 3D space. However, this type ofapproaches discards the contextual information of images which are strong cuesfor 3D pose estimation. Meanwhile, some other methods predict the jointsdirectly from monocular images but adopt a 2.5D output representation $P^{2.5D}= (u,v,z^{r})$ where both $u$ and $v$ are in the image space but $z^{r}$ inroot-relative 3D space. Thus, the ground-truth information (e.g., the depth ofroot joint from the camera) is normally utilized to transform the 2.5D outputto the 3D space, which limits the applicability in practice. In this work, wepropose a novel end-to-end framework that not only exploits the contextualinformation but also produces the output directly in the 3D space via cascadeddimension-lifting. Specifically, we decompose the task of lifting pose from 2Dimage space to 3D spatial space into several sequential sub-tasks, 1) kinematicskeletons & individual joints estimation in 2D space, 2) root-relative depthestimation, and 3) lifting to the 3D space, each of which employs directsupervisions and contextual image features to guide the learning process.Extensive experiments show that the proposed framework achievesstate-of-the-art performance on two widely used 3D human pose datasets(Human3.6M, MuPoTS-3D).