| ViTPose (ViTAE-G, ensemble) | 81.1 | 95.0 | 88.2 | 86.0 | 77.8 | 85.6 | ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation | |
| ViTPose (ViTAE-G) | 80.9 | 94.8 | 88.1 | 85.9 | 77.5 | 85.4 | ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation | |
| UDP-Pose-PSA (384x288) | 79.5 | 93.6 | 85.9 | 84.3 | 76.3 | 81.9 | Polarized Self-Attention: Towards High-quality Pixel-wise Regression | |
| 4xRSN-50 (ensemble) | 79.2 | 94.4 | 87.1 | 83.8 | 76.1 | 84.1 | Learning Delicate Local Representations for Multi-Person Pose Estimation | |
| UDP-Pose-PSA (256x192) | 78.9 | 93.6 | 85.8 | 83.6 | 76.1 | 81.4 | Polarized Self-Attention: Towards High-quality Pixel-wise Regression | |
| PCT (256x256) | 78.3 | 92.9 | 85.9 | - | - | - | Human Pose as Compositional Tokens | |
| HRNet-W48 + extra data | 77.0 | 92.7 | 84.5 | 83.1 | 73.4 | 82.0 | Deep High-Resolution Representation Learning for Human Pose Estimation | |
| OmniPose (WASPv2) | 76.4 | 92.6 | 83.7 | 82.6 | 72.6 | 81.2 | OmniPose: A Multi-Scale Framework for Multi-Person Pose Estimation | |
| HRFormer-B | 76.2 | 92.7 | 83.8 | 82.3 | 72.5 | 81.2 | HRFormer: High-Resolution Transformer for Dense Prediction | |
| PPE (ResNeXt-101) | 75.7 | 90.3 | 76.3 | 79.5 | 80.7 | - | Deep Multi-Task Networks For Occluded Pedestrian Pose Estimation | |
| TransPose-H-A6 | 75.0 | 92.2 | 82.3 | 81.1 | 71.3 | - | TransPose: Keypoint Localization via Transformer | |