Poseidon: A ViT-based Architecture for Multi-Frame Pose Estimation with Adaptive Frame Weighting and Multi-Scale Feature Fusion

Cesare Davide Pace*, Alessandro Marco De Nunzio, Claudio De Stefano, Francesco Fontanella, Mario Molinara

Abstract

Human pose estimation, a vital task in computer vision, involves detecting and localising human joints in images and videos. While single-frame pose estimation has seen significant progress, it often fails to capture the temporal dynamics needed to understand complex, continuous movements. To address these limitations, we propose Poseidon, a novel multi-frame pose estimation architecture that extends the ViTPose model by integrating temporal information for enhanced accuracy and robustness. Poseidon introduces three key innovations: (1) an Adaptive Frame Weighting (AFW) mechanism that dynamically prioritises frames based on their relevance, ensuring that the model focuses on the most informative data; (2) a Multi-Scale Feature Fusion (MSFF) module that aggregates features from different backbone layers to capture both fine-grained details and high-level semantics; and (3) a Cross-Attention module for effective information exchange between central and contextual frames, enhancing the model's temporal coherence. The proposed architecture improves performance in complex video scenarios and offers scalability and computational efficiency suitable for real-world applications. Our approach achieves state-of-the-art performance on the PoseTrack21 and PoseTrack18 datasets, with mAP scores of 88.3 and 87.8, respectively, outperforming existing methods.
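
To make the ideas of frame weighting and cross-frame attention concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not say how frame relevance is scored or how attention is parameterised. The sketch assumes that relevance is a dot product between pooled frame features and a hypothetical learned vector (`scoring_vector`), that weights come from a softmax, and that cross-attention is single-head scaled dot-product attention with the central frame as the query. All function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def frame_weights(frame_features, scoring_vector):
    """Assumed relevance score: dot product of each frame's pooled
    feature vector with a (hypothetical) learned scoring vector,
    normalised across frames with a softmax."""
    scores = [
        sum(f * w for f, w in zip(feat, scoring_vector))
        for feat in frame_features
    ]
    return softmax(scores)


def weighted_fusion(frame_features, weights):
    """Weighted sum of per-frame feature vectors."""
    dim = len(frame_features[0])
    return [
        sum(w * feat[d] for w, feat in zip(weights, frame_features))
        for d in range(dim)
    ]


def cross_attention(central, context):
    """Single-head scaled dot-product attention (an assumption):
    the central frame is the query, and the context frames act as
    both keys and values."""
    scale = math.sqrt(len(central))
    scores = [
        sum(q * k for q, k in zip(central, ctx)) / scale
        for ctx in context
    ]
    return weighted_fusion(context, softmax(scores))


# Toy example: three frames, each with a 2-dimensional pooled feature.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = frame_weights(frames, scoring_vector=[1.0, 0.0])
fused = weighted_fusion(frames, weights)
attended = cross_attention(central=frames[1], context=[frames[0], frames[2]])
```

In this toy setup the first and third frames receive equal weight and both outweigh the second, because only the first feature dimension contributes to the assumed relevance score. A real model would learn the scoring and attention projections end to end and operate on token sequences rather than single pooled vectors.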

