8 months ago

Abstract

Skeleton-based action recognition, which classifies human actions based on the coordinates of joints and their connectivity within skeleton data, is widely utilized in various scenarios. While Graph Convolutional Networks (GCNs) have been proposed for skeleton data represented as graphs, they suffer from limited receptive fields constrained by joint connectivity. To address this limitation, recent advancements have introduced transformer-based methods. However, capturing correlations between all joints in all frames requires substantial memory resources. To alleviate this, we propose a novel approach called Skeletal-Temporal Transformer (SkateFormer) that partitions joints and frames based on different types of skeletal-temporal relation (Skate-Type) and performs skeletal-temporal self-attention (Skate-MSA) within each partition. We categorize the key skeletal-temporal relations for action recognition into a total of four distinct types. These types combine (i) two skeletal relation types based on physically neighboring and distant joints, and (ii) two temporal relation types based on neighboring and distant frames. Through this partition-specific attention strategy, our SkateFormer can selectively focus on key joints and frames crucial for action recognition in an action-adaptive manner with efficient computation. Extensive experiments on various benchmark datasets validate that our SkateFormer outperforms recent state-of-the-art methods.
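
The partition-then-attend idea described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`partition_axis`, `self_attention`, `partitioned_attention`), the contiguous-block versus strided grouping rule, and the identity query/key/value projections are all assumptions made for illustration. In particular, it assumes joints are ordered so that contiguous indices are physically neighboring, which real skeleton layouts need not satisfy, and it leaves open how the outputs of the four relation types are combined, since the abstract does not specify that.

```python
import itertools
import numpy as np

def partition_axis(n, size, mode):
    """Split indices 0..n-1 into groups of `size` (hypothetical grouping rule).

    'near': contiguous blocks, standing in for neighboring joints/frames.
    'far':  strided groups, standing in for distant joints/frames.
    """
    assert n % size == 0, "axis length must be divisible by the group size"
    idx = np.arange(n)
    if mode == "near":
        return idx.reshape(n // size, size)
    return idx.reshape(size, n // size).T  # stride of n // size inside a group

def self_attention(tokens):
    """Single-head scaled dot-product attention with identity projections."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

def partitioned_attention(x, t_size, v_size, t_mode, v_mode):
    """Attend only within each (frame group, joint group) partition.

    x has shape (T, V, C): frames, joints, channels.
    """
    T, V, C = x.shape
    out = np.empty_like(x)
    for t_idx in partition_axis(T, t_size, t_mode):
        for v_idx in partition_axis(V, v_size, v_mode):
            block = x[np.ix_(t_idx, v_idx)]        # (t_size, v_size, C)
            tokens = block.reshape(-1, C)          # one token per (frame, joint)
            out[np.ix_(t_idx, v_idx)] = self_attention(tokens).reshape(block.shape)
    return out

# Toy input: 8 frames, 4 joints, 3 channels (sizes chosen arbitrarily).
T, V, C, t_size, v_size = 8, 4, 3, 4, 2
x = np.random.default_rng(0).normal(size=(T, V, C))

# Four relation types: {near, far} joints combined with {near, far} frames.
outputs = {
    (v_mode, t_mode): partitioned_attention(x, t_size, v_size, t_mode, v_mode)
    for v_mode, t_mode in itertools.product(["near", "far"], repeat=2)
}

# Attention-score counts: all tokens jointly vs. within partitions only.
full_scores = (T * V) ** 2
part_scores = (T // t_size) * (V // v_size) * (t_size * v_size) ** 2
print(full_scores, part_scores)  # 1024 256
```

In this toy setting, attending over all 32 joint-frame tokens at once needs 1024 attention scores, while each partitioned variant needs 256, which is the kind of memory saving the partitioning is meant to provide.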
