Interactive Spatiotemporal Token Attention Network for Skeleton-based General Interactive Action Recognition
Wen Yuhang ; Tang Zixuan ; Pang Yunsheng ; Ding Beichen ; Liu Mengyuan

Abstract
Recognizing interactive actions plays an important role in human-robot interaction and collaboration. Previous methods use late fusion and co-attention mechanisms to capture interactive relations, which have limited learning capability or adapt inefficiently to more interacting entities. With the assumption that priors of each entity are already known, they also lack evaluations on a more general setting addressing the diversity of subjects. To address these problems, we propose an Interactive Spatiotemporal Token Attention Network (ISTA-Net), which simultaneously models spatial, temporal, and interactive relations. Specifically, our network contains a tokenizer to partition Interactive Spatiotemporal Tokens (ISTs), which are a unified way to represent motions of multiple diverse entities. By extending the entity dimension, ISTs provide better interactive representations. To jointly learn along three dimensions in ISTs, multi-head self-attention blocks integrated with 3D convolutions are designed to capture inter-token correlations. When modeling correlations, a strict entity ordering is usually irrelevant for recognizing interactive actions. To this end, Entity Rearrangement is proposed to eliminate the orderliness in ISTs for interchangeable entities. Extensive experiments on four datasets verify the effectiveness of ISTA-Net by outperforming state-of-the-art methods. Our code is publicly available at https://github.com/Necolizer/ISTA-Net
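
The tokenization and Entity Rearrangement ideas described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). The tensor layout `(C, T, J, E)` for channels, frames, joints, and entities, the non-overlapping window partition, and the names `entity_rearrangement`, `tokenize`, `t_win`, and `j_win` are all assumptions made for illustration.

```python
import numpy as np

def entity_rearrangement(x, rng):
    """Randomly permute the entity axis (last axis) of x, shape (C, T, J, E).

    Assumption: entities are interchangeable, so their order carries no label.
    """
    perm = rng.permutation(x.shape[-1])
    return x[..., perm]

def tokenize(x, t_win, j_win):
    """Partition x, shape (C, T, J, E), into non-overlapping tokens.

    Hypothetical scheme: each token spans t_win frames, j_win joints,
    and all E entities. Returns an array of shape
    (num_tokens, C * t_win * j_win * E).
    """
    C, T, J, E = x.shape
    assert T % t_win == 0 and J % j_win == 0, "windows must tile the input"
    x = x.reshape(C, T // t_win, t_win, J // j_win, j_win, E)
    # Move the two window-index axes to the front: (T/t, J/j, C, t, j, E)
    x = x.transpose(1, 3, 0, 2, 4, 5)
    return x.reshape((T // t_win) * (J // j_win), C * t_win * j_win * E)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 4, 2))  # 3 coords, 8 frames, 4 joints, 2 entities
tokens = tokenize(entity_rearrangement(x, rng), t_win=4, j_win=2)
print(tokens.shape)  # (4, 48)
```

In this sketch the resulting token sequence would be what a self-attention block consumes. Because the permutation is applied before tokenization, the model never sees a fixed entity order during training.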
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| 3d-action-recognition-on-assembly101 | ISTA-Net | Actions Top-1: 28.07; Object Top-1: 31.69; Verbs Top-1: 62.66 |
| action-recognition-on-h2o-2-hands-and-objects | ISTA-Net | Actions Top-1: 89.09; Hand Pose: 3D; Object Label: No; Object Pose: Yes; RGB: No |
| human-interaction-recognition-on-ntu-rgb-d-1 | ISTA-Net | Accuracy (Cross-Setup): 91.7; Accuracy (Cross-Subject): 90.5 |
| human-interaction-recognition-on-sbu | ISTA-Net | Accuracy: 98.51±1.47 |
| skeleton-based-action-recognition-on-h2o-2 | ISTA-Net | Accuracy: 89.09±1.21 |