Just Add $\pi$! Pose Induced Video Transformers for Understanding Activities of Daily Living
Dominick Reilly, Srijan Das

Abstract
Video transformers have become the de facto standard for human action recognition, yet their exclusive reliance on the RGB modality still limits their adoption in certain domains. One such domain is Activities of Daily Living (ADL), where RGB alone is not sufficient to distinguish between visually similar actions, or actions observed from multiple viewpoints. To facilitate the adoption of video transformers for ADL, we hypothesize that the augmentation of RGB with human pose information, known for its sensitivity to fine-grained motion and multiple viewpoints, is essential. Consequently, we introduce the first Pose Induced Video Transformer: PI-ViT (or $\pi$-ViT), a novel approach that augments the RGB representations learned by video transformers with 2D and 3D pose information. The key elements of $\pi$-ViT are two plug-in modules, 2D Skeleton Induction Module and 3D Skeleton Induction Module, that are responsible for inducing 2D and 3D pose information into the RGB representations. These modules operate by performing pose-aware auxiliary tasks, a design choice that allows $\pi$-ViT to discard the modules during inference. Notably, $\pi$-ViT achieves the state-of-the-art performance on three prominent ADL datasets, encompassing both real-world and large-scale RGB-D datasets, without requiring poses or additional computational overhead at inference.
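The train-time-only role of the two induction modules can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say which auxiliary tasks are used or how their losses are combined, so the weighted-sum objective, the class names (`AuxiliaryHead`, `PoseInducedModel`), and the toy backbone and classifier below are all hypothetical. The sketch only shows the general pattern in which auxiliary heads contribute to the training loss and are dropped for inference, leaving the prediction path unchanged.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Vector = List[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


@dataclass
class AuxiliaryHead:
    """Hypothetical train-time-only head scoring features against a pose target."""
    name: str
    weight: float
    loss_fn: Callable[[Vector, Vector], float]


@dataclass
class PoseInducedModel:
    """Toy stand-in: an RGB backbone, a classifier, and optional auxiliary heads."""
    backbone: Callable[[Vector], Vector]
    classifier: Callable[[Vector], Vector]
    aux_heads: List[AuxiliaryHead] = field(default_factory=list)

    def training_loss(
        self,
        rgb: Vector,
        label_loss_fn: Callable[[Vector], float],
        pose_targets: Dict[str, Vector],
    ) -> float:
        # Assumed objective: classification loss plus a weighted sum of
        # pose-aware auxiliary losses computed on the RGB features.
        feats = self.backbone(rgb)
        total = label_loss_fn(self.classifier(feats))
        for head in self.aux_heads:
            total += head.weight * head.loss_fn(feats, pose_targets[head.name])
        return total

    def for_inference(self) -> "PoseInducedModel":
        # Drop the auxiliary heads; backbone and classifier are reused as-is.
        return PoseInducedModel(self.backbone, self.classifier, aux_heads=[])

    def predict(self, rgb: Vector) -> Vector:
        # The prediction path never touches poses or auxiliary heads.
        return self.classifier(self.backbone(rgb))


def toy_backbone(rgb: Vector) -> Vector:
    return [2.0 * x for x in rgb]


def toy_classifier(feats: Vector) -> Vector:
    s = sum(feats)
    return [s, -s]


model = PoseInducedModel(
    backbone=toy_backbone,
    classifier=toy_classifier,
    aux_heads=[
        AuxiliaryHead("pose_2d", 0.5, mse),  # stand-in for a 2D pose task
        AuxiliaryHead("pose_3d", 0.5, mse),  # stand-in for a 3D pose task
    ],
)

rgb = [0.1, 0.2, 0.3]
pose_targets = {"pose_2d": [0.0, 0.5, 0.5], "pose_3d": [0.2, 0.4, 0.6]}
train_loss = model.training_loss(
    rgb, lambda logits: mse(logits, [1.0, 0.0]), pose_targets
)
deployed = model.for_inference()
```

In this toy setup, `deployed.predict(rgb)` takes only RGB input and returns the same output as `model.predict(rgb)`, which mirrors the abstract's claim that neither poses nor extra computation are needed at inference.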
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| action-classification-on-toyota-smarthome | π-ViT | CS: 72.9, CV1: 55.2, CV2: 64.8 |
| action-recognition-in-videos-on-ntu-rgbd | π-ViT (RGB + Pose) | Accuracy (CS): 96.3, Accuracy (CV): 99.0 |
| action-recognition-in-videos-on-ntu-rgbd | π-ViT (RGB only) | Accuracy (CS): 94.0, Accuracy (CV): 97.9 |
| action-recognition-in-videos-on-ntu-rgbd-120 | π-ViT (RGB only) | Accuracy (Cross-Setup): 91.9, Accuracy (Cross-Subject): 92.9 |
| action-recognition-in-videos-on-ntu-rgbd-120 | π-ViT (RGB + Pose) | Accuracy (Cross-Setup): 96.1, Accuracy (Cross-Subject): 95.1 |