Just Add $\pi$! Pose Induced Video Transformers for Understanding Activities of Daily Living

Dominick Reilly; Srijan Das

Abstract

Video transformers have become the de facto standard for human action recognition, yet their exclusive reliance on the RGB modality still limits their adoption in certain domains. One such domain is Activities of Daily Living (ADL), where RGB alone is not sufficient to distinguish between visually similar actions, or actions observed from multiple viewpoints. To facilitate the adoption of video transformers for ADL, we hypothesize that the augmentation of RGB with human pose information, known for its sensitivity to fine-grained motion and multiple viewpoints, is essential. Consequently, we introduce the first Pose Induced Video Transformer: PI-ViT (or $\pi$-ViT), a novel approach that augments the RGB representations learned by video transformers with 2D and 3D pose information. The key elements of $\pi$-ViT are two plug-in modules, 2D Skeleton Induction Module and 3D Skeleton Induction Module, that are responsible for inducing 2D and 3D pose information into the RGB representations. These modules operate by performing pose-aware auxiliary tasks, a design choice that allows $\pi$-ViT to discard the modules during inference. Notably, $\pi$-ViT achieves the state-of-the-art performance on three prominent ADL datasets, encompassing both real-world and large-scale RGB-D datasets, without requiring poses or additional computational overhead at inference.
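The general pattern described in the abstract, auxiliary modules that shape the shared representation during training and are then dropped, can be illustrated with a small sketch. This is not the authors' implementation: the names `AuxiliaryHead` and `PoseInducedModel`, the mean-squared-error auxiliary objective, and the loss weights are all hypothetical placeholders, and the actual auxiliary tasks used by the 2D and 3D Skeleton Induction Modules are not specified here. The sketch only shows why inference needs neither poses nor the extra modules.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Vector = List[float]


def cross_entropy(scores: Vector, label: int) -> float:
    """Softmax cross-entropy for a single sample (numerically stabilised)."""
    peak = max(scores)
    log_sum = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_sum - scores[label]


def mean_squared_error(pred: Vector, target: Vector) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


@dataclass
class AuxiliaryHead:
    """Hypothetical stand-in for a pose-aware auxiliary module.

    It maps shared RGB features to a pose-derived target; the MSE objective
    is a placeholder assumption, not the task used in the paper.
    """
    head: Callable[[Vector], Vector]
    weight: float = 1.0

    def loss(self, features: Vector, pose_target: Vector) -> float:
        return self.weight * mean_squared_error(self.head(features), pose_target)


@dataclass
class PoseInducedModel:
    backbone: Callable[[Vector], Vector]    # RGB input -> shared features
    classifier: Callable[[Vector], Vector]  # shared features -> class scores
    aux_heads: Dict[str, AuxiliaryHead] = field(default_factory=dict)

    def training_loss(self, rgb: Vector, label: int, poses: Dict[str, Vector]) -> float:
        """Classification loss plus one auxiliary loss per attached head."""
        features = self.backbone(rgb)
        loss = cross_entropy(self.classifier(features), label)
        for name, head in self.aux_heads.items():
            loss += head.loss(features, poses[name])
        return loss

    def discard_aux_heads(self) -> None:
        """Drop the auxiliary heads once training is finished."""
        self.aux_heads.clear()

    def predict(self, rgb: Vector) -> int:
        """Inference path: uses only the backbone and classifier, no poses."""
        scores = self.classifier(self.backbone(rgb))
        return max(range(len(scores)), key=lambda i: scores[i])
```

Because `predict` never touches `aux_heads` or pose inputs, calling `discard_aux_heads()` leaves predictions unchanged, which mirrors the claim that the modules add no overhead at inference.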

Code Repositories

dominickrei/pi-vit (official, PyTorch)

Benchmarks

| Benchmark | Methodology | Metrics |
|---|---|---|
| action-classification-on-toyota-smarthome | π-ViT | CS: 72.9; CV1: 55.2; CV2: 64.8 |
| action-recognition-in-videos-on-ntu-rgbd | π-ViT (RGB + Pose) | Accuracy (CS): 96.3; Accuracy (CV): 99.0 |
| action-recognition-in-videos-on-ntu-rgbd | π-ViT (RGB only) | Accuracy (CS): 94.0; Accuracy (CV): 97.9 |
| action-recognition-in-videos-on-ntu-rgbd-120 | π-ViT (RGB only) | Accuracy (Cross-Setup): 91.9; Accuracy (Cross-Subject): 92.9 |
| action-recognition-in-videos-on-ntu-rgbd-120 | π-ViT (RGB + Pose) | Accuracy (Cross-Setup): 96.1; Accuracy (Cross-Subject): 95.1 |
