VPN: Learning Video-Pose Embedding for Activities of Daily Living

Srijan Das, Saurav Sharma, Rui Dai, Francois Bremond, Monique Thonnat

Abstract

In this paper, we focus on the spatio-temporal aspect of recognizing Activities of Daily Living (ADL). ADL have two specific properties: (i) subtle spatio-temporal patterns and (ii) similar visual patterns varying with time. As a result, different ADL may look very similar, and distinguishing them often requires examining their fine-grained details. Because recent spatio-temporal 3D ConvNets are too rigid to capture the subtle visual patterns across an action, we propose a novel Video-Pose Network (VPN). The two key components of VPN are a spatial embedding and an attention network. The spatial embedding projects the 3D poses and RGB cues into a common semantic space, which enables the action recognition framework to learn better spatio-temporal features by exploiting both modalities. To discriminate similar actions, the attention network provides two functionalities: (i) an end-to-end learnable pose backbone that exploits the topology of the human body, and (ii) a coupler that provides joint spatio-temporal attention weights across a video. Experiments show that VPN outperforms the state of the art for action classification on a large-scale human activity dataset, NTU-RGB+D 120; its subset, NTU-RGB+D 60; a challenging real-world human activity dataset, Toyota Smarthome; and a small-scale human-object interaction dataset, Northwestern UCLA.
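The idea of joint spatio-temporal attention weights driven by pose can be illustrated with a small sketch. This is not the authors' implementation: the function names, the linear projections, the softmax normalization, and the outer-product coupling are all assumptions chosen for illustration, and the feature shapes are arbitrary. It only shows how a pose feature vector could produce one weight per (frame, spatial location) pair that rescales an RGB feature map.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def project(vec, weights):
    """Linear map; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def joint_attention(pose_feat, w_spatial, w_temporal):
    """Hypothetical coupler: outer product of temporal and spatial weights."""
    spatial = softmax(project(pose_feat, w_spatial))    # length H*W
    temporal = softmax(project(pose_feat, w_temporal))  # length T
    return [[t * s for s in spatial] for t in temporal]  # T x (H*W)

def modulate(rgb_feat, attn):
    """Scale each RGB feature vector by its (frame, location) weight."""
    return [
        [[a * x for x in cell] for cell, a in zip(frame, row)]
        for frame, row in zip(rgb_feat, attn)
    ]

# Toy sizes (assumed): T frames, H*W locations, C channels, D pose dims.
random.seed(0)
T, H, W, C, D = 2, 2, 2, 3, 4
pose_feat = [random.gauss(0, 1) for _ in range(D)]
w_spatial = [[random.gauss(0, 1) for _ in range(D)] for _ in range(H * W)]
w_temporal = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
rgb_feat = [[[random.gauss(0, 1) for _ in range(C)]
             for _ in range(H * W)] for _ in range(T)]

attn = joint_attention(pose_feat, w_spatial, w_temporal)
attended = modulate(rgb_feat, attn)
```

Because both factors are softmax-normalized, the joint weights in this sketch sum to 1 over the whole video, so `attended` is a reweighted copy of `rgb_feat` with the same shape.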

Code Repositories

srijandas07/VPN (Official, tf)

Benchmarks

| Benchmark | Methodology | Metrics |
|---|---|---|
| action-classification-on-toyota-smarthome | VPN (RGB + Pose) | CS: 60.8; CV1: 43.8; CV2: 53.5 |
| action-recognition-in-videos-on-ntu-rgbd | VPN (RGB + Pose) | Accuracy (CS): 95.5; Accuracy (CV): 98.0 |
| action-recognition-in-videos-on-ntu-rgbd-120 | VPN (RGB + Pose) | Accuracy (Cross-Setup): 86.3; Accuracy (Cross-Subject): 87.8 |
| skeleton-based-action-recognition-on-n-ucla | VPN (RGB + Pose) | Accuracy: 93.5 |
| skeleton-based-action-recognition-on-ntu-rgbd-1 | VPN | Accuracy (Cross-Setup): 87.8; Accuracy (Cross-Subject): 86.3 |
