Contextual Action Cues from Camera Sensor for Multi-Stream Action Recognition

Yong Won Hong, Jongkwang Hong, Bora Cho, Hyeran Byun

Abstract

In action recognition research, the two primary types of information are appearance and motion, learned from RGB images captured by visual sensors. However, depending on the characteristics of an action, contextual information, such as the presence of specific objects or globally shared information in the image, becomes vital to defining the action. For example, the presence of a ball is vital information for distinguishing “kicking” from “running”. Furthermore, some actions share typical global abstract poses, which can serve as a key to classifying actions. Based on these observations, we propose a multi-stream network model that incorporates spatial, temporal, and contextual cues in the image for action recognition. We evaluated the proposed method with C3D or Inflated 3D ConvNet (I3D) as the backbone network on two different action recognition datasets. As a result, we observed an overall improvement in accuracy, demonstrating the effectiveness of the proposed method.
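The abstract does not specify how the streams are combined, so the following is only a minimal sketch of one common multi-stream design: three 3D-CNN backbones (standing in for C3D/I3D) processing RGB, optical-flow, and contextual-cue inputs, fused by late averaging of their class logits. The stream definitions, the `StreamBackbone` and `MultiStreamActionNet` names, the fusion rule, and the class count are illustrative assumptions, not the authors' exact configuration.

```python
# Hypothetical multi-stream late-fusion sketch; not the paper's exact architecture.
import torch
import torch.nn as nn


class StreamBackbone(nn.Module):
    """Stand-in for a C3D/I3D-style backbone: 3D conv features -> class logits."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),  # global spatio-temporal pooling
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))


class MultiStreamActionNet(nn.Module):
    """Averages logits from spatial (RGB), temporal (flow), and contextual streams."""

    def __init__(self, num_classes: int = 101):
        super().__init__()
        self.spatial = StreamBackbone(in_channels=3, num_classes=num_classes)   # RGB frames
        self.temporal = StreamBackbone(in_channels=2, num_classes=num_classes)  # optical flow
        self.context = StreamBackbone(in_channels=3, num_classes=num_classes)   # contextual cue maps

    def forward(self, rgb, flow, context):
        # Late fusion: simple average of per-stream class logits (an assumption).
        return (self.spatial(rgb) + self.temporal(flow) + self.context(context)) / 3.0


# Usage: clips shaped (batch, channels, frames, height, width)
model = MultiStreamActionNet(num_classes=101)
rgb = torch.randn(2, 3, 16, 112, 112)
flow = torch.randn(2, 2, 16, 112, 112)
ctx = torch.randn(2, 3, 16, 112, 112)
print(model(rgb, flow, ctx).shape)  # torch.Size([2, 101])
```

In practice the contextual stream could take object-detection or pose-derived maps as input, but the paper's specific cue extraction is not described in this section, so the third input here is left generic.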

Benchmarks

Benchmark | Methodology | Metrics
action-recognition-in-videos-on-hmdb-51 | Multi-stream I3D | Average accuracy of 3 splits: 80.92
action-recognition-in-videos-on-ucf101 | Multi-stream I3D | 3-fold Accuracy: 97.2
