4 months ago

STM: SpatioTemporal and Motion Encoding for Action Recognition

Boyuan Jiang; Mengmeng Wang; Weihao Gan; Wei Wu; Junjie Yan

Abstract

Spatiotemporal and motion features are two complementary and crucial types of information for video action recognition. Recent state-of-the-art methods adopt a 3D CNN stream to learn spatiotemporal features and a separate flow stream to learn motion features. In this work, we aim to efficiently encode both kinds of features in a unified 2D framework. To this end, we first propose an STM block, which contains a Channel-wise SpatioTemporal Module (CSTM) to represent the spatiotemporal features and a Channel-wise Motion Module (CMM) to efficiently encode motion features. We then replace the original residual blocks in the ResNet architecture with STM blocks to form a simple yet effective STM network that introduces very limited extra computation cost. Extensive experiments demonstrate that the proposed STM network outperforms state-of-the-art methods on both temporal-related datasets (i.e., Something-Something v1 & v2 and Jester) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51) by encoding spatiotemporal and motion features together.
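
To make the two ideas in the abstract more concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not spell out the internals of CSTM or CMM, so everything here is an assumption made for illustration: the function names, the 3-tap per-channel temporal filter standing in for the channel-wise spatiotemporal path, the plain adjacent-frame feature difference standing in for the motion path, and the summation with a residual connection. The real modules also involve learned 2D convolutions, which are omitted.

```python
import numpy as np


def channelwise_temporal_conv(x, kernels):
    """Apply a separate 3-tap temporal filter to each channel.

    x:       features of shape (T, C, H, W)
    kernels: per-channel filter taps of shape (C, 3)  (hypothetical parameter)
    Zero padding along the time axis keeps the output at T frames.
    """
    num_frames = x.shape[0]
    padded = np.pad(x, ((1, 1), (0, 0), (0, 0), (0, 0)))
    out = np.zeros_like(x)
    for k in range(3):
        # Tap k weights the frame at temporal offset k - 1.
        out += padded[k:k + num_frames] * kernels[:, k][None, :, None, None]
    return out


def motion_features(x):
    """Feature-level difference between adjacent frames.

    The last time step has no successor, so it is filled with zeros
    to keep the output at T frames (an assumption of this sketch).
    """
    diff = x[1:] - x[:-1]
    return np.concatenate([diff, np.zeros_like(x[:1])], axis=0)


def stm_like_block(x, kernels):
    """Residual combination of the two simplified paths (illustrative only)."""
    return x + channelwise_temporal_conv(x, kernels) + motion_features(x)


# Example: 4 frames, 2 channels, 3x3 spatial features.
features = np.random.rand(4, 2, 3, 3)
taps = np.random.rand(2, 3)
output = stm_like_block(features, taps)
print(output.shape)
```

Both paths operate on 2D feature maps indexed by frame, which is the sense in which temporal and motion information can be encoded without a 3D convolution stream or a precomputed optical-flow stream.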

Benchmarks

Benchmark | Methodology | Metrics
action-classification-on-kinetics-400 | STM (ResNet-50) | Acc@1: 73.7
action-recognition-in-videos-on-hmdb-51-1 | STM (ImageNet+Kinetics pretrain) | Average accuracy of 3 splits: 72.2
action-recognition-in-videos-on-jester-1 | STM (ResNet-50, 16 frames) | Val: 96.7
action-recognition-in-videos-on-something-2 | STM (16 frames, ImageNet pretraining) | Top-1 Accuracy: 50.7
action-recognition-in-videos-on-something-3 | STM (16 frames, ImageNet pretraining) | Top-1 Accuracy: 64.2; Top-5 Accuracy: 89.8
action-recognition-in-videos-on-ucf101-2 | STM (ImageNet+Kinetics pretrain) | 3-fold Accuracy: 96.2
