LoCATe-GAT: Modeling Multi-Scale Local Context and Action Relationships for Zero-Shot Action Recognition

Arijit Sur, Divyam Singal, Sandipan Sarma

Abstract

The increasing number of actions in the real world makes it difficult for traditional deep-learning models to recognize unseen actions. Recently, pretrained contrastive image-based visual-language (I-VL) models have been adapted for efficient “zero-shot” scene understanding. Pairing such models with transformers for temporal modeling has proven rewarding for zero-shot action recognition (ZSAR). However, the significance of modeling the local spatial context of objects and action environments remains unexplored. In this work, we propose a ZSAR framework called LoCATe-GAT, comprising a novel Local Context-Aggregating Temporal transformer (LoCATe) and a Graph Attention Network (GAT). Specifically, image and text encodings extracted from a pretrained I-VL model are used as inputs for LoCATe-GAT. Motivated by the observation that object-centric and environmental contexts drive both distinguishability and functional similarity between actions, LoCATe captures multi-scale local context using dilated convolutional layers during temporal modeling. Furthermore, the proposed GAT models semantic relationships between classes and achieves a strong synergy with the video embeddings produced by LoCATe. Extensive experiments on four widely used benchmarks – UCF101, HMDB51, ActivityNet, and Kinetics – show that we achieve state-of-the-art results. Specifically, we obtain relative gains of 3.8% and 4.8% on these datasets in the conventional ZSAR setting, and 16.6% on UCF101 in the generalized ZSAR setting. For large-scale datasets like ActivityNet and Kinetics, our method achieves relative gains of 31.8% and 27.9%, respectively, over previous methods. Additionally, we gain 25.3% and 18.4% on UCF101 and HMDB51, respectively, under the recent “TruZe” evaluation protocol.
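To make the described architecture concrete, below is a minimal PyTorch sketch of the two components named in the abstract: a LoCATe-style block that aggregates multi-scale local temporal context with parallel dilated 1-D convolutions before transformer self-attention, and a single-head graph attention layer that refines class text embeddings over a semantic class graph. All layer sizes, dilation rates, the fusion scheme, the temporal pooling, and the class adjacency matrix `adj` are illustrative assumptions, not the authors' exact design; the frame and class-name features are assumed to come from a CLIP-style I-VL encoder.

```python
# Hedged sketch of the LoCATe-GAT idea; hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoCATeBlock(nn.Module):
    """Temporal transformer block that first injects multi-scale local
    context via parallel dilated Conv1d branches over the frame axis."""

    def __init__(self, dim: int, heads: int = 8, dilations=(1, 2, 4)):
        super().__init__()
        # One Conv1d branch per dilation rate (assumed multi-scale design).
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )
        self.fuse = nn.Linear(dim * len(dilations), dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):             # x: (batch, frames, dim)
        c = x.transpose(1, 2)         # Conv1d expects (batch, dim, frames)
        local = torch.cat([b(c) for b in self.branches], dim=1)
        x = x + self.fuse(local.transpose(1, 2))   # add fused local context
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]               # global temporal attention
        return x + self.mlp(self.norm2(x))


class GATLayer(nn.Module):
    """Single-head graph attention (Velickovic et al., 2018) over class
    text embeddings; `adj` encodes assumed semantic class relationships
    and should include self-loops so every row has a neighbor."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        self.score = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, h, adj):        # h: (classes, dim), adj: (classes, classes)
        z = self.proj(h)
        n = z.size(0)
        pairs = torch.cat([z.unsqueeze(1).expand(n, n, -1),
                           z.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.score(pairs).squeeze(-1))
        e = e.masked_fill(adj == 0, float("-inf"))   # attend to neighbors only
        return F.elu(torch.softmax(e, dim=-1) @ z)


def zero_shot_logits(frame_feats, class_texts, adj, locate, gat):
    """Score each video against every (unseen) class by cosine similarity."""
    video = locate(frame_feats).mean(dim=1)   # temporal pooling -> (B, D)
    classes = gat(class_texts, adj)           # refined class embeddings
    return F.normalize(video, dim=-1) @ F.normalize(classes, dim=-1).T


if __name__ == "__main__":
    B, T, N, D = 2, 16, 101, 512      # batch, frames, classes, feature dim
    frames = torch.randn(B, T, D)     # stand-in for CLIP-style frame features
    texts = torch.randn(N, D)         # stand-in for class-name text features
    adj = torch.eye(N)                # self-loops only, purely for illustration
    logits = zero_shot_logits(frames, texts, adj, LoCATeBlock(D), GATLayer(D))
    print(logits.shape)               # torch.Size([2, 101])
```

In this sketch, zero-shot recognition reduces to cosine similarity between the pooled video embedding and the GAT-refined embeddings of unseen class names, mirroring how contrastive I-VL models score image–text pairs; the GAT step lets related classes share semantic evidence before matching.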

Benchmarks

Benchmark                                      Methodology   Metrics
zero-shot-action-recognition-on-activitynet   LoCATe-GAT    Top-1 Accuracy: 73.8
zero-shot-action-recognition-on-hmdb51        LoCATe-GAT    Top-1 Accuracy: 50.7
zero-shot-action-recognition-on-kinetics      LoCATe-GAT    Top-1 Accuracy: 58.7
zero-shot-action-recognition-on-ucf101        LoCATe-GAT    Top-1 Accuracy: 76.0
