Wen-Hsien Fang, Yie-Tarng Chen, Rizard Renanda Adhi Pramono

Abstract
This paper presents a novel Hierarchical Self-Attention Network (HISAN) that generates spatio-temporal tubes for action localization in videos. The essence of HISAN is to combine a two-stream convolutional neural network (CNN) with a hierarchical bidirectional self-attention mechanism, which comprises two levels of bidirectional self-attention that efficiently capture both long-term temporal dependencies and spatial context, yielding more precise action localization. A sequence rescoring (SR) algorithm is also employed to resolve the problem of inconsistent detection scores caused by occlusion or background clutter. Moreover, a new fusion scheme is introduced that integrates not only the appearance and motion information from the two-stream network but also motion saliency, to mitigate the effect of camera motion. Experiments show that the new approach achieves performance competitive with state-of-the-art methods in terms of action localization and recognition accuracy on the widely used UCF101-24 and J-HMDB datasets.
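To make the two-level structure concrete, below is a minimal PyTorch sketch of a hierarchical bidirectional self-attention stack: one attention level over within-frame region features (spatial context) and one over the frame sequence (long-term temporal dependencies). All module names, dimensions, the use of `torch.nn.MultiheadAttention`, the region mean-pooling, and the concatenation-based fusion are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of two-level bidirectional self-attention (not the paper's code).
import torch
import torch.nn as nn

class BidirectionalSelfAttention(nn.Module):
    """Self-attention applied to a sequence and its reversal, then fused.

    A plain Transformer self-attention already attends in both directions;
    this sketch makes the bidirectionality explicit with two attention
    modules, one over the forward sequence and one over the reversed one.
    """
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.fwd = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.bwd = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, dim)
        f, _ = self.fwd(x, x, x)              # attend over the forward order
        xr = torch.flip(x, dims=[1])          # reverse the sequence
        b, _ = self.bwd(xr, xr, xr)
        b = torch.flip(b, dims=[1])           # realign to forward order
        return self.fuse(torch.cat([f, b], dim=-1))

class HierarchicalSelfAttention(nn.Module):
    """Level 1: attention over regions within each frame (spatial context).
    Level 2: attention across frames (long-term temporal dependencies)."""
    def __init__(self, dim: int):
        super().__init__()
        self.spatial = BidirectionalSelfAttention(dim)
        self.temporal = BidirectionalSelfAttention(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, regions, dim) per-frame region features
        b, t, r, d = feats.shape
        s = self.spatial(feats.reshape(b * t, r, d)).reshape(b, t, r, d)
        # pool regions (an assumption), then attend across time
        return self.temporal(s.mean(dim=2))   # (batch, time, dim)
```

In a two-stream setting, one such stack could be run per stream (appearance and motion) before the fusion step described above; that wiring is likewise an assumption of this sketch.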
Benchmarks
| Benchmark | Methodology | Frame-mAP@0.5 | Video-mAP@0.2 | Video-mAP@0.5 |
|---|---|---|---|---|
| action-detection-on-j-hmdb | HISAN (VGG-16) | 76.72 | 85.97 | 84.02 |
| action-detection-on-j-hmdb | HISAN (ResNet-101 + FPN) | n/a | 87.59 | 86.49 |
| action-detection-on-ucf101-24 | HISAN (ResNet-101 + FPN) | n/a | 82.30 | 51.47 |
| action-detection-on-ucf101-24 | HISAN (VGG-16) | 73.71 | 80.42 | 49.50 |