Wen-Hsien Fang, Yie-Tarng Chen, Rizard Renanda Adhi Pramono

Abstract
This paper presents a novel Hierarchical Self-Attention Network (HISAN) that generates spatio-temporal tubes for action localization in videos. The essence of HISAN is to combine a two-stream convolutional neural network (CNN) with a hierarchical bidirectional self-attention mechanism, which comprises two levels of bidirectional self-attention that efficiently capture both long-term temporal dependencies and spatial context, yielding more precise action localization. A sequence rescoring (SR) algorithm is also employed to resolve the problem of inconsistent detection scores caused by occlusion or background clutter. Moreover, a new fusion scheme is introduced that integrates not only the appearance and motion information from the two-stream network but also motion saliency, to mitigate the effect of camera motion. Experiments show that the new approach achieves performance competitive with state-of-the-art methods in terms of action localization and recognition accuracy on the widely used UCF101-24 and J-HMDB datasets.
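To make the two-level structure concrete, below is a minimal PyTorch sketch of a hierarchical bidirectional self-attention stack: one attention level over within-frame region features (spatial context) and one over the frame sequence (long-term temporal dependencies). All module names, dimensions, the use of `torch.nn.MultiheadAttention`, the region mean-pooling, and the concatenation-based fusion are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of two-level bidirectional self-attention (not the paper's code).
import torch
import torch.nn as nn

class BidirectionalSelfAttention(nn.Module):
    """Self-attention applied to a sequence and its reversal, then fused.

    A plain Transformer self-attention already attends in both directions;
    this sketch makes the bidirectionality explicit with two attention
    modules, one over the forward sequence and one over the reversed one.
    """
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.fwd = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.bwd = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, dim)
        f, _ = self.fwd(x, x, x)              # attend over the forward order
        xr = torch.flip(x, dims=[1])          # reverse the sequence
        b, _ = self.bwd(xr, xr, xr)
        b = torch.flip(b, dims=[1])           # realign to forward order
        return self.fuse(torch.cat([f, b], dim=-1))

class HierarchicalSelfAttention(nn.Module):
    """Level 1: attention over regions within each frame (spatial context).
    Level 2: attention across frames (long-term temporal dependencies)."""
    def __init__(self, dim: int):
        super().__init__()
        self.spatial = BidirectionalSelfAttention(dim)
        self.temporal = BidirectionalSelfAttention(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, regions, dim) per-frame region features
        b, t, r, d = feats.shape
        s = self.spatial(feats.reshape(b * t, r, d)).reshape(b, t, r, d)
        # pool regions (an assumption), then attend across time
        return self.temporal(s.mean(dim=2))   # (batch, time, dim)
```

In a two-stream setting, one such stack could be run per stream (appearance and motion) before the fusion step described above; that wiring is likewise an assumption of this sketch.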
Benchmarks
| Benchmark | Methodology | Frame-mAP@0.5 | Video-mAP@0.2 | Video-mAP@0.5 |
|---|---|---|---|---|
| action-detection-on-j-hmdb | HISAN (VGG-16) | 76.72 | 85.97 | 84.02 |
| action-detection-on-j-hmdb | HISAN (ResNet-101 + FPN) | n/a | 87.59 | 86.49 |
| action-detection-on-ucf101-24 | HISAN (ResNet-101 + FPN) | n/a | 82.30 | 51.47 |
| action-detection-on-ucf101-24 | HISAN (VGG-16) | 73.71 | 80.42 | 49.50 |