Yuxi Li, Weiyao Lin, Tao Wang, John See, Rui Qian, Ning Xu, Limin Wang, Shugong Xu

Abstract
The task of spatial-temporal action detection has attracted increasing attention among researchers. Existing dominant methods address this problem by relying on short-term information and dense, serial detection on each individual frame or clip. Despite their effectiveness, these methods make inadequate use of long-term information and are prone to inefficiency. In this paper, we propose, for the first time, an efficient framework that generates action tube proposals from video streams with a single forward pass in a sparse-to-dense manner. The framework has two key characteristics: (1) both long-term and short-term sampled information are explicitly utilized in our spatiotemporal network, and (2) a new dynamic feature sampling module (DTS) is designed to effectively approximate the tube output while keeping the system tractable. We evaluate the efficacy of our model on the UCF101-24, JHMDB-21, and UCFSports benchmark datasets, achieving promising results that are competitive with state-of-the-art methods. The proposed sparse-to-dense strategy renders our framework about 7.6 times more efficient than the nearest competitor.
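To make the sparse-to-dense idea concrete, here is a minimal Python sketch (not the authors' implementation): boxes are detected only on sparsely sampled keyframes, and per-frame boxes for the remaining frames are filled in by linear interpolation, so a full action tube is produced from far fewer detector calls. The names `detect_boxes` and `stride` are hypothetical, introduced only for illustration.

```python
import numpy as np

def sparse_to_dense_tube(frames, detect_boxes, stride=8):
    """Detect boxes on every `stride`-th frame and linearly interpolate
    per-frame boxes for the rest, yielding a dense action tube.

    `detect_boxes(frame)` is assumed to return one (x1, y1, x2, y2) box.
    """
    T = len(frames)
    key_idx = list(range(0, T, stride))
    if key_idx[-1] != T - 1:        # ensure the last frame is a keyframe
        key_idx.append(T - 1)
    key_boxes = np.array([detect_boxes(frames[t]) for t in key_idx])  # (K, 4)

    # Interpolate each box coordinate independently across all T frames.
    dense = np.stack([np.interp(np.arange(T), key_idx, key_boxes[:, c])
                      for c in range(4)], axis=1)
    return dense  # (T, 4): one (x1, y1, x2, y2) box per frame
```

With `stride=8`, the detector runs on roughly one in eight frames, which is where the efficiency of a sparse-to-dense scheme comes from; the actual DTS module learns where and how to sample rather than using fixed interpolation.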
Benchmarks
| Benchmark | Methodology | Video-mAP@0.2 | Video-mAP@0.5 |
|---|---|---|---|
| action-detection-on-j-hmdb | DTS | 76.1 | 74.3 |
| action-detection-on-ucf-sports | DTS | 94.3 | 93.8 |
| action-detection-on-ucf101-24 | DTS | — | 54.0 |
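For reference, Video-mAP scores predicted tubes by their spatio-temporal overlap with ground-truth tubes at an IoU threshold (0.2 or 0.5 above). A common definition of tube IoU, sketched below under the assumption that a tube is a mapping from frame index to box, averages the per-frame spatial IoU over the temporally overlapping frames and scales it by the temporal IoU of the two frame spans:

```python
import numpy as np

def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(x2 - x1, 0.0) * max(y2 - y1, 0.0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def tube_iou(tube_a, tube_b):
    """Spatio-temporal IoU between two tubes (dicts: frame index -> box):
    mean spatial IoU over shared frames, weighted by temporal IoU."""
    frames_a, frames_b = set(tube_a), set(tube_b)
    shared = frames_a & frames_b
    if not shared:
        return 0.0
    spatial = np.mean([box_iou(tube_a[t], tube_b[t]) for t in shared])
    return spatial * len(shared) / len(frames_a | frames_b)
```

A detection counts as a true positive at threshold τ when its tube IoU with an unmatched ground-truth tube of the same class is at least τ; mAP is then averaged over classes in the usual way.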