Multi-region two-stream R-CNN for action detection
Cordelia Schmid, Xiaojiang Peng

Abstract
We propose a multi-region two-stream R-CNN model for action detection in realistic videos. We start from frame-level action detection based on faster R-CNN [1], and make three contributions: (1) we show that a motion region proposal network generates high-quality proposals, which are complementary to those of an appearance region proposal network; (2) we show that stacking optical flow over several frames significantly improves frame-level action detection; and (3) we embed a multi-region scheme in the faster R-CNN model, which adds complementary information on body parts. We then link frame-level detections with the Viterbi algorithm, and temporally localize an action with the maximum subarray method. Experimental results on the UCF-Sports, J-HMDB and UCF101 action detection datasets show that our approach outperforms the state of the art by a significant margin in both frame-mAP and video-mAP.
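The temporal localization step can be illustrated with the classic maximum subarray (Kadane's) method: given per-frame detection scores along a linked track, subtracting a per-frame bias and finding the contiguous span with maximum sum yields the action's temporal extent. This is a minimal sketch; the `bias` value and the example scores are illustrative assumptions, not values from the paper.

```python
def temporal_localize(frame_scores, bias=0.5):
    """Find the temporal extent [start, end] (inclusive) of an action
    in a linked track by running Kadane's maximum-subarray algorithm
    over (score - bias). `bias` is an illustrative per-frame penalty."""
    best_sum, best_start, best_end = float("-inf"), 0, 0
    cur_sum, cur_start = 0.0, 0
    for i, score in enumerate(frame_scores):
        v = score - bias
        # Restart the running window when it would drag the sum down.
        if cur_sum <= 0:
            cur_sum, cur_start = v, i
        else:
            cur_sum += v
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i
    return best_start, best_end, best_sum

# Hypothetical per-frame scores for one track: the high-scoring run
# from frame 1 to frame 6 is selected as the action's temporal extent.
start, end, _ = temporal_localize([0.2, 0.9, 0.8, 0.1, 0.95, 0.9, 0.85, 0.3])
```

Frames whose score falls below the bias only break the window if they outweigh the surrounding high-scoring frames, so short dips inside an action do not fragment the detection.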
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| action-detection-on-j-hmdb | MR-TS R-CNN | Frame-mAP 0.5: 58.5 Video-mAP 0.2: 74.3 Video-mAP 0.5: 73.09 |
| action-detection-on-j-hmdb | TS R-CNN | Frame-mAP 0.5: 56.9 Video-mAP 0.2: 71.1 Video-mAP 0.5: 70.6 |
| action-detection-on-ucf-sports | MR-TS R-CNN | Frame-mAP 0.5: 84.52 Video-mAP 0.2: 94.83 Video-mAP 0.5: 94.67 |
| action-detection-on-ucf-sports | TS R-CNN | Frame-mAP 0.5: 82.30 Video-mAP 0.2: 94.82 Video-mAP 0.5: 94.82 |
| action-detection-on-ucf101-24 | MR-TS R-CNN | Frame-mAP 0.5: 39.63 |
| action-detection-on-ucf101-24 | TS R-CNN | Frame-mAP 0.5: 39.94 |
| action-recognition-in-videos-on-ucf101 | MR Two-Stream R-CNN | 3-fold Accuracy: 91.1 |
| skeleton-based-action-recognition-on-j-hmdb | MR Two-Stream R-CNN | Accuracy (RGB+pose): 71.1 |