Du Tran; Heng Wang; Lorenzo Torresani; Jamie Ray; Yann LeCun; Manohar Paluri

Abstract
In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition. Our motivation stems from the observation that 2D CNNs applied to individual frames of the video have remained solid performers in action recognition. In this work we empirically demonstrate the accuracy advantages of 3D CNNs over 2D CNNs within the framework of residual learning. Furthermore, we show that factorizing the 3D convolutional filters into separate spatial and temporal components yields significant advantages in accuracy. Our empirical study leads to the design of a new spatiotemporal convolutional block "R(2+1)D" which gives rise to CNNs that achieve results comparable or superior to the state-of-the-art on Sports-1M, Kinetics, UCF101 and HMDB51.
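The factorization mentioned in the abstract can be illustrated with a little arithmetic. The sketch below, in plain Python, compares the weight count of a full t x d x d 3D convolution with that of a spatial 1 x d x d convolution followed by a temporal t x 1 x 1 convolution. The intermediate channel count uses the parameter-matching rule from the full paper, which the abstract itself does not state. The function names and the default kernel sizes (t = 3, d = 3) are illustrative assumptions, and biases are ignored.

```python
def conv3d_params(n_in, n_out, t=3, d=3):
    """Weights in one full 3D convolution with a t x d x d kernel (no bias)."""
    return n_in * n_out * t * d * d


def mid_channels(n_in, n_out, t=3, d=3):
    """Intermediate channels M for the (2+1)D factorization.

    Assumption: the parameter-matching rule from the full paper,
    M = floor(t * d^2 * n_in * n_out / (d^2 * n_in + t * n_out)).
    """
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)


def conv2plus1d_params(n_in, n_out, t=3, d=3):
    """Weights in a 1 x d x d spatial conv followed by a t x 1 x 1 temporal conv."""
    m = mid_channels(n_in, n_out, t, d)
    spatial = n_in * m * d * d   # 2D convolution over height and width
    temporal = m * n_out * t     # 1D convolution over time
    return spatial + temporal


if __name__ == "__main__":
    for n_in, n_out in [(64, 64), (64, 128), (128, 256)]:
        full = conv3d_params(n_in, n_out)
        factored = conv2plus1d_params(n_in, n_out)
        print(f"{n_in:>3} -> {n_out:>3}: 3D = {full}, "
              f"(2+1)D = {factored}, M = {mid_channels(n_in, n_out)}")
```

With 64 input and 64 output channels, this rule gives 144 intermediate channels and both variants have 110,592 weights. Under this matching, the factorized block has no more weights than the 3D block it replaces, so a comparison between the two is not confounded by extra capacity.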
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| action-classification-on-kinetics-400 | R[2+1]D-RGB (Sports-1M pretrain) | Acc@1: 74.3, Acc@5: 91.4 |
| action-classification-on-kinetics-400 | R[2+1]D-RGB | Acc@1: 72, Acc@5: 90 |
| action-classification-on-kinetics-400 | R[2+1]D-Two-Stream | Acc@1: 73.9, Acc@5: 90.9 |
| action-classification-on-kinetics-400 | R[2+1]D | Acc@1: 72, Acc@5: 90 |
| action-classification-on-kinetics-400 | R[2+1]D-Flow | Acc@1: 67.5, Acc@5: 87.2 |
| action-classification-on-kinetics-400 | R[2+1]D-Flow (Sports-1M pretrain) | Acc@1: 75.4, Acc@5: 91.9 |
| action-recognition-in-videos-on-hmdb-51 | R[2+1]D-Flow (Kinetics pretrained) | Average accuracy of 3 splits: 76.4 |
| action-recognition-in-videos-on-hmdb-51 | R[2+1]D-RGB (Sports1M pretrained) | Average accuracy of 3 splits: 66.6 |
| action-recognition-in-videos-on-hmdb-51 | R[2+1]D-TwoStream (Kinetics pretrained) | Average accuracy of 3 splits: 78.7 |
| action-recognition-in-videos-on-hmdb-51 | R[2+1]D-RGB (Kinetics pretrained) | Average accuracy of 3 splits: 74.5 |
| action-recognition-in-videos-on-hmdb-51 | R[2+1]D-TwoStream (Sports1M pretrained) | Average accuracy of 3 splits: 72.7 |
| action-recognition-in-videos-on-hmdb-51 | R[2+1]D-Flow (Sports1M pretrained) | Average accuracy of 3 splits: 70.1 |
| action-recognition-in-videos-on-sports-1m | R[2+1]D-Two-Stream-32frame | Video hit@1: 73.3, Video hit@5: 91.9 |
| action-recognition-in-videos-on-sports-1m | R[2+1]D-RGB-32frame | Clip hit@1: 57, Video hit@1: 73, Video hit@5: 91.5 |
| action-recognition-in-videos-on-sports-1m | R[2+1]D-Flow-32frame | Clip hit@1: 46.4, Video hit@1: 68.4, Video hit@5: 88.7 |
| action-recognition-in-videos-on-ucf101 | R[2+1]D-Flow (Sports-1M pretrained) | 3-fold Accuracy: 93.3 |
| action-recognition-in-videos-on-ucf101 | R[2+1]D-RGB (Sports-1M pretrained) | 3-fold Accuracy: 93.6 |
| action-recognition-in-videos-on-ucf101 | R[2+1]D-Flow (Kinetics pretrained) | 3-fold Accuracy: 95.5 |
| action-recognition-in-videos-on-ucf101 | R[2+1]D-TwoStream (Kinetics pretrained) | 3-fold Accuracy: 97.3 |
| action-recognition-in-videos-on-ucf101 | R[2+1]D-RGB (Kinetics pretrained) | 3-fold Accuracy: 96.8 |
| action-recognition-in-videos-on-ucf101 | R[2+1]D-TwoStream (Sports-1M pretrained) | 3-fold Accuracy: 95 |