Large-Scale Video Classification with Convolutional Neural Networks
Li Fei-Fei, Rahul Sukthankar, Thomas Leung, George Toderici, Sanketh Shetty, Andrej Karpathy

Abstract
Convolutional Neural Networks (CNNs) have been established as a powerful class of models for image recognition problems. Encouraged by these results, we provide an extensive empirical evaluation of CNNs on large-scale video classification using a new dataset of 1 million YouTube videos belonging to 487 classes. We study multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information and suggest a multiresolution, foveated architecture as a promising way of speeding up training. Our best spatio-temporal networks display significant performance improvements compared to strong feature-based baselines (55.3% to 63.9%), but only a surprisingly modest improvement compared to single-frame models (59.3% to 60.9%). We further study the generalization performance of our best model by retraining the top layers on the UCF-101 Action Recognition dataset and observe significant performance improvements compared to the UCF-101 baseline model (63.3% up from 43.9%).
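The multiresolution, foveated idea mentioned in the abstract can be illustrated with a small preprocessing sketch. It assumes (this is not stated in the abstract) that each frame is split into two half-resolution inputs: a "context" stream holding the whole frame downsampled by 2, and a "fovea" stream holding a full-resolution center crop. The function names `center_crop`, `downsample_2x`, and `foveated_streams` are hypothetical, and frames are plain square 2D lists of pixel values.

```python
def center_crop(frame, size):
    """Return the central size x size window of a 2D frame (list of rows)."""
    height, width = len(frame), len(frame[0])
    top, left = (height - size) // 2, (width - size) // 2
    return [row[left:left + size] for row in frame[top:top + size]]


def downsample_2x(frame):
    """Halve the resolution by averaging each non-overlapping 2x2 block."""
    height = len(frame) // 2 * 2  # drop a trailing odd row/column, if any
    width = len(frame[0]) // 2 * 2
    return [
        [
            (frame[r][c] + frame[r][c + 1]
             + frame[r + 1][c] + frame[r + 1][c + 1]) / 4.0
            for c in range(0, width, 2)
        ]
        for r in range(0, height, 2)
    ]


def foveated_streams(frame):
    """Split a square frame into (context, fovea) inputs of half the side length.

    Assumption: context = whole frame at half resolution,
    fovea = full-resolution crop of the frame center.
    """
    half = len(frame) // 2
    context = downsample_2x(frame)
    fovea = center_crop(frame, half)
    return context, fovea


if __name__ == "__main__":
    frame = [[4 * r + c for c in range(4)] for r in range(4)]
    context, fovea = foveated_streams(frame)
    print("context:", context)
    print("fovea:", fovea)
```

Under this assumption, each stream sees a quarter of the pixels of the original frame, which is where a training speed-up would plausibly come from: two small inputs are cheaper to process than one full-resolution input.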
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| action-recognition-in-videos-on-sports-1m | DeepVideo’s Slow Fusion | Clip Hit@1: 41.9; Video Hit@1: 60.9; Video Hit@5: 80.2 |
| action-recognition-in-videos-on-ucf101 | Slow Fusion + Finetune top 3 layers | 3-fold Accuracy: 65.4 |
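
The Hit@k metrics in the table can be read as "the true class appears among the k highest-scoring predictions". The sketch below shows one way to compute them, under two assumptions that this page does not confirm: each sample has a single ground-truth label, and a video-level score is obtained by averaging the per-class scores of its clips. All function names are hypothetical.

```python
def average_clip_scores(clip_scores):
    """Average per-class scores over the clips of one video.

    clip_scores: list of clips, each a list of per-class scores.
    Assumption: video-level prediction = mean of clip-level predictions.
    """
    n_clips = len(clip_scores)
    return [sum(per_class) / n_clips for per_class in zip(*clip_scores)]


def hit_at_k(scores, true_label, k):
    """True if true_label is among the k classes with the highest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return true_label in ranked[:k]


def hit_rate(all_scores, labels, k):
    """Fraction of samples whose true label is in the top k predictions."""
    hits = sum(hit_at_k(s, y, k) for s, y in zip(all_scores, labels))
    return hits / len(labels)


if __name__ == "__main__":
    # Three clips from one video, three classes.
    clips = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.6, 0.3]]
    video = average_clip_scores(clips)
    print("clip Hit@1:", hit_rate(clips, [1, 1, 1], k=1))
    print("video Hit@1:", hit_at_k(video, 1, k=1))
```

In this toy input the first clip ranks the wrong class first, yet the averaged video-level score still ranks the true class first, which illustrates how a video-level Hit@1 can exceed the clip-level one.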