Large-scale weakly-supervised pre-training for video action recognition
Deepti Ghadiyaram, Matt Feiszli, Du Tran, Xueting Yan, Heng Wang, Dhruv Mahajan

Abstract
Current fully-supervised video datasets consist of only a few hundred thousand videos and fewer than a thousand domain-specific labels. This hinders progress towards advanced video architectures. This paper presents an in-depth study of using large volumes of web videos for pre-training video models for the task of action recognition. Our primary empirical finding is that pre-training at a very large scale (over 65 million videos), despite the noise in social-media videos and hashtags, substantially improves the state-of-the-art on three challenging public action recognition datasets. Further, we examine three questions in the construction of weakly-supervised video action datasets. First, given that actions involve interactions with objects, how should one construct a verb-object pre-training label space to benefit transfer learning the most? Second, frame-based models perform quite well on action recognition; is pre-training for good image features sufficient, or is pre-training for spatio-temporal features valuable for optimal transfer learning? Finally, actions are generally less well-localized in long videos than in short videos; since action labels are provided at the video level, how should one choose video clips for best performance, given some fixed budget in number or minutes of videos?
Benchmarks
| Benchmark | Model | Metric |
|---|---|---|
| action-classification-on-kinetics-400 | irCSN-152 (IG-Kinetics-65M pretrain) | Acc@1: 82.8 |
| egocentric-activity-recognition-on-epic-1 | R(2+1)D-34 (kinetics) | Actions Top-1 (S2): 16.8 |
| egocentric-activity-recognition-on-epic-1 | R(2+1)D-152-SE (ig) | Actions Top-1 (S2): 25.6 |