Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning
Huanjin Yao Wenhao Wu Zhiheng Li

Abstract
Large pre-trained vision models have achieved impressive success in computer vision. However, fully fine-tuning large models for downstream tasks, particularly in video understanding, can be prohibitively expensive computationally. Recent studies have therefore turned to efficient image-to-video transfer learning. Nevertheless, existing efficient fine-tuning methods pay little attention to training memory usage and do not explore transferring larger models to the video domain. In this paper, we present Side4Video, a novel Spatial-Temporal Side Network for memory-efficient fine-tuning of large image models for video understanding. Specifically, we introduce a lightweight spatial-temporal side network attached to the frozen vision model, which avoids backpropagation through the heavy pre-trained model and utilizes multi-level spatial features from the original image model. This extremely memory-efficient architecture enables our method to reduce memory usage by 75% compared with previous adapter-based methods. As a result, we can transfer a huge ViT-E (4.4B) to video understanding tasks, a model 14x larger than ViT-L (304M). Our approach achieves remarkable performance on various video datasets across unimodal and cross-modal tasks (i.e., action recognition and text-video retrieval), notably on Something-Something V1&V2 (67.3% & 74.6%), Kinetics-400 (88.6%), MSR-VTT (52.3%), MSVD (56.1%) and VATEX (68.8%). We release our code at https://github.com/HJYao00/Side4Video.
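The core idea in the abstract, a small trainable side network that reads multi-level features from a frozen backbone so that no gradients flow through the backbone, can be sketched in PyTorch roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `ToyBackbone` and `ToySideNetwork`, the layer choices (linear blocks, a 1D temporal convolution) and all dimensions are hypothetical stand-ins chosen only to show the gradient flow.

```python
import torch
import torch.nn as nn


class ToyBackbone(nn.Module):
    """Hypothetical stand-in for a frozen pre-trained image model (not a real ViT)."""

    def __init__(self, dim=64, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(depth)]
        )

    def forward(self, x):
        feats = []
        for blk in self.blocks:
            x = blk(x)
            feats.append(x)  # keep every level's output for the side network
        return feats


class ToySideNetwork(nn.Module):
    """Hypothetical lightweight network that fuses multi-level backbone features."""

    def __init__(self, dim=64, side_dim=16, depth=4, num_classes=10):
        super().__init__()
        self.down = nn.ModuleList([nn.Linear(dim, side_dim) for _ in range(depth)])
        self.temporal = nn.ModuleList(
            [nn.Conv1d(side_dim, side_dim, kernel_size=3, padding=1) for _ in range(depth)]
        )
        self.head = nn.Linear(side_dim, num_classes)

    def forward(self, feats):
        # Each element of feats has shape (batch, frames, tokens, dim).
        h = 0
        for down, temporal, f in zip(self.down, self.temporal, feats):
            h = h + down(f)  # inject this level's spatial feature
            b, t, n, c = h.shape
            z = h.permute(0, 2, 3, 1).reshape(b * n, c, t)  # mix along the frame axis
            z = temporal(z)
            h = h + z.reshape(b, n, c, t).permute(0, 3, 1, 2)  # residual connection
        return self.head(h.mean(dim=(1, 2)))  # pool over frames and tokens


backbone = ToyBackbone().eval()
for p in backbone.parameters():
    p.requires_grad_(False)  # the backbone stays frozen
side = ToySideNetwork()

video = torch.randn(2, 8, 5, 64)  # (batch, frames, tokens, dim)
with torch.no_grad():  # no activation graph is stored for the backbone
    feats = backbone(video)
logits = side(feats)
logits.sum().backward()  # gradients reach only the side network
```

Because the backbone forward pass runs under `torch.no_grad()`, its intermediate activations are not retained for backpropagation, which is where the memory saving in this kind of design would be expected to come from; only the side network's parameters receive gradients.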
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| action-classification-on-kinetics-400 | Side4Video (EVA ViT-E/14) | Acc@1: 88.6; Acc@5: 98.2 |
| action-recognition-in-videos-on-something | Side4Video (EVA ViT-E/14) | Top-1 Accuracy: 75.2; Top-5 Accuracy: 94.0 |
| action-recognition-in-videos-on-something-1 | Side4Video (EVA ViT-E/14) | Top-1 Accuracy: 67.3; Top-5 Accuracy: 88.8 |
| video-retrieval-on-msr-vtt-1ka | Side4Video | text-to-video Mean Rank: 12.8; text-to-video Median Rank: 1.0; text-to-video R@1: 52.3; text-to-video R@10: 84.2; text-to-video R@5: 75.5 |
| video-retrieval-on-msvd | Side4Video | text-to-video Mean Rank: 8.4; text-to-video Median Rank: 1.0; text-to-video R@1: 56.1; text-to-video R@10: 88.8; text-to-video R@5: 81.7 |
| video-retrieval-on-vatex | Side4Video | text-to-video MedianR: 2.7; text-to-video R@1: 68.8; text-to-video R@10: 97.0; text-to-video R@5: 93.5; text-to-video R@50: 1.0 |
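
For readers unfamiliar with the text-to-video retrieval metrics in the table (R@K, Median Rank, Mean Rank), the following minimal sketch shows how they are commonly computed from a text-video similarity matrix. It assumes a one-to-one pairing in which text query `i` matches video `i`; the function name `retrieval_metrics` and the toy similarity values are hypothetical, and actual benchmark protocols (for example, datasets with several captions per video) may differ.

```python
from statistics import mean, median


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """sim[i][j] is the similarity of text query i and video j; the match is j == i."""
    ranks = []
    for i, row in enumerate(sim):
        # Rank = 1 + number of videos scored strictly higher than the correct one.
        ranks.append(1 + sum(score > row[i] for score in row))
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    metrics["MedianR"] = median(ranks)
    metrics["MeanR"] = mean(ranks)
    return metrics


# Toy 3x3 similarity matrix (made-up values for illustration only).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]
metrics = retrieval_metrics(sim, ks=(1, 2))
print(metrics)
```

Higher R@K is better, while lower Median Rank and Mean Rank are better.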