Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning

Huanjin Yao Wenhao Wu Zhiheng Li

Abstract

Large pre-trained vision models have achieved impressive success in computer vision. However, fully fine-tuning large models for downstream tasks, particularly in video understanding, can be prohibitively expensive computationally. Recent studies have turned their focus towards efficient image-to-video transfer learning. Nevertheless, existing efficient fine-tuning methods pay little attention to training memory usage and do not explore transferring larger models to the video domain. In this paper, we present Side4Video, a novel Spatial-Temporal Side Network for memory-efficient fine-tuning of large image models for video understanding. Specifically, we introduce a lightweight spatial-temporal side network attached to the frozen vision model, which avoids backpropagation through the heavy pre-trained model and utilizes multi-level spatial features from the original image model. This extremely memory-efficient architecture enables our method to reduce memory usage by 75% compared with previous adapter-based methods. In this way, we can transfer a huge ViT-E (4.4B) to video understanding tasks, which is 14x larger than ViT-L (304M). Our approach achieves remarkable performance on various video datasets across unimodal and cross-modal tasks (i.e., action recognition and text-video retrieval), especially on Something-Something V1&V2 (67.3% & 74.6%), Kinetics-400 (88.6%), MSR-VTT (52.3%), MSVD (56.1%) and VATEX (68.8%). We release our code at https://github.com/HJYao00/Side4Video.
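As a quick sanity check of the figures quoted in the abstract, the following Python sketch recomputes the ViT-E to ViT-L parameter ratio and shows what a 75% memory reduction means in absolute terms. The parameter counts come from the abstract; the 40 GB baseline and the helper names `size_ratio` and `memory_after_reduction` are hypothetical, chosen only for illustration, and are not measurements or code from the paper.

```python
# Parameter counts as quoted in the abstract.
VIT_E_PARAMS = 4.4e9   # ViT-E (4.4B)
VIT_L_PARAMS = 304e6   # ViT-L (304M)


def size_ratio(large: float, small: float) -> float:
    """How many times larger one model is than another, by parameter count."""
    return large / small


def memory_after_reduction(baseline_gb: float, reduction: float) -> float:
    """Memory left after cutting `reduction` (a fraction in [0, 1]) from a baseline."""
    return baseline_gb * (1.0 - reduction)


ratio = size_ratio(VIT_E_PARAMS, VIT_L_PARAMS)
print(f"ViT-E / ViT-L parameter ratio: {ratio:.1f}x")

# Hypothetical baseline: an adapter-based method that needs 40 GB for training.
# A 75% reduction would leave a quarter of that footprint.
print(f"Memory after 75% reduction: {memory_after_reduction(40.0, 0.75):.1f} GB")
```

The ratio comes out slightly above 14, which is consistent with the "14x larger" claim being a rounded-down figure.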

Code Repositories

whwu95/ATM (PyTorch; mentioned in GitHub)
HJYao00/Side4Video (official; PyTorch; mentioned in GitHub)

Benchmarks

Benchmark: action-classification-on-kinetics-400
Methodology: Side4Video (EVA, ViT-E/14)
Metrics: Acc@1: 88.6, Acc@5: 98.2

Benchmark: action-recognition-in-videos-on-something
Methodology: Side4Video (EVA, ViT-E/14)
Metrics: Top-1 Accuracy: 75.2, Top-5 Accuracy: 94.0

Benchmark: action-recognition-in-videos-on-something-1
Methodology: Side4Video (EVA, ViT-E/14)
Metrics: Top-1 Accuracy: 67.3, Top-5 Accuracy: 88.8

Benchmark: video-retrieval-on-msr-vtt-1ka
Methodology: Side4Video
Metrics: text-to-video R@1: 52.3, R@5: 75.5, R@10: 84.2, Median Rank: 1.0, Mean Rank: 12.8

Benchmark: video-retrieval-on-msvd
Methodology: Side4Video
Metrics: text-to-video R@1: 56.1, R@5: 81.7, R@10: 88.8, Median Rank: 1.0, Mean Rank: 8.4

Benchmark: video-retrieval-on-vatex
Methodology: Side4Video
Metrics: text-to-video R@1: 68.8, R@5: 93.5, R@10: 97.0, R@50: 1.0, MedianR: 2.7
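
The text-to-video retrieval metrics listed above (R@K, median rank, mean rank) are commonly derived from a query-by-video similarity matrix. The sketch below shows one way to compute them in plain Python. It assumes exactly one ground-truth video per text query, located at the same index as the query, and it breaks ties optimistically; the function names and the toy matrix are hypothetical, and this is not the evaluation code used for the reported numbers.

```python
from statistics import mean, median


def ranks_from_similarity(sim):
    """Return the 1-based rank of the correct video for each text query.

    sim[i][j] is the similarity between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank = 1 + number of videos scored strictly higher than the correct one.
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def recall_at_k(ranks, k):
    """Percentage of queries whose correct video appears in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def summarize(sim, ks=(1, 5, 10)):
    """Collect R@K, median rank, and mean rank for a similarity matrix."""
    ranks = ranks_from_similarity(sim)
    metrics = {f"R@{k}": recall_at_k(ranks, k) for k in ks}
    metrics["Median Rank"] = median(ranks)
    metrics["Mean Rank"] = mean(ranks)
    return metrics


# Toy 4x4 similarity matrix (made-up values, for illustration only).
toy_sim = [
    [0.9, 0.1, 0.2, 0.3],
    [0.8, 0.7, 0.1, 0.2],
    [0.1, 0.2, 0.6, 0.3],
    [0.5, 0.6, 0.7, 0.4],
]
toy_ranks = ranks_from_similarity(toy_sim)
toy_metrics = summarize(toy_sim, ks=(1, 2))
print(toy_ranks)
print(toy_metrics)
```

Higher is better for R@K, while lower is better for median and mean rank. Datasets with several captions per video need a mapping from each query to its ground-truth video instead of the index-aligned assumption used here.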
