CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

Hongwei Xue, Yuchong Sun, Bei Liu, Jianlong Fu, Ruihua Song, Houqiang Li, Jiebo Luo

Abstract

Pre-trained image-text models such as CLIP have demonstrated the strong power of vision-language representations learned from large-scale web-collected image-text data. In light of these well-learned visual features, some existing works transfer image representations to the video domain and achieve good results. However, how to utilize an image-language pre-trained model (e.g., CLIP) for video-language pre-training (post-pretraining) is still underexplored. In this paper, we investigate two questions: 1) what factors hinder post-pretraining CLIP from further improving performance on video-language tasks? and 2) how can the impact of these factors be mitigated? Through a series of comparative experiments and analyses, we find that the data scale and the domain gap between language sources have a great impact. Motivated by these findings, we propose an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP. Extensive results show that our approach improves the performance of CLIP on video-text retrieval by a large margin. Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet. We will release our code and pre-trained CLIP-ViP models at https://github.com/microsoft/XPretrain/tree/main/CLIP-ViP.

Code Repositories

microsoft/xpretrain (official, PyTorch)

Benchmarks

All metrics are for text-to-video retrieval.

Benchmark                        Methodology  Median Rank  R@1   R@5   R@10
video-retrieval-on-activitynet   CLIP-ViP     1            61.4  85.7  92.6
video-retrieval-on-didemo        CLIP-ViP     1            55.3  82    89.3
video-retrieval-on-lsmdc         CLIP-ViP     5            30.7  51.4  60.6
video-retrieval-on-msr-vtt-1ka   CLIP-ViP     1.0          57.7  80.5  88.2
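
For readers unfamiliar with the metrics above, the following is a minimal sketch of how R@K and Median Rank are commonly computed for text-to-video retrieval. It is not taken from the CLIP-ViP code. The function name `retrieval_metrics`, the toy similarity matrix, and the convention that query i matches video i are illustrative assumptions. Ties are resolved optimistically here, which is also an assumption.

```python
from statistics import median


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K (in percent) and Median Rank for text-to-video retrieval.

    Assumption: sim[i][j] is the similarity between text query i and
    video j, and the correct video for query i is video i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank of the correct video: 1 + number of videos scored strictly higher.
        rank = 1 + sum(1 for score in row if score > row[i])
        ranks.append(rank)

    n = len(ranks)
    # R@K: percentage of queries whose correct video is within the top K.
    results = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    # Median Rank: median position of the correct video (lower is better).
    results["Median Rank"] = median(ranks)
    return results


# Toy example with 3 text queries and 3 videos (hypothetical scores).
sim = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked 1st
    [0.2, 0.4, 0.8],  # query 1: correct video ranked 2nd
    [0.5, 0.6, 0.7],  # query 2: correct video ranked 1st
]
metrics = retrieval_metrics(sim, ks=(1, 2))
print(metrics)
```

Under this reading, a higher R@K is better, while a Median Rank of 1 means the correct video is the top result for at least half of the queries.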
