CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, Tianrui Li

Abstract
Video-text retrieval plays an essential role in multi-modal research and is widely used in real-world web applications. CLIP (Contrastive Language-Image Pre-training), an image-language pre-training model, has demonstrated the power of learning visual concepts from web-collected image-text datasets. In this paper, we propose the CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner. We investigate several questions via empirical studies: 1) Are image features enough for video-text retrieval? 2) How does post-pretraining on a large-scale video-text dataset, starting from CLIP, affect performance? 3) What is a practical mechanism for modeling temporal dependency between video frames? 4) How sensitive is the model to its hyper-parameters on the video-text retrieval task? Extensive experimental results show that the CLIP4Clip model transferred from CLIP achieves state-of-the-art (SOTA) results on various video-text retrieval datasets, including MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo. We release our code at https://github.com/ArrowLuo/CLIP4Clip.
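To make questions 1) and 3) concrete, the sketch below shows one simple, parameter-free way to turn per-frame image embeddings into a single video embedding and score it against a text embedding: L2-normalize each frame, average, and take the cosine similarity. This is an illustrative baseline under stated assumptions, not the released CLIP4Clip code. The function names (`l2_normalize`, `mean_pool_frames`, `cosine_similarity`) and the toy 2-dimensional vectors are hypothetical, and the embeddings are assumed to come from some CLIP-style image and text encoder that is not shown.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def mean_pool_frames(frame_embeddings):
    """Average L2-normalized per-frame embeddings into one video embedding.

    Frame order is ignored, so this baseline carries no temporal information.
    """
    if not frame_embeddings:
        raise ValueError("need at least one frame embedding")
    normalized = [l2_normalize(f) for f in frame_embeddings]
    dim = len(normalized[0])
    pooled = [sum(f[i] for f in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(pooled)

def cosine_similarity(text_embedding, video_embedding):
    """Cosine similarity between a text embedding and a video embedding."""
    t = l2_normalize(text_embedding)
    v = l2_normalize(video_embedding)
    return sum(a * b for a, b in zip(t, v))

# Toy example: two frames with hypothetical 2-d embeddings.
frames = [[1.0, 0.0], [0.0, 1.0]]
video = mean_pool_frames(frames)
matching_score = cosine_similarity([1.0, 1.0], video)
orthogonal_score = cosine_similarity([1.0, -1.0], video)
```

Because averaging discards frame order, any method that models temporal dependency has to add something on top of this, for example a sequence model over the frame embeddings.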
Benchmarks
| Benchmark | Method | Direction | R@1 | R@5 | R@10 | R@50 | Median Rank | Mean Rank |
|---|---|---|---|---|---|---|---|---|
| text-to-video-retrieval-on-msr-vtt | CLIP4Clip | text-to-video | 44.5 | - | - | - | - | - |
| video-retrieval-on-activitynet | CLIP4Clip | text-to-video | 40.5 | 73.4 | - | 98.2 | 2 | 7.5 |
| video-retrieval-on-didemo | CLIP4Clip | text-to-video | 43.4 | 70.2 | 80.6 | - | 2.0 | 17.5 |
| video-retrieval-on-lsmdc | CLIP4Clip | text-to-video | 21.6 | 41.8 | 49.8 | - | - | 58.0 |
| video-retrieval-on-msr-vtt | CLIP4Clip-seqTransf | text-to-video | 44.5 | 71.4 | 81.6 | - | - | - |
| video-retrieval-on-msr-vtt-1ka | CLIP4Clip | text-to-video | - | - | 81.6 | - | 2 | 15.3 |
| video-retrieval-on-msr-vtt-1ka | CLIP4Clip | video-to-text | 42.7 | 70.9 | 80.6 | - | 2 | - |
| video-retrieval-on-msvd | CLIP4Clip | text-to-video | 46.2 | 76.1 | 84.6 | - | 2 | 10.0 |
| video-retrieval-on-msvd | CLIP4Clip | video-to-text | 62.0 | 87.3 | 92.6 | - | 1 | - |
| zero-shot-video-retrieval-on-lsmdc | CLIP4Clip | text-to-video | 15.1 | 28.5 | 36.4 | - | 28 | 117 |
| zero-shot-video-retrieval-on-msr-vtt | CLIP4Clip | text-to-video | 32.0 | 57.0 | 66.9 | - | 4 | 34.0 |
| zero-shot-video-retrieval-on-msvd | CLIP4Clip | text-to-video | 38.5 | 66.9 | 76.8 | - | 2 | 17.8 |
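For readers unfamiliar with the metrics in the table above, the following is a minimal sketch of how R@K (recall at K, in percent), Median Rank, and Mean Rank are commonly computed from a query-by-candidate similarity matrix. It assumes exactly one correct candidate per query, located on the diagonal, and it resolves ties in favor of the correct candidate. Benchmarks with several captions per video need extra bookkeeping that is not shown here. The function names and the toy matrix are hypothetical and are not taken from the CLIP4Clip repository.

```python
from statistics import mean, median

def ranks_from_similarity(sim):
    """Return the 1-based rank of the correct candidate for each query.

    sim[i][j] is the similarity between query i and candidate j;
    the correct candidate for query i is assumed to be j == i.
    """
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank = 1 + number of candidates scored strictly higher than the target.
        ranks.append(1 + sum(1 for s in row if s > target))
    return ranks

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K (percent), Median Rank, and Mean Rank."""
    ranks = ranks_from_similarity(sim)
    n = len(ranks)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    metrics["Median Rank"] = median(ranks)
    metrics["Mean Rank"] = mean(ranks)
    return metrics

# Toy 3x3 example: queries 0 and 2 rank their match first, query 1 ranks it third.
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.5],
    [0.1, 0.2, 0.7],
]
toy_ranks = ranks_from_similarity(toy_sim)
toy_metrics = retrieval_metrics(toy_sim)
```

Higher is better for R@K, while lower is better for Median Rank and Mean Rank. For text-to-video retrieval the rows are text queries and the columns are videos; transposing the matrix gives the video-to-text direction.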