Han Fang, Pengfei Xiong, Luhui Xu, Yu Chen

Abstract
We present the CLIP2Video network, which transfers an image-language pre-training model to video-text retrieval in an end-to-end manner. Leading approaches in video-and-language learning try to distill spatio-temporal video features and the multi-modal interaction between videos and language from a large-scale video-text dataset. In contrast, we leverage a pretrained image-language model and simplify the problem into a two-stage framework: co-learning of image-text representations, followed by enhancement of the temporal relations between video frames and between video and text. This makes it possible to train on comparatively small datasets. Specifically, building on the spatial semantics captured by the Contrastive Language-Image Pretraining (CLIP) model, our model introduces a Temporal Difference Block to capture motion at fine temporal granularity across video frames, and a Temporal Alignment Block to re-align the tokens of video clips and phrases and enhance the multi-modal correlation. We conduct thorough ablation studies and achieve state-of-the-art performance on major text-to-video and video-to-text retrieval benchmarks, including new records of retrieval accuracy on MSR-VTT, MSVD and VATEX.
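The abstract describes the Temporal Difference Block only at a high level, so the following is a rough, hypothetical illustration of the underlying intuition rather than the paper's block: differences between adjacent per-frame embeddings act as a simple motion cue, while mean pooling over frames gives a video-level vector that can be compared to a text vector by cosine similarity. All function names and the toy 2-dimensional embeddings are assumptions made for this sketch.

```python
import math


def adjacent_differences(frames):
    """Difference between each pair of consecutive frame embeddings.

    Hypothetical stand-in for a motion cue; not the paper's Temporal Difference Block.
    """
    return [[b - a for a, b in zip(prev, nxt)] for prev, nxt in zip(frames, frames[1:])]


def mean_pool(vectors):
    """Average a list of equal-length vectors into a single vector."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy per-frame embeddings for a 3-frame clip (assumed values, not real CLIP outputs).
frames = [[1, 0], [1, 1], [0, 1]]
text_embedding = [1, 1]

motion = adjacent_differences(frames)      # one difference vector per adjacent frame pair
video_embedding = mean_pool(frames)        # video-level representation
score = cosine(video_embedding, text_embedding)
```

In a real system the frame and text vectors would come from a pretrained image-language encoder, and the difference vectors would feed a learned module rather than being inspected directly.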
Benchmarks
| Benchmark | Methodology | Direction | R@1 | R@5 | R@10 | R@50 | Median Rank | Mean Rank |
|---|---|---|---|---|---|---|---|---|
| video-retrieval-on-msr-vtt | CLIP2Video | text-to-video | 29.8 | 55.5 | 66.2 | n/a | 4 | 45.4 |
| video-retrieval-on-msr-vtt | CLIP2Video | video-to-text | 54.6 | 82.1 | 90.8 | n/a | 1 | 5.3 |
| video-retrieval-on-msr-vtt-1ka | CLIP2Video | text-to-video | 45.6 | 72.6 | 81.7 | n/a | 2 | 14.6 |
| video-retrieval-on-msr-vtt-1ka | CLIP2Video | video-to-text | 43.3 | 72.3 | 82.1 | n/a | 2 | 10.2 |
| video-retrieval-on-vatex | CLIP2Video | text-to-video | 57.3 | n/a | 90 | 95.5 | n/a | n/a |
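For readers unfamiliar with the metrics in the table, here is a minimal sketch of how R@K, Median Rank and Mean Rank are commonly computed from a query-by-candidate similarity matrix. It assumes one ground-truth candidate per query located on the diagonal, which is a simplification (some benchmarks pair several captions with each video); the function names and toy scores are hypothetical and not taken from the CLIP2Video code.

```python
from statistics import mean, median


def retrieval_ranks(similarity):
    """Rank of the ground-truth candidate for each query.

    similarity[i][j] is the score between query i and candidate j;
    the correct candidate for query i is assumed to be index i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        target = row[i]
        # Rank is 1 plus the number of candidates scoring strictly higher.
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def summarize(ranks, ks=(1, 5, 10)):
    """Recall at each K (as a percentage), plus median and mean rank."""
    total = len(ranks)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / total for k in ks}
    metrics["Median Rank"] = median(ranks)
    metrics["Mean Rank"] = mean(ranks)
    return metrics


# Toy text-to-video similarity matrix: 4 text queries against 4 videos.
similarity = [
    [0.9, 0.1, 0.3, 0.2],
    [0.8, 0.7, 0.1, 0.2],
    [0.2, 0.3, 0.6, 0.1],
    [0.5, 0.6, 0.7, 0.4],
]
ranks = retrieval_ranks(similarity)
metrics = summarize(ranks, ks=(1, 2))
```

Under the same one-to-one assumption, the video-to-text direction can be scored by transposing the matrix before calling `retrieval_ranks`. Higher is better for R@K, while lower is better for Median Rank and Mean Rank.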