VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer

Abstract
We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation, reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. Code is made available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT.
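To make the contrastive objective mentioned in the abstract more concrete, here is a minimal sketch of a generic symmetric InfoNCE-style loss over a batch of paired video and text embeddings. It is not the authors' implementation: the function names, the `temperature` default, and the use of plain in-batch negatives are assumptions made for illustration, and the retrieval-based hard negative sampling and overlapping-clip pairing described above are not modeled.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_diag_nll(sim):
    """Mean negative log-probability of the diagonal entry under a row-wise softmax."""
    total = 0.0
    for i, row in enumerate(sim):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_sum_exp - row[i]
    return total / len(sim)


def symmetric_info_nce(video_emb, text_emb, temperature=1.0):
    """Video-to-text plus text-to-video contrastive loss.

    video_emb[i] and text_emb[i] are assumed to be a positive pair;
    every other item in the batch serves as a negative (an assumption).
    """
    sim = [[dot(v, t) / temperature for t in text_emb] for v in video_emb]
    sim_transposed = [list(col) for col in zip(*sim)]
    return mean_diag_nll(sim) + mean_diag_nll(sim_transposed)


if __name__ == "__main__":
    videos = [[1.0, 0.0], [0.0, 1.0]]
    aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
    shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]
    print(symmetric_info_nce(videos, aligned_texts))
    print(symmetric_info_nce(videos, shuffled_texts))
```

With correctly paired embeddings the loss is lower than with shuffled pairs, which is the signal the pre-training exploits.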
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| action-segmentation-on-coin | VideoCLIP | Frame accuracy: 68.7 |
| long-video-retrieval-background-removed-on | VideoCLIP | Cap. Avg. R@1: 74.5; Cap. Avg. R@5: 94.5; Cap. Avg. R@10: 97.9; DTW R@1: 56.0; DTW R@5: 96.3; DTW R@10: 89.9; OTAM R@1: 52.8; OTAM R@5: 95.0; OTAM R@10: 89.2 |
| temporal-action-localization-on-crosstask | VideoCLIP | Recall: 47.3 |
| temporal-relation-extraction-on-vinoground | VideoCLIP | Group Score: 1.2; Text Score: 17; Video Score: 2.8 |
| video-retrieval-on-msr-vtt-1ka | VideoCLIP | text-to-video R@1: 30.9; text-to-video R@5: 55.4; text-to-video R@10: 66.8 |
| video-retrieval-on-youcook2 | VideoCLIP | text-to-video R@1: 32.2; text-to-video R@5: 62.6; text-to-video R@10: 75.0 |
| video-retrieval-on-youcook2 | VideoCLIP (zero-shot) | text-to-video R@1: 22.7; text-to-video R@5: 50.4; text-to-video R@10: 63.1 |
| zero-shot-video-retrieval-on-didemo | VideoCLIP | text-to-video R@1: 16.6; text-to-video R@5: 46.9 |
| zero-shot-video-retrieval-on-msr-vtt | VideoCLIP | text-to-video R@1: 10.4; text-to-video R@5: 22.2; text-to-video R@10: 30.0 |
| zero-shot-video-retrieval-on-youcook2 | VideoCLIP | text-to-video R@1: 22.7; text-to-video R@5: 50.4; text-to-video R@10: 63.1 |
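
The retrieval rows above report text-to-video R@K, the percentage of text queries whose matching video appears among the top K results. The sketch below shows one common way to compute it from a query-by-video similarity matrix. The function name `recall_at_k`, the convention that query `i` matches video `i`, and the toy scores are assumptions for illustration; the benchmarks' official evaluation scripts may differ, for example in how ties are handled.

```python
def recall_at_k(sim, k):
    """Percentage of queries whose ground-truth video ranks in the top k.

    sim[i][j] is the similarity between text query i and video j.
    The ground-truth video for query i is assumed to be video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank: number of videos scored strictly higher than the match.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


if __name__ == "__main__":
    # Toy similarity matrix: 3 text queries against 3 videos (made-up scores).
    toy_sim = [
        [0.9, 0.1, 0.2],
        [0.8, 0.3, 0.1],
        [0.2, 0.7, 0.4],
    ]
    for k in (1, 2, 3):
        print(f"R@{k}: {recall_at_k(toy_sim, k):.1f}")
```

Computed this way, R@K can only stay the same or grow as K increases, since a match found in the top K is also in the top K+1.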