Abstract

We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. Code is made available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT.
