3 months ago

CLIP2TV: Align, Match and Distill for Video-Text Retrieval

Zijian Gao, Jingyu Liu, Weiqi Sun, Sheng Chen, Dedan Chang, Lili Zhao

Abstract

Modern video-text retrieval frameworks consist of three parts: a video encoder, a text encoder, and a similarity head. Following the success of visual and textual representation learning, transformer-based encoders and fusion methods have also been adopted in video-text retrieval. In this report, we present CLIP2TV, which aims to identify the critical elements of transformer-based methods. To this end, we first revisit recent work on multi-modal learning, then introduce several of its techniques into video-text retrieval, and finally evaluate them through extensive experiments in different configurations. Notably, CLIP2TV achieves 52.9 R@1 on the MSR-VTT dataset, outperforming the previous SOTA result by 4.1%.
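The three-part structure described in the abstract can be illustrated with a minimal sketch. This is not the CLIP2TV implementation: the encoders are replaced by precomputed toy vectors, and the mean pooling over frames and cosine-similarity head are assumptions chosen for simplicity. All function names are hypothetical.

```python
import math

def mean_pool(frame_embeddings):
    """Average per-frame vectors into one video vector (assumed aggregation)."""
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(frame[d] for frame in frame_embeddings) / n for d in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def similarity_matrix(text_embeddings, video_frame_embeddings):
    """Similarity head: rows are text queries, columns are videos,
    entries are cosine similarities."""
    texts = [l2_normalize(t) for t in text_embeddings]
    videos = [l2_normalize(mean_pool(frames)) for frames in video_frame_embeddings]
    return [[sum(a * b for a, b in zip(t, v)) for v in videos] for t in texts]

# Toy stand-ins for encoder outputs (hypothetical values, 2-dimensional).
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
video_frame_embeddings = [
    [[0.9, 0.1], [1.0, 0.0]],  # video 0: two frames
    [[0.1, 0.8], [0.0, 1.0]],  # video 1: two frames
]

sim = similarity_matrix(text_embeddings, video_frame_embeddings)
# For each text query, pick the index of the highest-scoring video.
best_video_per_text = [row.index(max(row)) for row in sim]
```

In a real system the toy vectors would come from trained video and text encoders, and the similarity head could be considerably more elaborate than a dot product of normalized vectors.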

Benchmarks

Benchmark: video-retrieval-on-msr-vtt
Methodology: CLIP2TV
Metrics:
  text-to-video Mean Rank: 44.7
  text-to-video Median Rank: 3
  text-to-video R@1: 33.1
  text-to-video R@10: 68.9
  text-to-video R@5: 58.9

Benchmark: video-retrieval-on-msr-vtt-1ka
Methodology: CLIP2TV
Metrics:
  text-to-video Mean Rank: 12.8
  text-to-video Median Rank: 1
  text-to-video R@1: 52.9
  text-to-video R@10: 86.5
  text-to-video R@5: 78.5
  video-to-text Mean Rank: 9.0
  video-to-text Median Rank: 1
  video-to-text R@1: 54.1
  video-to-text R@10: 85.7
  video-to-text R@5: 77.4
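
For readers unfamiliar with the metrics listed above, the following sketch shows how R@K, Median Rank, and Mean Rank are commonly computed from a query-by-candidate similarity matrix. It assumes that the correct candidate for query i sits at index i and that ties are resolved in favor of the correct candidate; the exact evaluation protocol behind the reported numbers is not stated here, and the function names are hypothetical.

```python
from statistics import mean, median

def ranks_of_ground_truth(sim):
    """sim[i][j] is the similarity of query i to candidate j.
    Returns the 1-based rank of the correct candidate (assumed j == i)."""
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank = 1 + number of candidates scoring strictly higher.
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """R@K is the percentage of queries whose correct candidate ranks in the top K."""
    ranks = ranks_of_ground_truth(sim)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    metrics["Median Rank"] = median(ranks)
    metrics["Mean Rank"] = mean(ranks)
    return metrics

# Toy 3x3 similarity matrix (hypothetical values).
toy_sim = [
    [0.9, 0.1, 0.2],  # correct candidate ranked 1st
    [0.8, 0.3, 0.1],  # correct candidate ranked 2nd
    [0.2, 0.1, 0.7],  # correct candidate ranked 1st
]
toy_metrics = retrieval_metrics(toy_sim, ks=(1, 2))
```

With text queries as rows this gives text-to-video metrics; transposing the matrix so that videos are the queries gives the video-to-text direction. Higher is better for R@K, lower is better for the rank statistics.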
