3 months ago

Holistic Features are almost Sufficient for Text-to-Video Retrieval

Xirong Li, Bangxiang Lan, Zijie Xin, Ruixiang Zhao, Kaibin Tian

Abstract

For text-to-video retrieval (T2VR), which aims to retrieve unlabeled videos by ad-hoc textual queries, CLIP-based methods currently lead the way. Compared to CLIP4Clip, which is efficient and compact, state-of-the-art models tend to compute video-text similarity through fine-grained cross-modal feature interaction and matching, putting their scalability for large-scale T2VR applications into doubt. We propose TeachCLIP, enabling a CLIP4Clip-based student network to learn from more advanced yet computationally intensive models. In order to create a learning channel to convey fine-grained cross-modal knowledge from a heavy model to the student, we add to CLIP4Clip a simple Attentional frame-Feature Aggregation (AFA) block, which by design adds no extra storage / computation overhead at the retrieval stage. Frame-text relevance scores calculated by the teacher network are used as soft labels to supervise the attentive weights produced by AFA. Extensive experiments on multiple public datasets justify the viability of the proposed method. TeachCLIP has the same efficiency and compactness as CLIP4Clip, yet has near-SOTA effectiveness.
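
To make the AFA idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify how the attention logits are produced or which loss is used, so the softmax weighting, the cross-entropy soft-label loss, and the names `afa_aggregate` and `soft_label_loss` are all assumptions made for illustration.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def afa_aggregate(
    frame_feats: List[List[float]], attn_logits: List[float]
) -> Tuple[List[float], List[float]]:
    """Collapse per-frame features into one video-level vector.

    attn_logits stands in for the output of the (hypothetical) attention
    layer inside the AFA block; here it is simply passed in.
    """
    weights = softmax(attn_logits)
    dim = len(frame_feats[0])
    video_feat = [
        sum(w * f[d] for w, f in zip(weights, frame_feats)) for d in range(dim)
    ]
    return video_feat, weights


def soft_label_loss(
    student_weights: List[float],
    teacher_scores: List[float],
    temperature: float = 1.0,
) -> float:
    """Cross-entropy between teacher soft labels and student attention weights.

    Assumption: teacher frame-text relevance scores are turned into a
    distribution with a softmax; the paper may use a different formulation.
    """
    targets = softmax(teacher_scores, temperature)
    eps = 1e-12
    return -sum(t * math.log(w + eps) for t, w in zip(targets, student_weights))


# Toy example: three frames with 2-d features and uniform attention logits.
frame_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
video_feat, weights = afa_aggregate(frame_feats, [0.0, 0.0, 0.0])

# Teacher says the third frame is most relevant to the query text.
teacher_scores = [0.1, 0.2, 0.9]
loss_uniform = soft_label_loss(weights, teacher_scores)

# A student whose attention matches the teacher incurs a lower loss.
_, aligned_weights = afa_aggregate(frame_feats, teacher_scores)
loss_aligned = soft_label_loss(aligned_weights, teacher_scores)
```

Because the aggregated output is a single vector per video, retrieval in this sketch still reduces to comparing one text vector against one stored video vector, which is consistent with the abstract's claim that AFA adds no storage or computation overhead at the retrieval stage.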

Benchmarks

| Benchmark | Methodology | Text-to-video R@1 | Text-to-video R@5 | Text-to-video R@10 |
|---|---|---|---|---|
| video-retrieval-on-msr-vtt-1ka | TeachCLIP (ViT-B/16) | 48.0 | 75.9 | 83.5 |
| video-retrieval-on-msr-vtt-1ka | TeachCLIP | 46.8 | 74.3 | 82.6 |
| video-retrieval-on-vatex | TeachCLIP | 63.6 | 91.9 | 96.1 |
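
The R@K figures in these benchmarks are recall at rank K: the percentage of text queries whose ground-truth video appears among the top K retrieved videos. The sketch below shows the standard computation on a toy similarity matrix; the function name `recall_at_k` and the toy values are illustrative and do not come from the paper.

```python
from typing import List


def recall_at_k(sim: List[List[float]], gt: List[int], k: int) -> float:
    """Percentage of queries whose ground-truth video ranks in the top k.

    sim[q][v] is the similarity between text query q and video v;
    gt[q] is the index of the correct video for query q.
    """
    hits = 0
    for q, row in enumerate(sim):
        target = row[gt[q]]
        # Rank = number of videos scoring strictly higher than the correct one.
        rank = sum(1 for s in row if s > target)
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy example: 3 queries against 3 videos, query q matches video q.
sim = [
    [0.9, 0.1, 0.3],  # correct video ranked 1st
    [0.2, 0.4, 0.8],  # correct video ranked 2nd
    [0.5, 0.6, 0.1],  # correct video ranked 3rd
]
gt = [0, 1, 2]

r1 = recall_at_k(sim, gt, 1)
r2 = recall_at_k(sim, gt, 2)
r3 = recall_at_k(sim, gt, 3)
```

In this toy case one, two, and three of the three queries are counted as hits at K = 1, 2, and 3 respectively, so recall can only stay the same or grow as K increases, matching the pattern in the reported numbers.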
