HyperAI

A CLIP-Hitchhiker's Guide to Long Video Retrieval

Max Bain Arsha Nagrani Gül Varol Andrew Zisserman

Abstract

Our goal in this paper is the adaptation of image-text models for long video retrieval. Recent works have demonstrated state-of-the-art performance in video retrieval by adopting CLIP, effectively hitchhiking on the image-text representation for video tasks. However, there has been limited success in learning temporal aggregation methods that outperform mean-pooling of the image-level representations extracted per frame by CLIP. We find that the simple yet effective baseline of a weighted mean of frame embeddings, with weights obtained via query-scoring, is a significant improvement over mean-pooling and all prior temporal modelling attempts. In doing so, we provide an improved baseline for others to compare to, and demonstrate state-of-the-art performance of this simple baseline on a suite of long video retrieval benchmarks.
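The query-scoring idea described above can be sketched in a few lines: score each frame embedding against the text query, turn the scores into softmax weights, and take the weighted mean. This is a minimal NumPy illustration of the mechanism, not the authors' implementation; the temperature value and function name are assumptions for the example.

```python
import numpy as np

def query_scored_pooling(frame_embs, text_emb, temperature=0.07):
    """Weighted mean of frame embeddings, weighted by similarity to the query.

    frame_embs: (T, D) per-frame CLIP image embeddings (L2-normalised).
    text_emb:   (D,) CLIP text embedding of the query (L2-normalised).
    temperature: softmax temperature (illustrative value, a tunable hyperparameter).
    """
    sims = frame_embs @ text_emb                    # (T,) frame-query similarities
    weights = np.exp((sims - sims.max()) / temperature)
    weights /= weights.sum()                        # softmax over frames
    return weights @ frame_embs                     # (D,) query-conditioned video embedding

# Usage: 32 frames of 512-d embeddings, one 512-d text query.
rng = np.random.default_rng(0)
F = rng.normal(size=(32, 512))
F /= np.linalg.norm(F, axis=1, keepdims=True)
q = rng.normal(size=512)
q /= np.linalg.norm(q)
video_emb = query_scored_pooling(F, q)
```

Because the weights increase with frame-query similarity, the pooled embedding's similarity to the query is never below that of the plain mean-pooled embedding, which is the intuition behind the improvement over mean-pooling.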

Code Repositories

m-bain/clip-hitchhiker (official, PyTorch)

Benchmarks

Benchmark: zero-shot-action-recognition-on-charades-1
Methodology: CLIP-Hitchhiker (ViT-B/16, 32 frames)
Metric: mAP 21.1
