HyperAI

A CLIP-Hitchhiker's Guide to Long Video Retrieval

Max Bain Arsha Nagrani Gül Varol Andrew Zisserman

Abstract

Our goal in this paper is the adaptation of image-text models for long video retrieval. Recent works have demonstrated state-of-the-art performance in video retrieval by adopting CLIP, effectively hitchhiking on the image-text representation for video tasks. However, there has been limited success in learning temporal aggregation methods that outperform mean-pooling of the image-level representations extracted per frame by CLIP. We find that the simple yet effective baseline of a weighted mean of frame embeddings, with weights obtained via query-scoring, is a significant improvement over mean-pooling and all prior temporal modelling attempts. In doing so, we provide an improved baseline for others to compare to, and demonstrate state-of-the-art performance of this simple baseline on a suite of long video retrieval benchmarks.
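The query-scoring idea described above can be sketched in a few lines: score each frame embedding against the text query, turn the scores into softmax weights, and take the weighted mean. This is a minimal NumPy illustration of the mechanism, not the authors' implementation; the temperature value and function name are assumptions for the example.

```python
import numpy as np

def query_scored_pooling(frame_embs, text_emb, temperature=0.07):
    """Weighted mean of frame embeddings, weighted by similarity to the query.

    frame_embs: (T, D) per-frame CLIP image embeddings (L2-normalised).
    text_emb:   (D,) CLIP text embedding of the query (L2-normalised).
    temperature: softmax temperature (illustrative value, a tunable hyperparameter).
    """
    sims = frame_embs @ text_emb                    # (T,) frame-query similarities
    weights = np.exp((sims - sims.max()) / temperature)
    weights /= weights.sum()                        # softmax over frames
    return weights @ frame_embs                     # (D,) query-conditioned video embedding

# Usage: 32 frames of 512-d embeddings, one 512-d text query.
rng = np.random.default_rng(0)
F = rng.normal(size=(32, 512))
F /= np.linalg.norm(F, axis=1, keepdims=True)
q = rng.normal(size=512)
q /= np.linalg.norm(q)
video_emb = query_scored_pooling(F, q)
```

Because the weights increase with frame-query similarity, the pooled embedding's similarity to the query is never below that of the plain mean-pooled embedding, which is the intuition behind the improvement over mean-pooling.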

Code Repositories

m-bain/clip-hitchhiker (official, PyTorch)

Benchmarks

Benchmark: zero-shot-action-recognition-on-charades-1
Methodology: CLIP-Hitchhiker (ViT-B/16, 32 frames)
Metric: mAP 21.1
