Cross Modal Retrieval with Querybank Normalisation

Simion-Vlad Bogolin, Ioana Croitoru, Hailin Jin, Yang Liu, Samuel Albanie

Abstract

Profiting from large-scale training datasets, advances in neural architecture design and efficient inference, joint embeddings have become the dominant approach for tackling cross-modal retrieval. In this work we first show that, despite their effectiveness, state-of-the-art joint embeddings suffer significantly from the longstanding "hubness problem", in which a small number of gallery embeddings form the nearest neighbours of many queries. Drawing inspiration from the NLP literature, we formulate a simple but effective framework called Querybank Normalisation (QB-Norm) that re-normalises query similarities to account for hubs in the embedding space. QB-Norm improves retrieval performance without requiring retraining. Unlike prior work, we show that QB-Norm works effectively without concurrent access to any test set queries. Within the QB-Norm framework, we also propose a novel similarity normalisation method, the Dynamic Inverted Softmax, that is significantly more robust than existing approaches. We showcase QB-Norm across a range of cross-modal retrieval models and benchmarks, where it consistently enhances strong baselines beyond the state of the art. Code is available at https://vladbogo.github.io/QB-Norm/.
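The abstract does not spell out the normalisation, so the following is only a minimal sketch of the idea, not the authors' implementation. It assumes an inverted-softmax-style rule: each gallery item's score is divided by its total exponentiated similarity to a bank of held-out queries, so that hubs (items close to many queries) are penalised. The "dynamic" variant shown here rests on a further assumption: normalisation is applied only when the query's top-1 match is an item that some querybank query also retrieves first. The function names, the temperature `beta` and the toy similarity values are all hypothetical.

```python
import math


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


def inverted_softmax(query_sims, bank_sims, beta=20.0):
    """Re-normalise one query's similarities using a querybank.

    query_sims: G similarities between the test query and each gallery item.
    bank_sims:  B x G similarities between each querybank query and each
                gallery item (the querybank is NOT the test set).
    beta:       assumed temperature; larger values sharpen the penalty.
    """
    num_gallery = len(query_sims)
    # Per-item normaliser: hubs are similar to many bank queries,
    # so their normaliser is large and their score shrinks.
    norm = [
        sum(math.exp(beta * row[j]) for row in bank_sims)
        for j in range(num_gallery)
    ]
    return [math.exp(beta * query_sims[j]) / norm[j] for j in range(num_gallery)]


def dynamic_inverted_softmax(query_sims, bank_sims, beta=20.0):
    """Assumed dynamic variant: normalise only if the top-1 item looks like a hub."""
    # Activation set: gallery items that are the top-1 match of some bank query.
    activation_set = {argmax(row) for row in bank_sims}
    if argmax(query_sims) not in activation_set:
        return list(query_sims)  # leave the similarities untouched
    return inverted_softmax(query_sims, bank_sims, beta)


# Toy data: gallery item 0 is a hub (top match for every bank query).
bank_sims = [
    [0.90, 0.20, 0.10],
    [0.80, 0.30, 0.20],
    [0.85, 0.10, 0.40],
]
hub_affected_query = [0.60, 0.55, 0.10]  # raw top-1 is the hub (item 0)
unaffected_query = [0.10, 0.70, 0.20]    # raw top-1 is item 1, not a hub

renormalised = dynamic_inverted_softmax(hub_affected_query, bank_sims)
unchanged = dynamic_inverted_softmax(unaffected_query, bank_sims)
```

On this toy data the hub-affected query switches its top match from item 0 to item 1 after normalisation, while the second query is returned unchanged because its top match is not in the activation set. Because the normaliser depends only on the querybank and the gallery, no retraining and no access to other test queries is needed, which matches the claim in the abstract.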

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| metric-learning-on-stanford-online-products-1 | QB-Norm+RDML | R@1: 78.1 |
| video-retrieval-on-didemo | QB-Norm+CLIP4Clip | text-to-video R@1: 43.5; R@5: 71.4; R@10: 80.9; Median Rank: 2.0 |
| video-retrieval-on-lsmdc | QB-Norm+CLIP4Clip | text-to-video R@1: 22.4; R@5: 40.1; R@10: 49.5; Median Rank: 11.0 |
| video-retrieval-on-msr-vtt-1ka | QB-Norm+CLIP2Video | text-to-video R@1: 47.2; R@5: 73.0; R@10: 83.0; Median Rank: 2 |
| video-retrieval-on-msvd | QB-Norm+CLIP2Video | text-to-video R@1: 48.0; R@5: 77.9; R@10: 86.2; Median Rank: 2.0 |
| video-retrieval-on-queryd | QB-Norm+TT-CE+ | text-to-video R@1: 15.1 |
| video-retrieval-on-vatex | QB-Norm+CLIP2Video | text-to-video R@1: 58.8; R@10: 93.8 |
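The page reports R@K and Median Rank without defining them. As a rough illustration, and assuming the usual retrieval definitions (R@K is the percentage of queries whose correct gallery item appears in the top K results; Median Rank is the median position of the correct item), the metrics could be computed from a query-by-gallery similarity matrix as sketched below. The helper names and toy values are hypothetical, and ties are broken optimistically here, which may differ from the evaluation code behind the reported numbers.

```python
from statistics import median


def ranks_of_correct(sims, targets):
    """1-based rank of the correct gallery item for each query.

    sims:    Q x G list of lists of similarity scores.
    targets: for each query, the index of its correct gallery item.
    """
    ranks = []
    for row, target in zip(sims, targets):
        # Rank = 1 + number of gallery items scored strictly higher.
        ranks.append(1 + sum(1 for s in row if s > row[target]))
    return ranks


def recall_at_k(ranks, k):
    """Percentage of queries whose correct item is ranked within the top k."""
    return 100.0 * sum(r <= k for r in ranks) / len(ranks)


def median_rank(ranks):
    """Median position of the correct item across queries (lower is better)."""
    return median(ranks)


# Toy example: three queries, three gallery items, query i matches item i.
sims = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.5, 0.4],
]
ranks = ranks_of_correct(sims, targets=[0, 1, 2])
```

Higher R@K and lower Median Rank both indicate better retrieval, which is how the figures in the table above should be read.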
