Simion-Vlad Bogolin Ioana Croitoru Hailin Jin Yang Liu Samuel Albanie

Abstract
Profiting from large-scale training datasets, advances in neural architecture design and efficient inference, joint embeddings have become the dominant approach for tackling cross-modal retrieval. In this work, we first show that, despite their effectiveness, state-of-the-art joint embeddings suffer significantly from the longstanding "hubness problem", in which a small number of gallery embeddings form the nearest neighbours of many queries. Drawing inspiration from the NLP literature, we formulate a simple but effective framework called Querybank Normalisation (QB-Norm) that re-normalises query similarities to account for hubs in the embedding space. QB-Norm improves retrieval performance without requiring retraining. Unlike prior work, we show that QB-Norm works effectively without concurrent access to any test set queries. Within the QB-Norm framework, we also propose a novel similarity normalisation method, the Dynamic Inverted Softmax, that is significantly more robust than existing approaches. We showcase QB-Norm across a range of cross-modal retrieval models and benchmarks, where it consistently enhances strong baselines beyond the state of the art. Code is available at https://vladbogo.github.io/QB-Norm/.
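To make the idea of re-normalising query similarities against a querybank concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the inverted softmax form commonly used in the NLP literature, a hypothetical temperature `beta`, and one plausible reading of the "dynamic" variant in which normalisation is applied only when a query's top-ranked gallery item is also the top-ranked item for some querybank query. The function names and the toy similarity values are illustrative only.

```python
import math


def inverted_softmax(query_sims, bank_sims, beta=20.0):
    """Rescale one query's similarities by how strongly each gallery
    item is matched by the querybank (assumed inverted softmax form).

    query_sims: similarities between one query and every gallery item.
    bank_sims:  one row per querybank query, same gallery ordering.
    beta:       hypothetical temperature, not taken from the source.
    """
    n_gallery = len(query_sims)
    # Per-gallery-item normaliser: large for items that many
    # querybank queries are similar to (potential hubs).
    norm = [
        sum(math.exp(beta * row[j]) for row in bank_sims)
        for j in range(n_gallery)
    ]
    return [math.exp(beta * query_sims[j]) / norm[j] for j in range(n_gallery)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def dynamic_inverted_softmax(query_sims, bank_sims, beta=20.0):
    """Assumed dynamic variant: normalise only if the query's top-1
    gallery item is top-1 for at least one querybank query."""
    activation_set = {argmax(row) for row in bank_sims}
    if argmax(query_sims) in activation_set:
        return inverted_softmax(query_sims, bank_sims, beta)
    # Top-1 item is not a suspected hub: leave similarities untouched.
    return list(query_sims)


# Toy data: gallery item 0 behaves like a hub for the querybank.
bank_sims = [
    [0.90, 0.20, 0.10],
    [0.80, 0.30, 0.20],
    [0.85, 0.10, 0.40],
]
hub_query = [0.60, 0.55, 0.10]    # raw top-1 is the hub (item 0)
clean_query = [0.10, 0.20, 0.70]  # raw top-1 is item 2, not a hub

hub_result = dynamic_inverted_softmax(hub_query, bank_sims)
clean_result = dynamic_inverted_softmax(clean_query, bank_sims)
```

In this toy setup the hub item is down-weighted for `hub_query`, so its ranking changes, while `clean_query` is returned unchanged because its top-ranked item is not in the activation set. The querybank here is just a list of precomputed similarity rows; how such a bank is built in practice is not specified on this page.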
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| metric-learning-on-stanford-online-products-1 | QB-Norm+RDML | R@1: 78.1 |
| video-retrieval-on-didemo | QB-Norm+CLIP4Clip | text-to-video R@1: 43.5; R@5: 71.4; R@10: 80.9; Median Rank: 2.0 |
| video-retrieval-on-lsmdc | QB-Norm+CLIP4Clip | text-to-video R@1: 22.4; R@5: 40.1; R@10: 49.5; Median Rank: 11.0 |
| video-retrieval-on-msr-vtt-1ka | QB-Norm+CLIP2Video | text-to-video R@1: 47.2; R@5: 73.0; R@10: 83.0; Median Rank: 2 |
| video-retrieval-on-msvd | QB-Norm+CLIP2Video | text-to-video R@1: 48.0; R@5: 77.9; R@10: 86.2; Median Rank: 2.0 |
| video-retrieval-on-queryd | QB-Norm+TT-CE+ | text-to-video R@1: 15.1 |
| video-retrieval-on-vatex | QB-Norm+CLIP2Video | text-to-video R@1: 58.8; R@10: 93.8 |