
Vision-by-Language for Training-Free Compositional Image Retrieval

Shyamgopal Karthik; Karsten Roth; Massimiliano Mancini; Zeynep Akata

Abstract

Given an image and a target modification (e.g., an image of the Eiffel Tower and the text "without people and at night-time"), Compositional Image Retrieval (CIR) aims to retrieve the relevant target image from a database. While supervised approaches rely on costly triplet annotations (i.e., query image, textual modification, and target image), recent research sidesteps this need by using large-scale vision-language models (VLMs), performing Zero-Shot CIR (ZS-CIR). However, state-of-the-art approaches in ZS-CIR still require training task-specific, customized models over large amounts of image-text pairs. In this work, we propose to tackle CIR in a training-free manner via Compositional Image Retrieval through Vision-by-Language (CIReVL), a simple yet human-understandable and scalable pipeline that effectively recombines large-scale VLMs with large language models (LLMs). By captioning the reference image using a pre-trained generative VLM and asking an LLM to recompose the caption based on the textual target modification for subsequent retrieval via, e.g., CLIP, we achieve modular language reasoning. On four ZS-CIR benchmarks, we find competitive, in part state-of-the-art performance, improving over supervised methods. Moreover, the modularity of CIReVL offers simple scalability without re-training, allowing us to investigate scaling laws and bottlenecks for ZS-CIR while easily scaling up to, in parts, more than double previously reported results. Finally, we show that CIReVL makes CIR human-understandable by composing image and text modularly in the language domain, thereby making it intervenable and allowing failure cases to be re-aligned post hoc. Code will be released upon acceptance.
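
The pipeline described in the abstract is fully modular: a captioner turns the reference image into text, an LLM edits that text according to the modification instruction, and a frozen CLIP model ranks the gallery against the edited description. The sketch below illustrates this flow under stated assumptions: the captioner and LLM calls are placeholder stubs (not the authors' exact models or prompts), and open_clip with a ViT-B/32 backbone stands in for the retrieval model; see the official repository (explainableml/vision_by_language) for the actual implementation.

```python
import torch
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen CLIP retrieval backbone (ViT-B/32 used here purely as an example).
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
clip_model = clip_model.to(device).eval()
tokenizer = open_clip.get_tokenizer("ViT-B-32")


def caption_image(image_path: str) -> str:
    """Stand-in for a pre-trained generative VLM captioner (assumption, not the paper's model)."""
    return "a photo of the Eiffel Tower surrounded by tourists during the day"


def recompose_caption(caption: str, modification: str) -> str:
    """Stand-in for the LLM that rewrites the caption to reflect the requested edit."""
    # A real pipeline would prompt an instruction-following LLM with the caption
    # and the modification text; we simply concatenate so the sketch runs end to end.
    return f"{caption}, {modification}"


@torch.no_grad()
def retrieve(query_image: str, modification: str, gallery_paths: list, k: int = 5):
    # 1) Vision-by-language: reference image -> caption -> recomposed target description.
    caption = caption_image(query_image)
    target_text = recompose_caption(caption, modification)

    # 2) Encode the recomposed description with CLIP's text tower.
    text_feat = clip_model.encode_text(tokenizer([target_text]).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    # 3) Encode the gallery images with CLIP's image tower.
    images = torch.stack(
        [preprocess(Image.open(p).convert("RGB")) for p in gallery_paths]
    ).to(device)
    image_feats = clip_model.encode_image(images)
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)

    # 4) Rank gallery images by cosine similarity to the text query.
    scores = (image_feats @ text_feat.T).squeeze(-1)
    top = scores.topk(min(k, len(gallery_paths))).indices.tolist()
    return [gallery_paths[i] for i in top]
```

Because every stage is frozen, swapping in a larger CLIP backbone (e.g., ViT-L/14 or ViT-G/14, as in the benchmark results below) only requires changing the model identifier, which is what allows scaling studies without any retraining.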

Code Repositories

explainableml/vision_by_language (Official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
zero-shot-composed-image-retrieval-zs-cir-on | CIReVL (CLIP B/32) | mAP@10: 15.42
zero-shot-composed-image-retrieval-zs-cir-on | CIReVL (CLIP L/14) | mAP@10: 19.01
zero-shot-composed-image-retrieval-zs-cir-on | CIReVL (CLIP G/14) | mAP@10: 27.59
zero-shot-composed-image-retrieval-zs-cir-on-1 | CIReVL (CLIP G/14) | R@1: 34.65, R@5: 64.29
zero-shot-composed-image-retrieval-zs-cir-on-1 | CIReVL (CLIP B/32) | R@1: 23.94, R@5: 52.51
zero-shot-composed-image-retrieval-zs-cir-on-1 | CIReVL (CLIP L/14) | R@1: 24.55, R@5: 52.31
zero-shot-composed-image-retrieval-zs-cir-on-11 | CIReVL (CLIP G/14) | A-R@1: 17.4
zero-shot-composed-image-retrieval-zs-cir-on-11 | CIReVL (CLIP L/14) | A-R@1: 15.9
zero-shot-composed-image-retrieval-zs-cir-on-11 | CIReVL (CLIP B/32) | A-R@1: 15.9
zero-shot-composed-image-retrieval-zs-cir-on-2 | CIReVL (CLIP L/14) | (Recall@10+Recall@50)/2: 38.56
zero-shot-composed-image-retrieval-zs-cir-on-2 | CIReVL (CLIP G/14) | (Recall@10+Recall@50)/2: 42.28
zero-shot-composed-image-retrieval-zs-cir-on-2 | CIReVL (CLIP B/32) | (Recall@10+Recall@50)/2: 38.82
