
Vision-by-Language for Training-Free Compositional Image Retrieval

Shyamgopal Karthik; Karsten Roth; Massimiliano Mancini; Zeynep Akata

Abstract

Given an image and a target modification (e.g., an image of the Eiffel Tower and the text "without people and at night-time"), Compositional Image Retrieval (CIR) aims to retrieve the relevant target image from a database. While supervised approaches rely on costly triplet annotations (i.e., query image, textual modification, and target image), recent research sidesteps this need by using large-scale vision-language models (VLMs), performing Zero-Shot CIR (ZS-CIR). However, state-of-the-art approaches in ZS-CIR still require training task-specific, customized models over large amounts of image-text pairs. In this work, we propose to tackle CIR in a training-free manner via Compositional Image Retrieval through Vision-by-Language (CIReVL), a simple yet human-understandable and scalable pipeline that effectively recombines large-scale VLMs with large language models (LLMs). By captioning the reference image using a pre-trained generative VLM and asking an LLM to recompose the caption based on the textual target modification for subsequent retrieval via, e.g., CLIP, we achieve modular language reasoning. On four ZS-CIR benchmarks, we find competitive, in part state-of-the-art performance, improving over supervised methods. Moreover, the modularity of CIReVL offers simple scalability without re-training, allowing us to investigate scaling laws and bottlenecks for ZS-CIR while easily scaling up to, in parts, more than double previously reported results. Finally, we show that CIReVL makes CIR human-understandable by composing image and text modularly in the language domain, thereby making it intervenable and allowing failure cases to be re-aligned post hoc. Code will be released upon acceptance.
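
The pipeline described in the abstract is fully modular: a captioner turns the reference image into text, an LLM edits that text according to the modification instruction, and a frozen CLIP model ranks the gallery against the edited description. The sketch below illustrates this flow under stated assumptions: the captioner and LLM calls are placeholder stubs (not the authors' exact models or prompts), and open_clip with a ViT-B/32 backbone stands in for the retrieval model; see the official repository (explainableml/vision_by_language) for the actual implementation.

```python
import torch
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen CLIP retrieval backbone (ViT-B/32 used here purely as an example).
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
clip_model = clip_model.to(device).eval()
tokenizer = open_clip.get_tokenizer("ViT-B-32")


def caption_image(image_path: str) -> str:
    """Stand-in for a pre-trained generative VLM captioner (assumption, not the paper's model)."""
    return "a photo of the Eiffel Tower surrounded by tourists during the day"


def recompose_caption(caption: str, modification: str) -> str:
    """Stand-in for the LLM that rewrites the caption to reflect the requested edit."""
    # A real pipeline would prompt an instruction-following LLM with the caption
    # and the modification text; we simply concatenate so the sketch runs end to end.
    return f"{caption}, {modification}"


@torch.no_grad()
def retrieve(query_image: str, modification: str, gallery_paths: list, k: int = 5):
    # 1) Vision-by-language: reference image -> caption -> recomposed target description.
    caption = caption_image(query_image)
    target_text = recompose_caption(caption, modification)

    # 2) Encode the recomposed description with CLIP's text tower.
    text_feat = clip_model.encode_text(tokenizer([target_text]).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    # 3) Encode the gallery images with CLIP's image tower.
    images = torch.stack(
        [preprocess(Image.open(p).convert("RGB")) for p in gallery_paths]
    ).to(device)
    image_feats = clip_model.encode_image(images)
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)

    # 4) Rank gallery images by cosine similarity to the text query.
    scores = (image_feats @ text_feat.T).squeeze(-1)
    top = scores.topk(min(k, len(gallery_paths))).indices.tolist()
    return [gallery_paths[i] for i in top]
```

Because every stage is frozen, swapping in a larger CLIP backbone (e.g., ViT-L/14 or ViT-G/14, as in the benchmark results below) only requires changing the model identifier, which is what allows scaling studies without any retraining.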

Code Repositories

explainableml/vision_by_language (Official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
zero-shot-composed-image-retrieval-zs-cir-on | CIReVL (CLIP B/32) | mAP@10: 15.42
zero-shot-composed-image-retrieval-zs-cir-on | CIReVL (CLIP L/14) | mAP@10: 19.01
zero-shot-composed-image-retrieval-zs-cir-on | CIReVL (CLIP G/14) | mAP@10: 27.59
zero-shot-composed-image-retrieval-zs-cir-on-1 | CIReVL (CLIP G/14) | R@1: 34.65, R@5: 64.29
zero-shot-composed-image-retrieval-zs-cir-on-1 | CIReVL (CLIP B/32) | R@1: 23.94, R@5: 52.51
zero-shot-composed-image-retrieval-zs-cir-on-1 | CIReVL (CLIP L/14) | R@1: 24.55, R@5: 52.31
zero-shot-composed-image-retrieval-zs-cir-on-11 | CIReVL (CLIP G/14) | A-R@1: 17.4
zero-shot-composed-image-retrieval-zs-cir-on-11 | CIReVL (CLIP L/14) | A-R@1: 15.9
zero-shot-composed-image-retrieval-zs-cir-on-11 | CIReVL (CLIP B/32) | A-R@1: 15.9
zero-shot-composed-image-retrieval-zs-cir-on-2 | CIReVL (CLIP L/14) | (Recall@10+Recall@50)/2: 38.56
zero-shot-composed-image-retrieval-zs-cir-on-2 | CIReVL (CLIP G/14) | (Recall@10+Recall@50)/2: 42.28
zero-shot-composed-image-retrieval-zs-cir-on-2 | CIReVL (CLIP B/32) | (Recall@10+Recall@50)/2: 38.82
