Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, Alberto Del Bimbo

Abstract
Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption that describes the difference between the two images. The high effort and cost required for labeling datasets for CIR hamper the widespread usage of existing methods, as they rely on supervised learning. In this work, we propose a new task, Zero-Shot CIR (ZS-CIR), that aims to address CIR without requiring a labeled training dataset. Our approach, named zero-Shot composEd imAge Retrieval with textuaL invErsion (SEARLE), maps the visual features of the reference image into a pseudo-word token in CLIP token embedding space and integrates it with the relative caption. To support research on ZS-CIR, we introduce an open-domain benchmarking dataset named Composed Image Retrieval on Common Objects in context (CIRCO), which is the first dataset for CIR containing multiple ground truths for each query. The experiments show that SEARLE exhibits better performance than the baselines on the two main datasets for CIR tasks, FashionIQ and CIRR, and on the proposed CIRCO. The dataset, the code and the model are publicly available at https://github.com/miccunifi/SEARLE.
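The query-composition idea described in the abstract can be illustrated with a toy sketch. This is not the SEARLE implementation: the linear map, the mean-pooling "text encoder", the `$` placeholder, and all function names below are hypothetical stand-ins chosen for illustration. In the actual method the mapping is learned and the text encoder is CLIP's. The sketch only shows the data flow: image features become a pseudo-word token, the token takes the place of a placeholder in the caption, and the composed query is ranked against gallery embeddings.

```python
import math

def to_pseudo_token(image_features, weights):
    """Map image features to a pseudo-word token with a linear map.
    Assumption: a plain matrix stands in for the learned mapping."""
    return [sum(w * x for w, x in zip(row, image_features)) for row in weights]

def compose_query(caption_tokens, token_table, pseudo_token, placeholder="$"):
    """Look up each caption token's embedding, substituting the pseudo-word
    token wherever the (hypothetical) placeholder appears."""
    return [pseudo_token if tok == placeholder else token_table[tok]
            for tok in caption_tokens]

def encode(embeddings):
    """Toy stand-in for a text encoder: mean-pool, then L2-normalise."""
    dim = len(embeddings[0])
    pooled = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in pooled)) or 1.0
    return [v / norm for v in pooled]

def rank(query, gallery):
    """Return gallery names sorted by dot product with the query, best first.
    Gallery vectors are assumed to be unit-length, so this is cosine similarity."""
    scores = {name: sum(q * g for q, g in zip(query, vec))
              for name, vec in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)

if __name__ == "__main__":
    weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # toy 3-d -> 2-d map
    token_table = {"red": [0.0, 1.0]}               # toy token embeddings
    pseudo = to_pseudo_token([2.0, 3.0, 5.0], weights)
    query = encode(compose_query(["$", "red"], token_table, pseudo))
    gallery = {"img_a": [1.0, 0.0], "img_b": [0.0, 1.0]}
    print(rank(query, gallery))
```

Keeping the composed query in the text-embedding space means retrieval reduces to a text-to-image similarity search, which is why no labeled CIR triplets are needed at query time in this kind of setup.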