5 months ago

Zero-Shot Composed Image Retrieval with Textual Inversion

Baldrati Alberto ; Agnolucci Lorenzo ; Bertini Marco ; Del Bimbo Alberto

Abstract

Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption that describes the difference between the two images. The high effort and cost required for labeling datasets for CIR hamper the widespread usage of existing methods, as they rely on supervised learning. In this work, we propose a new task, Zero-Shot CIR (ZS-CIR), that aims to address CIR without requiring a labeled training dataset. Our approach, named zero-Shot composEd imAge Retrieval with textuaL invErsion (SEARLE), maps the visual features of the reference image into a pseudo-word token in CLIP token embedding space and integrates it with the relative caption. To support research on ZS-CIR, we introduce an open-domain benchmarking dataset named Composed Image Retrieval on Common Objects in context (CIRCO), which is the first dataset for CIR containing multiple ground truths for each query. The experiments show that SEARLE exhibits better performance than the baselines on the two main datasets for CIR tasks, FashionIQ and CIRR, and on the proposed CIRCO. The dataset, the code and the model are publicly available at https://github.com/miccunifi/SEARLE.
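
The core idea in the abstract, replacing a placeholder word with a pseudo-word token derived from the reference image, can be illustrated with a toy sketch. This is not the SEARLE implementation: the linear mapping, the embedding lookup, the prompt template "a photo of $ that ..." and every function name below are hypothetical stand-ins chosen for illustration, and the 4-dimensional vectors are far smaller than real CLIP embeddings.

```python
import random

EMBED_DIM = 4  # toy size (assumption); real CLIP token embeddings are much larger


def toy_token_embedding(token: str) -> list[float]:
    """Hypothetical stand-in for a CLIP token embedding lookup."""
    rng = random.Random(token)  # seeded by the token, so the lookup is deterministic
    return [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]


def map_image_to_pseudo_word(image_features: list[float],
                             weights: list[list[float]]) -> list[float]:
    """Hypothetical mapping network: one linear layer from image features
    to a vector living in the token embedding space."""
    return [sum(w * x for w, x in zip(row, image_features)) for row in weights]


def build_query_embeddings(image_features, relative_caption, weights,
                           placeholder="$"):
    """Build the token sequence for the composed query and swap the
    placeholder's embedding for the image-derived pseudo-word."""
    prefix = ["a", "photo", "of"]          # assumed prompt template
    tokens = prefix + [placeholder, "that"] + relative_caption.split()
    placeholder_index = len(prefix)
    pseudo_word = map_image_to_pseudo_word(image_features, weights)
    embeddings = [
        pseudo_word if i == placeholder_index else toy_token_embedding(tok)
        for i, tok in enumerate(tokens)
    ]
    return tokens, embeddings


if __name__ == "__main__":
    identity = [[1.0 if i == j else 0.0 for j in range(EMBED_DIM)]
                for i in range(EMBED_DIM)]
    tokens, embeddings = build_query_embeddings(
        [1.0, 2.0, 3.0, 4.0], "is red", identity)
    print(tokens)
    print(embeddings[3])  # the pseudo-word sits where "$" was
```

In a real pipeline the resulting embedding sequence would be passed through the CLIP text encoder, and the output feature compared against candidate image features; that step is omitted here.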

Code Repositories

miccunifi/searle (Official, PyTorch)
miccunifi/circo (Official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
zero-shot-composed-image-retrieval-zs-cir-on | SEARLE-XL (CLIP L/14) | mAP@10: 12.73
zero-shot-composed-image-retrieval-zs-cir-on | SEARLE (CLIP B/32) | mAP@10: 9.94
zero-shot-composed-image-retrieval-zs-cir-on-1 | SEARLE | R@5: 53.42
zero-shot-composed-image-retrieval-zs-cir-on-1 | SEARLE-XL | R@5: 52.48
zero-shot-composed-image-retrieval-zs-cir-on-11 | SEARLE (CLIP B/32) | A-R@1: 14.4
zero-shot-composed-image-retrieval-zs-cir-on-11 | SEARLE (CLIP L/14) | A-R@1: 14.4
zero-shot-composed-image-retrieval-zs-cir-on-2 | SEARLE (CLIP B/32) | (Recall@10+Recall@50)/2: 32.71
zero-shot-composed-image-retrieval-zs-cir-on-2 | SEARLE-XL-OTI (CLIP L/14) | (Recall@10+Recall@50)/2: 37.76
zero-shot-composed-image-retrieval-zs-cir-on-2 | SEARLE-XL (CLIP L/14) | (Recall@10+Recall@50)/2: 35.90
zero-shot-composed-image-retrieval-zs-cir-on-2 | SEARLE-OTI (CLIP B/32) | (Recall@10+Recall@50)/2: 32.39
zero-shot-composed-image-retrieval-zs-cir-on-3 | SEARLE-XL-OTI | R@10: 27.61
zero-shot-composed-image-retrieval-zs-cir-on-4 | SEARLE (CLIP B/32) | Actions Recall@5: 24.58
zero-shot-composed-image-retrieval-zs-cir-on-4 | SEARLE-OTI (CLIP B/32) | Actions Recall@5: 26.00
zero-shot-composed-image-retrieval-zs-cir-on-4 | SEARLE-XL-OTI (CLIP L/14) | Actions Recall@5: 31.43
zero-shot-composed-image-retrieval-zs-cir-on-4 | SEARLE-XL (CLIP L/14) | Actions Recall@5: 29.02
zero-shot-composed-image-retrieval-zs-cir-on-5 | SEARLE-OTI (CLIP B/32) | Average Recall: 12.77
zero-shot-composed-image-retrieval-zs-cir-on-5 | SEARLE-XL-OTI (CLIP B/32) | Average Recall: 20.42
zero-shot-composed-image-retrieval-zs-cir-on-5 | SEARLE-XL (CLIP L/14) | Average Recall: 21.54
zero-shot-composed-image-retrieval-zs-cir-on-5 | SEARLE (CLIP B/32) | Average Recall: 11.94
zero-shot-composed-image-retrieval-zs-cir-on-6 | SEARLE-OTI (CLIP B/32) | (Recall@10+Recall@50)/2: 12.77
zero-shot-composed-image-retrieval-zs-cir-on-6 | SEARLE-XL-OTI (CLIP B/32) | (Recall@10+Recall@50)/2: 20.42
zero-shot-composed-image-retrieval-zs-cir-on-6 | SEARLE (CLIP B/32) | (Recall@10+Recall@50)/2: 11.94
zero-shot-composed-image-retrieval-zs-cir-on-6 | SEARLE-XL (CLIP L/14) | (Recall@10+Recall@50)/2: 21.54
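
For readers unfamiliar with the metric names in the table, the following is a minimal sketch of how Recall@K, the averaged (Recall@10+Recall@50)/2 score, and mAP@K for queries with several ground truths are commonly computed. The function names are hypothetical, and the average-precision normalization by min(K, number of ground truths) is one common convention assumed here, not necessarily the exact formula used by each benchmark. The table reports these values as percentages.

```python
def recall_at_k(ranked_ids, ground_truth_ids, k):
    """Return 1.0 if any ground-truth item is among the top-k results, else 0.0."""
    return 1.0 if set(ranked_ids[:k]) & set(ground_truth_ids) else 0.0


def average_precision_at_k(ranked_ids, ground_truth_ids, k):
    """Average precision over the top-k results for a single query.

    Precision is accumulated at each rank where a ground-truth item appears,
    then divided by min(k, number of ground truths) (assumed convention).
    """
    gt = set(ground_truth_ids)
    if not gt:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in gt:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(k, len(gt))


def mean_metric(metric, queries, k):
    """Average a per-query metric over (ranked_ids, ground_truth_ids) pairs."""
    return sum(metric(ranked, gt, k) for ranked, gt in queries) / len(queries)


if __name__ == "__main__":
    # Two toy queries with hypothetical image ids.
    queries = [
        (["a", "b", "c", "d"], ["b", "d"]),   # two ground truths
        (["x", "y", "z", "w"], ["w"]),        # one ground truth, found at rank 4
    ]
    r2 = mean_metric(recall_at_k, queries, 2)
    r4 = mean_metric(recall_at_k, queries, 4)
    print("Recall@2:", r2)
    print("Recall@4:", r4)
    print("(Recall@2+Recall@4)/2:", (r2 + r4) / 2)
    print("mAP@4:", mean_metric(average_precision_at_k, queries, 4))
```

The multi-ground-truth case is where mAP@K differs from Recall@K: recall only checks whether at least one correct image was retrieved, while average precision rewards ranking all correct images near the top.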
