8 months ago

Multimodal Representation

Computer Vision

Baldrati Alberto ; Agnolucci Lorenzo ; Bertini Marco ; Del Bimbo Alberto

Abstract

Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption that describes the difference between the two images. The high effort and cost required for labeling datasets for CIR hamper the widespread usage of existing methods, as they rely on supervised learning. In this work, we propose a new task, Zero-Shot CIR (ZS-CIR), that aims to address CIR without requiring a labeled training dataset. Our approach, named zero-Shot composEd imAge Retrieval with textuaL invErsion (SEARLE), maps the visual features of the reference image into a pseudo-word token in CLIP token embedding space and integrates it with the relative caption. To support research on ZS-CIR, we introduce an open-domain benchmarking dataset named Composed Image Retrieval on Common Objects in context (CIRCO), which is the first dataset for CIR containing multiple ground truths for each query. The experiments show that SEARLE exhibits better performance than the baselines on the two main datasets for CIR tasks, FashionIQ and CIRR, and on the proposed CIRCO. The dataset, the code and the model are publicly available at https://github.com/miccunifi/SEARLE.
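
To make the pseudo-word idea concrete, here is a minimal, dependency-free sketch of how a composed query could be built and used for retrieval. It is not the SEARLE implementation: the toy vocabulary, the linear map in `invert_image`, the prompt template, the `$` placeholder, and the mean pooling that stands in for the CLIP text encoder are all assumptions made for illustration.

```python
import math
import random

DIM = 8            # toy embedding size (assumption; real CLIP embeddings are much larger)
PLACEHOLDER = "$"  # hypothetical marker for the slot filled by the pseudo-word


def make_vocab(words, seed=0):
    """Toy token embedding table: one random vector per word."""
    rng = random.Random(seed)
    return {w: [rng.gauss(0, 1) for _ in range(DIM)] for w in words}


def invert_image(image_feature, weights):
    """Hypothetical linear map from image feature space to token embedding space."""
    return [sum(w * x for w, x in zip(row, image_feature)) for row in weights]


def embed_query(caption, pseudo_word, vocab):
    """Embed 'a photo of $ that <caption>', with $ replaced by the pseudo-word."""
    tokens = f"a photo of {PLACEHOLDER} that {caption}".split()
    vectors = [pseudo_word if t == PLACEHOLDER else vocab[t] for t in tokens]
    # Mean pooling is a stand-in for the text encoder (assumption).
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query, gallery):
    """Return gallery names sorted by decreasing similarity to the query."""
    return sorted(gallery, key=lambda name: cosine(query, gallery[name]), reverse=True)


# Toy usage: every vector here is random or hand-made, not a real CLIP feature.
vocab = make_vocab("a photo of that is red".split())
rng = random.Random(1)
weights = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]
reference_feature = [rng.gauss(0, 1) for _ in range(DIM)]
pseudo_word = invert_image(reference_feature, weights)
query = embed_query("is red", pseudo_word, vocab)
gallery = {f"img_{i}": [rng.gauss(0, 1) for _ in range(DIM)] for i in range(5)}
ranking = retrieve(query, gallery)
```

The point of the sketch is only the data flow: the reference image becomes a single vector that sits in the caption as if it were a word, so the composed query can be compared with image features by text-to-image similarity, without any labeled CIR triplets.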

