Matan Levy, Rami Ben-Ari, Nir Darshan, Dani Lischinski

Abstract
The task of Composed Image Retrieval (CoIR) involves queries that combine image and text modalities, allowing users to express their intent more effectively. However, current CoIR datasets are orders of magnitude smaller than other vision and language (V&L) datasets. Additionally, some of these datasets have noticeable issues, such as queries containing redundant modalities. To address these shortcomings, we introduce the Large Scale Composed Image Retrieval (LaSCo) dataset, a new CoIR dataset that is ten times larger than existing ones. Pre-training on LaSCo yields a noteworthy improvement in performance, even in the zero-shot setting. Furthermore, we propose a new approach for analyzing CoIR datasets and methods, which detects modality redundancy or necessity in queries. We also introduce a new CoIR baseline, the Cross-Attention driven Shift Encoder (CASE). This baseline allows for early fusion of modalities using a cross-attention module and employs an additional auxiliary task during training. Our experiments demonstrate that this new baseline outperforms the current state-of-the-art methods on established benchmarks such as FashionIQ and CIRR.
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| image-retrieval-on-cirr | CASE (Pre-trained on LaSCo.Ca) | (Recall@5 + Recall_subset@1)/2: 78.25; Recall@10: 88.75 |
| image-retrieval-on-cirr | CASE | (Recall@5 + Recall_subset@1)/2: 77.5; Recall@10: 87.25 |
| image-retrieval-on-fashion-iq | CASE | (Recall@10 + Recall@50)/2: 59.73; Recall@10: 48.79 |
| image-retrieval-on-lasco | BLIP4CIR | Recall@1 (%): 4.26 |
| image-retrieval-on-lasco | CASE | Recall@1 (%): 7.08 |
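
The metrics in the table are Recall@K values and averages of two such values. As a rough illustration of how numbers of this kind are typically computed, here is a minimal sketch, assuming each query has exactly one ground-truth target image and that the retrieval system returns a ranked list of gallery image IDs per query. The function names (`recall_at_k`, `mean_of`) and the toy data are hypothetical and are not taken from the paper or its code.

```python
from typing import Sequence


def recall_at_k(
    ranked_ids: Sequence[Sequence[int]],
    target_ids: Sequence[int],
    k: int,
) -> float:
    """Percentage of queries whose target appears among the top-k results.

    Assumption: one ground-truth target per query (hypothetical setup).
    """
    if len(ranked_ids) != len(target_ids):
        raise ValueError("need one ranking per target")
    if not target_ids:
        raise ValueError("need at least one query")
    hits = sum(
        1
        for ranking, target in zip(ranked_ids, target_ids)
        if target in ranking[:k]
    )
    return 100.0 * hits / len(target_ids)


def mean_of(*values: float) -> float:
    """Plain average, as used for combined metrics like (R@10 + R@50)/2."""
    return sum(values) / len(values)


# Toy data: four queries over a gallery of three images (IDs 1..3).
rankings = [[3, 1, 2], [2, 3, 1], [1, 2, 3], [3, 2, 1]]
targets = [1, 2, 3, 1]

r1 = recall_at_k(rankings, targets, k=1)  # only the second query hits at rank 1
r2 = recall_at_k(rankings, targets, k=2)  # the first query's target is at rank 2
combined = mean_of(r1, r2)
print(r1, r2, combined)
```

A subset variant such as Recall_subset@1 could be sketched the same way by restricting each ranking to a small candidate group before taking the top-k; how that group is defined is benchmark-specific and is not shown here.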