Conditioned and Composed Image Retrieval Combining and Partially Fine-Tuning CLIP-Based Features
Alberto del Bimbo, Tiberio Uricchio, Marco Bertini, Alberto Baldrati

Abstract
In this paper, we present an approach for conditioned and composed image retrieval based on CLIP features. In this extension of content-based image retrieval (CBIR), a query image is combined with a text that conveys the user's intent, a setting relevant to application domains such as e-commerce. The proposed method starts with an initial training stage in which a simple combination of visual and textual features is used to fine-tune the CLIP text encoder. In a second training stage, we learn a more complex combiner network that merges visual and textual features. Contrastive learning is used in both stages. The proposed approach obtains state-of-the-art performance for conditioned CBIR on the FashionIQ dataset and for composed CBIR on the more recent CIRR dataset.
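The first-stage idea described in the abstract (a simple combination of image and text features, trained with a contrastive objective) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the element-wise sum used as the "simple combination", the temperature value, and all function names are assumptions made here, and plain Python lists stand in for the outputs of the CLIP encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def combine_sum(image_feat, text_feat):
    """Assumed 'simple combination': element-wise sum, then L2 normalization."""
    return l2_normalize([a + b for a, b in zip(image_feat, text_feat)])


def batch_contrastive_loss(queries, targets, temperature=0.1):
    """Batch-based contrastive loss (cross-entropy over cosine similarities).

    queries[i] is the combined feature that should retrieve targets[i];
    every other target in the batch acts as a negative.
    The temperature is a hypothetical value chosen for this toy example.
    """
    total = 0.0
    for i, query in enumerate(queries):
        logits = [
            sum(a * b for a, b in zip(query, target)) / temperature
            for target in targets
        ]
        # Numerically stable log-sum-exp over the batch
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


# Toy stand-ins for encoder outputs (not real CLIP features)
reference_images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
relative_texts = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
target_images = [l2_normalize([1.0, 1.0, 0.0]), l2_normalize([0.0, 1.0, 1.0])]

queries = [combine_sum(i, t) for i, t in zip(reference_images, relative_texts)]
loss_matched = batch_contrastive_loss(queries, target_images)
loss_swapped = batch_contrastive_loss(queries, target_images[::-1])
```

In this toy batch the loss is lower when each combined query is paired with its own target than when the targets are swapped, which is the behavior a contrastive objective is meant to reward. In the paper's second stage, the sum would be replaced by a learned combiner network, which this sketch does not attempt to model.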
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| image-retrieval-on-cirr | CLIP4Cir (v2) | (Recall@5+Recall_subset@1)/2: 69.09 |
| image-retrieval-on-fashion-iq | CLIP4Cir (v2) | (Recall@10+Recall@50)/2: 50.03 |
| image-retrieval-on-lasco | CLIP4CIR | Recall@1 (%): 4.01 |
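The metrics in the table are recall values, and two of them are averages of two recalls. As a rough sketch of how such numbers are typically computed, Recall@K can be taken as the percentage of queries whose target image appears among the top K retrieved items. The function names and the toy rankings below are hypothetical; benchmark-specific details, such as CIRR's subset recall being computed over a restricted candidate set, are not modeled here.

```python
def recall_at_k(rankings, targets, k):
    """Percentage of queries whose target is among the top-k ranked items."""
    hits = sum(
        1 for ranked, target in zip(rankings, targets) if target in ranked[:k]
    )
    return 100.0 * hits / len(targets)


def average_recall(rankings, targets, ks):
    """Mean of Recall@K over several cutoffs, e.g. (Recall@10 + Recall@50) / 2."""
    return sum(recall_at_k(rankings, targets, k) for k in ks) / len(ks)


# Toy example: three queries, each with a ranked list of candidate image ids
rankings = [
    ["a", "b", "c", "d"],
    ["c", "a", "d", "b"],
    ["d", "c", "b", "a"],
]
targets = ["a", "d", "a"]

r1 = recall_at_k(rankings, targets, 1)  # only the first query hits at rank 1
r3 = recall_at_k(rankings, targets, 3)  # the first two queries hit within rank 3
avg = average_recall(rankings, targets, [1, 3])
```

An averaged metric such as (Recall@10+Recall@50)/2 corresponds to calling `average_recall` with `ks=[10, 50]` on the full benchmark rankings.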