Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval
Yuanmin Tang, Xiaoting Qin, Jue Zhang, Jing Yu, Gaopeng Gou, Gang Xiong, Qingwei Ling, Saravan Rajmohan, Dongmei Zhang, Qi Wu

Abstract
Composed Image Retrieval (CIR) aims to retrieve target images that closely resemble a reference image while integrating user-specified textual modifications, thereby capturing user intent more precisely. Existing training-free zero-shot CIR (ZS-CIR) methods often employ a two-stage process: they first generate a caption for the reference image and then use Large Language Models for reasoning to obtain a target description. However, these methods suffer from missing critical visual details and limited reasoning capabilities, leading to suboptimal retrieval performance. To address these challenges, we propose a novel, training-free one-stage method, One-Stage Reflective Chain-of-Thought Reasoning for ZS-CIR (OSrCIR), which employs Multimodal Large Language Models to retain essential visual information in a single-stage reasoning process, eliminating the information loss seen in two-stage methods. Our Reflective Chain-of-Thought framework further improves interpretative accuracy by aligning manipulation intent with contextual cues from reference images. OSrCIR achieves performance gains of 1.80% to 6.44% over existing training-free methods across multiple tasks, setting new state-of-the-art results in ZS-CIR and enhancing its utility in vision-language applications. Our code will be available at https://github.com/Pter61/osrcir2024/.
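As a rough illustration of the final retrieval step only: once a target description has been produced, candidates can be ranked by similarity to it. The sketch below assumes the description and the gallery images have already been embedded into a shared vector space (the benchmark entries suggest CLIP backbones, but the encoder is not shown here). The function names, the toy vectors, and the use of cosine similarity are illustrative assumptions, not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(query_vec, candidates):
    """Return candidate ids sorted by descending similarity to the query."""
    scored = [(cosine(query_vec, vec), cid) for cid, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cid for _, cid in scored]

# Hypothetical embedding of the generated target description.
target_description_vec = [0.9, 0.1, 0.0]

# Hypothetical embeddings of three gallery images.
gallery = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.0, 1.0, 0.0],
    "img_c": [0.7, 0.7, 0.0],
}

ranking = rank_candidates(target_description_vec, gallery)
print(ranking)
```

In this toy setup the quality of the ranking depends entirely on how well the target description captures the intended image, which is the part the reasoning stage is meant to improve.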
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| zero-shot-composed-image-retrieval-zs-cir-on | OSrCIR (CLIP B/32) | mAP@10: 19.17 |
| zero-shot-composed-image-retrieval-zs-cir-on | OSrCIR (CLIP L/14) | mAP@10: 25.33 |
| zero-shot-composed-image-retrieval-zs-cir-on | OSrCIR (CLIP G/14) | mAP@10: 31.14 |
| zero-shot-composed-image-retrieval-zs-cir-on-1 | OSrCIR (CLIP B/32) | R@5: 54.54 |
| zero-shot-composed-image-retrieval-zs-cir-on-1 | OSrCIR (CLIP L/14) | R@5: 57.68 |
| zero-shot-composed-image-retrieval-zs-cir-on-1 | OSrCIR (CLIP G/14) | R@5: 67.25 |
| zero-shot-composed-image-retrieval-zs-cir-on-11 | OSrCIR (CLIP B/32) | A-R@1: 17.4 |
| zero-shot-composed-image-retrieval-zs-cir-on-11 | OSrCIR (CLIP L/14) | A-R@1: 17.9 |
| zero-shot-composed-image-retrieval-zs-cir-on-11 | OSrCIR (CLIP G/14) | A-R@1: 19.6 |
| zero-shot-composed-image-retrieval-zs-cir-on-2 | OSrCIR (CLIP B/32) | (Recall@10+Recall@50)/2: 42.87 |
| zero-shot-composed-image-retrieval-zs-cir-on-2 | OSrCIR (CLIP L/14) | (Recall@10+Recall@50)/2: 42.82 |
| zero-shot-composed-image-retrieval-zs-cir-on-2 | OSrCIR (CLIP G/14) | (Recall@10+Recall@50)/2: 47.34 |
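
For readers unfamiliar with the recall-style metrics in the table, the following is a minimal sketch of how Recall@K and an averaged score of the form (Recall@K1 + Recall@K2)/2 are commonly computed. It assumes each query has exactly one ground-truth target image; the function names and the toy rankings are hypothetical and are not taken from the benchmarks above. Small values of K stand in for 10 and 50 so the toy data stays readable.

```python
def recall_at_k(rankings, targets, k):
    """Percentage of queries whose target appears in the top-k results.

    rankings: dict mapping query id -> list of candidate ids, best first.
    targets:  dict mapping query id -> the single ground-truth candidate id.
    """
    hits = sum(1 for q, ranked in rankings.items() if targets[q] in ranked[:k])
    return 100.0 * hits / len(rankings)

def mean_recall(rankings, targets, ks):
    """Average of Recall@K over several cutoffs, e.g. ks=(10, 50)."""
    return sum(recall_at_k(rankings, targets, k) for k in ks) / len(ks)

# Toy retrieval results for four queries (hypothetical data).
rankings = {
    "q1": ["a", "b", "c"],
    "q2": ["b", "a", "c"],
    "q3": ["c", "b", "a"],
    "q4": ["a", "c", "b"],
}
targets = {"q1": "a", "q2": "a", "q3": "a", "q4": "b"}

r1 = recall_at_k(rankings, targets, 1)
r2 = recall_at_k(rankings, targets, 2)
avg = mean_recall(rankings, targets, (1, 2))
print(r1, r2, avg)
```

Metrics such as mAP@10 differ in that they allow several correct targets per query and weight hits by rank, so they are not covered by this sketch.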