Does the Performance of Text-to-Image Retrieval Models Generalize Beyond Captions-as-a-Query?

Davide Mottin, Matteo Lissandrini, Dima Sivov, Gil Lederman, Eliezer Levy, Nima Tavassoli, Juan Manuel Rodriguez

Abstract

Text-to-image retrieval (T2I) refers to the task of recovering all images relevant to a keyword query. Popular datasets for text-to-image retrieval, such as Flickr30k, VG, or MS-COCO, use annotated image captions, e.g., “a man playing with a kid”, as surrogates for queries. With such surrogate queries, current multi-modal machine learning models, such as CLIP or BLIP, perform remarkably well. The main reason is the descriptive nature of captions, which detail the content of an image. Yet, T2I queries go beyond the mere descriptions in image-caption pairs. Thus, these datasets are ill-suited to test methods on more abstract or conceptual queries, e.g., “family vacations”, in which the image content is implied rather than explicitly described. In this paper, we replicate the T2I results on descriptive queries and generalize them to conceptual queries. To this end, we perform new experiments on a novel T2I benchmark for the task of conceptual query answering, called ConQA. ConQA comprises 30 descriptive and 50 conceptual queries over 43k images, with more than 100 manually annotated images per query. Our results on established measures show that both large pretrained models (e.g., CLIP, BLIP, and BLIP-2) and small models (e.g., SGRAF and NAAF) perform up to 4x better on descriptive than on conceptual queries. We also find that the models perform better on queries with more than 6 keywords, as in MS-COCO captions.
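The retrieval setup the abstract describes can be sketched in a few lines: a model such as CLIP embeds the text query and every image into a shared vector space, and images are ranked by cosine similarity to the query. The snippet below is an illustrative sketch of that ranking step with toy embeddings, not the paper's code or any specific model's API.

```python
import numpy as np

def rank_images(query_emb, image_embs):
    """Rank images by cosine similarity to a text-query embedding.

    query_emb: (d,) embedding of the query; image_embs: (n, d) image embeddings.
    Returns image indices ordered from most to least similar.
    """
    q = query_emb / np.linalg.norm(query_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = imgs @ q                # cosine similarity of each image to the query
    return np.argsort(-sims)       # best match first

# Toy example: four images in a 3-d embedding space.
query = np.array([1.0, 0.0, 0.0])
images = np.array([
    [0.9, 0.1, 0.0],    # points almost the same way as the query
    [0.0, 1.0, 0.0],    # orthogonal to the query
    [0.7, 0.7, 0.0],    # partially aligned
    [-1.0, 0.0, 0.0],   # opposite direction
])
print(rank_images(query, images))  # → [0 2 1 3]
```

With a real model, `query_emb` and `image_embs` would come from the text and image encoders; the ranking logic is unchanged.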

Benchmarks

Image retrieval on ConQA (conceptual queries):

    Method   R-precision   Recall@1   Recall@5   Recall@10
    SGRAF        1.3           0.0        8.2       10.2
    BLIP-2       5.4           8.2       28.6       36.7
    BLIP         5.4           4.1       28.6       40.8
    NAAF         2.4           4.1       12.2       16.3
    CLIP         6.8          12.2       30.6       36.7

Image retrieval on ConQA (descriptive queries):

    Method   R-precision   Recall@1   Recall@5   Recall@10
    SGRAF        7.9           6.9       24.1       34.5
    BLIP-2      15.3          20.7       51.7       62.1
    CLIP        16.5          20.7       58.3       65.5
    NAAF        10.6          13.8       34.5       44.8
    BLIP        15.3          20.7       58.3       62.1
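The metrics in the tables follow their standard retrieval definitions: Recall@k checks whether at least one relevant image appears in the top k results for a query, and R-precision is the fraction of relevant images among the top R results, where R is the number of relevant images for that query. The following is a minimal sketch of both metrics on a toy ranking, assuming these standard definitions (the benchmark's exact evaluation script may differ).

```python
def recall_at_k(ranking, relevant, k):
    """1.0 if any relevant image appears in the top-k results, else 0.0."""
    return float(any(img in relevant for img in ranking[:k]))

def r_precision(ranking, relevant):
    """Fraction of relevant images among the top-R results, R = |relevant|."""
    r = len(relevant)
    return sum(img in relevant for img in ranking[:r]) / r

# Toy example: 10 ranked image ids, 4 of which are relevant to the query.
ranking = [3, 7, 1, 9, 4, 2, 8, 0, 6, 5]
relevant = {1, 4, 6, 5}
print(recall_at_k(ranking, relevant, 1))  # → 0.0 (top result is not relevant)
print(recall_at_k(ranking, relevant, 5))  # → 1.0 (images 1 and 4 are in the top 5)
print(r_precision(ranking, relevant))     # → 0.25 (only image 1 is in the top 4)
```

Benchmark scores are these per-query values averaged over all queries (and multiplied by 100).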
