What You See is What You Read? Improving Text-Image Alignment Evaluation
Michal Yarom Yonatan Bitton Soravit Changpinyo Roee Aharoni Jonathan Herzig Oran Lang Eran Ofek Idan Szpektor

Abstract
Automatically determining whether a text and a corresponding image are semantically aligned is a significant challenge for vision-language models, with applications in generative text-to-image and image-to-text tasks. In this work, we study methods for automatic text-image alignment evaluation. We first introduce SeeTRUE: a comprehensive evaluation set, spanning multiple datasets from both text-to-image and image-to-text generation tasks, with human judgements for whether a given text-image pair is semantically aligned. We then describe two automatic methods to determine alignment: the first involving a pipeline based on question generation and visual question answering models, and the second employing an end-to-end classification approach by finetuning multimodal pretrained models. Both methods surpass prior approaches in various text-image alignment tasks, with significant improvements in challenging cases that involve complex composition or unnatural images. Finally, we demonstrate how our approaches can localize specific misalignments between an image and a given text, and how they can be used to automatically re-rank candidates in text-to-image generation.
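The question-generation and question-answering pipeline, and its use for re-ranking, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`alignment_score`, `rerank`), the averaging of per-question scores, and the toy stand-in models (`toy_generate_qa`, `toy_vqa_score`) are all assumptions made for illustration. In practice the two callables would wrap a real question-generation model and a real visual question answering model.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple, TypeVar

Image = TypeVar("Image")
QAPair = Tuple[str, str]  # (question, expected answer)


def alignment_score(
    text: str,
    image: Image,
    generate_qa: Callable[[str], List[QAPair]],
    vqa_score: Callable[[Image, str, str], float],
) -> float:
    """Average the VQA support for every question-answer pair derived from the text.

    Averaging is an assumed aggregation; the abstract does not specify one.
    """
    qa_pairs = generate_qa(text)
    if not qa_pairs:
        return 0.0
    scores = [vqa_score(image, question, answer) for question, answer in qa_pairs]
    return sum(scores) / len(scores)


def rerank(
    text: str,
    candidates: Sequence[Image],
    generate_qa: Callable[[str], List[QAPair]],
    vqa_score: Callable[[Image, str, str], float],
) -> List[Image]:
    """Order candidate images from most to least aligned with the text."""
    return sorted(
        candidates,
        key=lambda image: alignment_score(text, image, generate_qa, vqa_score),
        reverse=True,
    )


# Hypothetical toy stand-ins: an "image" is just the set of words it depicts.
def toy_generate_qa(text: str) -> List[QAPair]:
    return [(f"Is there something '{word}' in the image?", word) for word in text.split()]


def toy_vqa_score(image: FrozenSet[str], question: str, answer: str) -> float:
    return 1.0 if answer in image else 0.0


if __name__ == "__main__":
    good = frozenset({"red", "dog"})
    partial = frozenset({"dog"})
    print(alignment_score("red dog", partial, toy_generate_qa, toy_vqa_score))
    print(rerank("red dog", [partial, good], toy_generate_qa, toy_vqa_score)[0] == good)
```

Keeping the per-question scores, rather than only their mean, is what would let such a pipeline point at the specific part of the text that the image fails to support.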
Benchmarks
| Benchmark | Methodology | Group Score | Image Score | Text Score |
|---|---|---|---|---|
| visual-reasoning-on-winoground | COCA ViT-L14 (f.t on COCO) | 8.25 | 11.50 | 28.25 |
| visual-reasoning-on-winoground | TIFA | 11.30 | 12.50 | 19.00 |
| visual-reasoning-on-winoground | VQ2 | 30.5 | 42.2 | 47 |
| visual-reasoning-on-winoground | PaLI (ft SNLI-VE + Synthetic Data) | 28.75 | 38 | 46.5 |
| visual-reasoning-on-winoground | PaLI (ft SNLI-VE) | 28.70 | 41.50 | 45.00 |
| visual-reasoning-on-winoground | BLIP2 (ft COCO) | 23.50 | 26.00 | 44.00 |
| visual-reasoning-on-winoground | CLIP RN50x64 | 10.25 | 13.75 | 26.50 |
| visual-reasoning-on-winoground | OFA large (ft SNLI-VE) | 9.00 | 14.30 | 27.70 |
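This page does not define the Group, Image, and Text scores. The sketch below follows the standard Winoground protocol as we understand it, which is an assumption here: each example has two captions and two images, the text score credits picking the right caption for each image, the image score credits picking the right image for each caption, and the group score requires both. The dictionary keys (`c0_i0` and so on) and the function names are hypothetical, chosen only for this illustration.

```python
from typing import Dict, List

# One example maps "c{caption}_i{image}" to an alignment score from some model.
# Caption 0 matches image 0, and caption 1 matches image 1.
Scores = Dict[str, float]


def text_correct(s: Scores) -> bool:
    """For each image, the matching caption must outscore the other caption."""
    return s["c0_i0"] > s["c1_i0"] and s["c1_i1"] > s["c0_i1"]


def image_correct(s: Scores) -> bool:
    """For each caption, the matching image must outscore the other image."""
    return s["c0_i0"] > s["c0_i1"] and s["c1_i1"] > s["c1_i0"]


def group_correct(s: Scores) -> bool:
    """An example counts for the group score only if both checks pass."""
    return text_correct(s) and image_correct(s)


def winoground_scores(examples: List[Scores]) -> Dict[str, float]:
    """Return the three scores as percentages over all examples."""
    n = len(examples)
    if n == 0:
        raise ValueError("at least one example is required")
    return {
        "text": 100.0 * sum(text_correct(s) for s in examples) / n,
        "image": 100.0 * sum(image_correct(s) for s in examples) / n,
        "group": 100.0 * sum(group_correct(s) for s in examples) / n,
    }


if __name__ == "__main__":
    toy_examples = [
        {"c0_i0": 0.9, "c0_i1": 0.2, "c1_i0": 0.3, "c1_i1": 0.8},
        {"c0_i0": 0.6, "c0_i1": 0.7, "c1_i0": 0.5, "c1_i1": 0.9},
    ]
    print(winoground_scores(toy_examples))
```

Under these definitions the group score can never exceed either of the other two, which is consistent with every row of the table.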