What You See is What You Read? Improving Text-Image Alignment Evaluation

Michal Yarom, Yonatan Bitton, Soravit Changpinyo, Roee Aharoni, Jonathan Herzig, Oran Lang, Eran Ofek, Idan Szpektor

Abstract

Automatically determining whether a text and a corresponding image are semantically aligned is a significant challenge for vision-language models, with applications in generative text-to-image and image-to-text tasks. In this work, we study methods for automatic text-image alignment evaluation. We first introduce SeeTRUE: a comprehensive evaluation set, spanning multiple datasets from both text-to-image and image-to-text generation tasks, with human judgements for whether a given text-image pair is semantically aligned. We then describe two automatic methods to determine alignment: the first involving a pipeline based on question generation and visual question answering models, and the second employing an end-to-end classification approach by finetuning multimodal pretrained models. Both methods surpass prior approaches in various text-image alignment tasks, with significant improvements in challenging cases that involve complex composition or unnatural images. Finally, we demonstrate how our approaches can localize specific misalignments between an image and a given text, and how they can be used to automatically re-rank candidates in text-to-image generation.
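The first method described in the abstract, a question-generation plus visual-question-answering (VQA) pipeline, can be sketched roughly as follows. This is a minimal illustration of the scoring idea only, not the authors' implementation: `generate_questions` and `answer_question` are hypothetical stubs standing in for real QG and VQA models, and the "image" is represented as a toy set of visible concepts.

```python
# Sketch of a VQ^2-style alignment score: generate yes/no questions
# from the text, ask a VQA model each one, and average the "yes"
# probabilities. Both model calls below are hypothetical stubs.

def generate_questions(text):
    """Hypothetical QG model: derive one yes/no question per
    candidate answer span in the text (naive word-based stub)."""
    return [f"Does the image show {w}?" for w in text.lower().split()]

def answer_question(image, question):
    """Hypothetical VQA model returning P(answer == 'yes').
    The stub checks whether the queried word is among the
    concepts visible in the toy 'image' (a set of strings)."""
    word = question.removeprefix("Does the image show ").rstrip("?")
    return 1.0 if word in image else 0.0

def alignment_score(image, text):
    """Average VQA 'yes' probability over all generated questions."""
    questions = generate_questions(text)
    probs = [answer_question(image, q) for q in questions]
    return sum(probs) / len(probs)
```

For example, with a toy image `{"dog", "frisbee"}` and the text "dog catching frisbee", two of the three generated questions are answered "yes", giving a score of 2/3. Averaging per-question answers is also what lets this family of methods localize *which* part of the text is misaligned: the questions with low "yes" probability point at the mismatched phrase.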

Code Repositories

yonatanbitton/wysiwyr (Official, PyTorch)

Benchmarks

| Benchmark | Methodology | Group Score | Image Score | Text Score |
|---|---|---|---|---|
| visual-reasoning-on-winoground | COCA ViT-L14 (f.t. on COCO) | 8.25 | 11.50 | 28.25 |
| visual-reasoning-on-winoground | TIFA | 11.30 | 12.50 | 19.00 |
| visual-reasoning-on-winoground | VQ2 | 30.5 | 42.2 | 47 |
| visual-reasoning-on-winoground | PaLI (ft SNLI-VE + Synthetic Data) | 28.75 | 38 | 46.5 |
| visual-reasoning-on-winoground | PaLI (ft SNLI-VE) | 28.70 | 41.50 | 45.00 |
| visual-reasoning-on-winoground | BLIP2 (ft COCO) | 23.50 | 26.00 | 44.00 |
| visual-reasoning-on-winoground | CLIP RN50x64 | 10.25 | 13.75 | 26.50 |
| visual-reasoning-on-winoground | OFA large (ft SNLI-VE) | 9.00 | 14.30 | 27.70 |
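The abstract also notes that alignment scores can be used to re-rank candidates in text-to-image generation. A minimal sketch of that step, assuming some alignment scorer is available — the `toy_score` function here is a hypothetical stand-in that treats an image as a set of visible concepts:

```python
# Re-rank generated image candidates by text-image alignment.
# `score_fn` is any alignment scorer; `toy_score` below is a
# hypothetical placeholder for a real model-based scorer.

def toy_score(image, text):
    """Hypothetical scorer: fraction of text words that appear
    among the concepts of a toy 'image' (a set of strings)."""
    words = text.lower().split()
    return sum(w in image for w in words) / len(words)

def rerank(candidates, text, score_fn):
    """Return candidates sorted by descending alignment score,
    so the best-aligned generation comes first."""
    return sorted(candidates, key=lambda img: score_fn(img, text),
                  reverse=True)
```

With candidates `[{"cat"}, {"dog", "frisbee"}, {"dog"}]` and the prompt "dog frisbee", re-ranking puts `{"dog", "frisbee"}` first. Since `sorted` is stable, ties preserve the generator's original order.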
