Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images

Nitzan Bitton-Guetta, Yonatan Bitton, Jack Hessel, Ludwig Schmidt, Yuval Elovici, Gabriel Stanovsky, Roy Schwartz

Abstract

Weird, unusual, and uncanny images pique the curiosity of observers because they challenge common sense. For example, an image released during the 2022 World Cup depicts the famous soccer stars Lionel Messi and Cristiano Ronaldo playing chess, which playfully violates our expectation that their competition should occur on the football field. Humans can easily recognize and interpret these unconventional images, but can AI models do the same? We introduce WHOOPS!, a new dataset and benchmark for visual commonsense. The dataset comprises purposefully commonsense-defying images created by designers using publicly available image generation tools like Midjourney. We consider several tasks posed over the dataset. In addition to image captioning, cross-modal matching, and visual question answering, we introduce a difficult explanation generation task, where models must identify and explain why a given image is unusual. Our results show that state-of-the-art models such as GPT3 and BLIP2 still lag behind human performance on WHOOPS!. We hope our dataset will inspire the development of AI models with stronger visual commonsense reasoning abilities. Data, models, and code are available at the project website: whoops-benchmark.github.io
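
For readers who want to experiment with the benchmark, a minimal loading sketch is shown below. The Hugging Face Hub identifier is an assumption; consult the project website for the authoritative download instructions and any access terms.

```python
from datasets import load_dataset

# Hub ID below is an assumption -- check whoops-benchmark.github.io for the
# authoritative download location and any access terms.
dataset = load_dataset("nlphuji/whoops")

split = next(iter(dataset.values()))  # split names vary; take the first one
print(split[0].keys())                # image plus text annotations
```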

Benchmarks

explanation-generation-on-whoops

Methodology                           | Human (%)
------------------------------------- | ---------
Predicted Caption -> GPT3             | 33
BLIP2 FlanT5-XL (Fine-tuned)          | 15
BLIP2 FlanT5-XXL (Fine-tuned)         | 27
Ground-truth Caption -> GPT3 (Oracle) | 68
BLIP2 FlanT5-XXL (Zero-shot)          | 0
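
The pipeline variants above first caption the image (or use the ground-truth caption) and then ask a language model to explain the oddity. Below is a minimal sketch of that second stage, assuming an OpenAI-style client; the prompt wording and model name are illustrative assumptions, not the paper's exact setup (the paper used GPT3).

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def explain_from_caption(caption: str) -> str:
    # Prompt a language model to explain why a captioned scene is unusual.
    prompt = (
        "The following caption describes an image that violates common sense:\n"
        f"{caption}\n"
        "Explain in one or two sentences why the depicted scene is unusual."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed stand-in for the GPT3 model in the paper
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(explain_from_caption("Lionel Messi and Cristiano Ronaldo playing chess."))
```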

image-captioning-on-whoops

Methodology                   | BLEU-4 | CIDEr
----------------------------- | ------ | -----
OFA Large                     | 0      | 0
BLIP2 FlanT5-XXL (Fine-tuned) | 42     | 177
CoCa ViT-L-14 MSCOCO          | 25     | 102
BLIP2 FlanT5-XXL (Zero-shot)  | 31     | 120
BLIP Large                    | 13     | 65
BLIP2 FlanT5-XL (Fine-tuned)  | 41     | 174
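
A minimal zero-shot captioning sketch with BLIP2 via Hugging Face transformers is shown below; the XL checkpoint (a smaller sibling of the XXL model benchmarked above) is the standard Salesforce release, and the image path is a placeholder.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xl", torch_dtype=torch.float16
).to("cuda")

image = Image.open("weird_image.jpg")  # placeholder path
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)
```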

image-to-text-retrieval-on-whoops

Methodology                     | Specificity
------------------------------- | -----------
BLIP2 FlanT5-XXL (Text-only FT) | 94
BLIP2 FlanT5-XL (Fine-tuned)    | 81
CoCa ViT-L-14 MSCOCO            | 72
BLIP2 FlanT5-XXL (Zero-shot)    | 71
BLIP2 FlanT5-XXL (Fine-tuned)   | 84
BLIP Large                      | 77
CLIP ViT-L/14                   | 70
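
Cross-modal matching of this kind can be scored with an off-the-shelf dual encoder by comparing an image against candidate captions. Below is a minimal sketch with CLIP ViT-L/14, the weakest matcher in the table above; the example captions are illustrative.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("weird_image.jpg")  # placeholder path
captions = [
    "Two men playing chess.",                             # underspecified
    "Lionel Messi and Cristiano Ronaldo playing chess.",  # specific
]
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-to-text similarity scores

# Pick the caption the model judges most similar to the image.
print(captions[logits.argmax().item()])
```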

visual-question-answering-vqa-on-whoops

Methodology                     | BEM | Exact Match
------------------------------- | --- | -----------
BLIP2 FlanT5-XXL (Text-only FT) | 24  | 4
BLIP2 FlanT5-XL (Fine-tuned)    | 55  | 20
OFA Large                       | 38  | 8
BLIP Large                      | 39  | 6
BLIP2 FlanT5-XXL (Zero-shot)    | 55  | 15
BLIP2 FlanT5-XXL (Fine-tuned)   | 57  | 21
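
Exact Match compares a normalized predicted answer against the reference; BEM, by contrast, is a learned answer-equivalence model and is not sketched here. Below is a minimal Exact Match sketch using a common normalization scheme (lowercasing, stripping punctuation and articles); the paper's exact normalization rules are not reproduced.

```python
import re
import string

def normalize(answer: str) -> str:
    # Lowercase, drop punctuation, remove articles, collapse whitespace.
    answer = answer.lower()
    answer = "".join(ch for ch in answer if ch not in string.punctuation)
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)
    return " ".join(answer.split())

def exact_match(prediction: str, reference: str) -> int:
    return int(normalize(prediction) == normalize(reference))

print(exact_match("A chessboard.", "the chessboard"))  # -> 1
```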
