IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models

Haz Sameen Shahgir, Khondker Salman Sayeed, Abhik Bhattacharjee, Wasi Uddin Ahmad, Yue Dong, Rifat Shahriyar

Abstract

The advent of Vision Language Models (VLMs) has allowed researchers to investigate the visual understanding of a neural network using natural language. Beyond object classification and detection, VLMs are capable of visual comprehension and common-sense reasoning. This naturally leads to the question: how do VLMs respond when the image itself is inherently unreasonable? To this end, we present IllusionVQA: a diverse dataset of challenging optical illusions and hard-to-interpret scenes to test the capability of VLMs in two distinct multiple-choice VQA tasks: comprehension and soft localization. GPT4V, the best-performing VLM, achieves 62.99% accuracy (4-shot) on the comprehension task and 49.7% on the localization task (4-shot and Chain-of-Thought). In human evaluation, participants achieve 91.03% accuracy on comprehension and 100% on localization. We discover that In-Context Learning (ICL) and Chain-of-Thought reasoning substantially degrade the performance of Gemini-Pro on the localization task. Tangentially, we discover a potential weakness in the ICL capabilities of VLMs: they fail to locate optical illusions even when the correct answer is in the context window as a few-shot example.
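Both tasks are multiple-choice, so the reported accuracy is the share of questions on which the model picks the correct option. A minimal sketch of that metric follows; the function name and the toy answers are hypothetical and are not taken from the IllusionVQA repository or dataset.

```python
def multiple_choice_accuracy(predictions, answers):
    """Percentage of questions whose predicted option matches the gold option."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("no questions to score")
    # Compare option labels case-insensitively, ignoring surrounding whitespace.
    correct = sum(
        p.strip().lower() == a.strip().lower()
        for p, a in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)


# Hypothetical toy data (assumption): four questions, three answered correctly.
gold = ["a", "c", "b", "d"]
pred = ["A", "c", "d", "d"]
print(f"{multiple_choice_accuracy(pred, gold):.2f}")  # 75.00
```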

Code Repositories

csebuetnlp/illusionvqa (official, PyTorch)

Benchmarks

| Benchmark | Methodology | Accuracy (%) |
|---|---|---|
| object-localization-on-illusionvqa | GPT4-Vision 4-shot+CoT | 49.7 |
| object-localization-on-illusionvqa | GPT4-Vision | 40 |
| object-localization-on-illusionvqa | Gemini-Pro 4-shot | 41.8 |
| object-localization-on-illusionvqa | InstructBLIP-13B | 24.3 |
| object-localization-on-illusionvqa | LLaVA-1.5-13B | 24.8 |
| object-localization-on-illusionvqa | CogVLM | 28 |
| object-localization-on-illusionvqa | Gemini-Pro | 43.5 |
| object-localization-on-illusionvqa | GPT4-Vision 4-shot | 46 |
| object-localization-on-illusionvqa | Gemini-Pro 4-shot+CoT | 33.9 |
| visual-question-answering-vqa-on-illusionvqa | GPT4-Vision | 58.85 |
| visual-question-answering-vqa-on-illusionvqa | LLaVA-1.5-13B | 40 |
| visual-question-answering-vqa-on-illusionvqa | Gemini-Pro | 51.26 |
| visual-question-answering-vqa-on-illusionvqa | InstructBLIP-13B | 34.25 |
| visual-question-answering-vqa-on-illusionvqa | Gemini-Pro 4-shot | 52.87 |
| visual-question-answering-vqa-on-illusionvqa | CogVLM | 38.16 |
| visual-question-answering-vqa-on-illusionvqa | GPT4-Vision 4-shot | 62.99 |
