A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs
Vaishnav Mohit ; Tammet Tanel

Abstract
A fundamental challenge in artificial intelligence involves understanding the cognitive mechanisms underlying visual reasoning in sophisticated models like Vision-Language Models (VLMs). How do these models integrate visual perception with abstract thought, especially when reasoning across multiple images or requiring fine-grained compositional understanding? Drawing inspiration from cognitive science, this paper introduces a structured evaluation framework using diverse visual reasoning tasks, Bongard Problems (BPs) and Winoground, to dissect the perception-reasoning interface in VLMs. We propose three distinct evaluation paradigms, mirroring human problem-solving strategies: Direct Visual Rule Learning (DVRL; holistic processing), Deductive Rule Learning (DRL; rule extraction and application), and Componential Analysis (CA; analytical decomposition via task-agnostic textual descriptions). These paradigms systematically vary cognitive load and probe processing stages. Notably, CA enables multi-image reasoning evaluation even for single-image architectures and isolates reasoning from perception by operating on textual descriptions. Applying this framework, we demonstrate that CA, leveraging powerful language models for reasoning over rich, independently generated descriptions, achieves new state-of-the-art (SOTA) performance on challenging benchmarks including Bongard-OpenWorld, Bongard-HOI, and Winoground. Ablation studies confirm reasoning improves significantly when perceptual challenges are mitigated, revealing a critical perception bottleneck. Our framework provides a valuable diagnostic tool and suggests that decoupling perception (via rich, task-agnostic description) from reasoning is a promising direction for robust and general visual intelligence.
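The two-stage structure of Componential Analysis described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the prompt layout, and the toy `describe` and `reason` stand-ins are all assumptions made for illustration. In practice `describe` would wrap a VLM call that captions one image at a time, and `reason` would wrap a text-only language model.

```python
from typing import Callable, Dict, List, Sequence, Set

# Hypothetical interfaces (assumptions, not from the paper):
# a describer maps one image reference to a task-agnostic text description,
# a reasoner maps a text prompt to a text answer.
Describer = Callable[[str], str]
Reasoner = Callable[[str], str]


def build_prompt(pos: Sequence[str], neg: Sequence[str], query: str) -> str:
    """Assemble a text-only prompt from independently generated descriptions."""
    lines = ["Positive examples:"]
    lines += [f"- {d}" for d in pos]
    lines.append("Negative examples:")
    lines += [f"- {d}" for d in neg]
    lines.append("Query:")
    lines.append(f"- {query}")
    lines.append("Answer with 'positive' or 'negative'.")
    return "\n".join(lines)


def componential_analysis(positives: Sequence[str], negatives: Sequence[str],
                          query: str, describe: Describer,
                          reason: Reasoner) -> str:
    # Stage 1 (perception): each image is described on its own, with no
    # knowledge of the task or of the other images.
    pos_desc = [describe(img) for img in positives]
    neg_desc = [describe(img) for img in negatives]
    query_desc = describe(query)
    # Stage 2 (reasoning): the reasoner sees text only, never pixels.
    answer = reason(build_prompt(pos_desc, neg_desc, query_desc))
    return "positive" if answer.strip().lower().startswith("positive") else "negative"


# Toy stand-ins so the sketch runs without any model (illustrative only).
CAPTIONS: Dict[str, str] = {
    "p1.jpg": "a dog running on grass",
    "p2.jpg": "a dog sleeping on sofa",
    "n1.jpg": "a cat running on grass",
    "n2.jpg": "a cat sleeping on sofa",
    "q1.jpg": "a dog swimming in lake",
    "q2.jpg": "a cat swimming in lake",
}


def toy_describe(image: str) -> str:
    return CAPTIONS[image]


def toy_reason(prompt: str) -> str:
    """Toy rule: words shared by all positives and absent from all negatives."""
    sections: Dict[str, List[Set[str]]] = {
        "Positive examples:": [], "Negative examples:": [], "Query:": []}
    current = None
    for line in prompt.splitlines():
        if line in sections:
            current = line
        elif line.startswith("- ") and current is not None:
            sections[current].append(set(line[2:].split()))
    rule = set.intersection(*sections["Positive examples:"]) \
        - set.union(*sections["Negative examples:"])
    query_words = sections["Query:"][0]
    return "positive" if rule and rule <= query_words else "negative"


if __name__ == "__main__":
    for q in ("q1.jpg", "q2.jpg"):
        label = componential_analysis(["p1.jpg", "p2.jpg"], ["n1.jpg", "n2.jpg"],
                                      q, toy_describe, toy_reason)
        print(q, label)
```

Because the reasoner only ever receives text, the same pipeline works with a describer that accepts a single image per call, which is how the abstract's claim about single-image architectures can be read.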
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| visual-reasoning-on-bongard-openworld | Componential analysis - gpt-4o | 2-Class Accuracy: 92.8 |
| visual-reasoning-on-bongard-openworld | Componential analysis - gemini-2.0 | 2-Class Accuracy: 93.6 |