How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective

Abstract
Visual Spatial Reasoning (VSR) is a core human cognitive ability and a critical requirement for advancing embodied intelligence and autonomous systems. Despite recent progress in Vision-Language Models (VLMs), achieving human-level VSR remains highly challenging due to the complexity of representing and reasoning over three-dimensional space. In this paper, we present a systematic investigation of VSR in VLMs, encompassing a review of existing methodologies across input modalities, model architectures, training strategies, and reasoning mechanisms. Furthermore, we categorize spatial intelligence into three levels of capability, i.e., basic perception, spatial understanding, and spatial planning, and curate SIBench, a spatial intelligence benchmark encompassing nearly 20 open-source datasets across 23 task settings. Experiments with state-of-the-art VLMs reveal a pronounced gap between perception and reasoning, as models show competence in basic perceptual tasks but consistently underperform in understanding and planning tasks, particularly in numerical estimation, multi-view reasoning, temporal dynamics, and spatial imagination. These findings underscore the substantial challenges that remain in achieving spatial intelligence, while providing both a systematic roadmap and a comprehensive benchmark to drive future research in the field. The related resources of this study are accessible at https://sibench.github.io/Awesome-Visual-Spatial-Reasoning/.