How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective

Abstract
Visual Spatial Reasoning (VSR) is a core human cognitive ability and a critical requirement for advancing embodied intelligence and autonomous systems. Despite recent progress in Vision-Language Models (VLMs), achieving human-level VSR remains highly challenging due to the complexity of representing and reasoning over three-dimensional space. In this paper, we present a systematic investigation of VSR in VLMs, encompassing a review of existing methodologies across input modalities, model architectures, training strategies, and reasoning mechanisms. Furthermore, we categorize spatial intelligence into three levels of capability, i.e., basic perception, spatial understanding, and spatial planning, and curate SIBench, a spatial intelligence benchmark encompassing nearly 20 open-source datasets across 23 task settings. Experiments with state-of-the-art VLMs reveal a pronounced gap between perception and reasoning, as models show competence in basic perceptual tasks but consistently underperform in understanding and planning tasks, particularly in numerical estimation, multi-view reasoning, temporal dynamics, and spatial imagination. These findings underscore the substantial challenges that remain in achieving spatial intelligence, while providing both a systematic roadmap and a comprehensive benchmark to drive future research in the field. The related resources of this study are accessible at https://sibench.github.io/Awesome-Visual-Spatial-Reasoning/.