Xinrun Xu Pi Bu Ye Wang Börje F. Karlsson Ziming Wang Tengtao Song Qi Zhu Jun Song Zhiming Ding Bo Zheng

Abstract
Although Vision Language Models (VLMs) exhibit strong perceptual abilities and impressive visual reasoning, they struggle with attention to detail and precise action planning in complex, dynamic environments, leading to subpar performance. Real-world tasks typically require complex interactions, advanced spatial reasoning, long-term planning, and continuous strategy refinement, usually necessitating an understanding of the physical rules of the target scenario. However, evaluating these capabilities in real-world scenarios is often prohibitively expensive. To bridge this gap, we introduce DeepPHY, a novel benchmark framework designed to systematically evaluate VLMs' understanding of and reasoning about fundamental physical principles through a series of challenging simulated environments. DeepPHY integrates multiple physical reasoning environments of varying difficulty levels and incorporates fine-grained evaluation metrics. Our evaluation finds that even state-of-the-art VLMs struggle to translate descriptive physical knowledge into precise, predictive control.