Xinrun Xu Pi Bu Ye Wang Börje F. Karlsson Ziming Wang Tengtao Song Qi Zhu Jun Song Zhiming Ding Bo Zheng

Abstract
Although Vision Language Models (VLMs) exhibit strong perceptual abilities and impressive visual reasoning, they struggle with attention to detail and precise action planning in complex, dynamic environments, leading to subpar performance. Real-world tasks typically require complex interactions, advanced spatial reasoning, long-term planning, and continuous strategy refinement, usually necessitating an understanding of the physical rules of the target scenario. However, evaluating these capabilities in real-world scenarios is often prohibitively expensive. To bridge this gap, we introduce DeepPHY, a novel benchmark framework designed to systematically evaluate VLMs' understanding of and reasoning about fundamental physical principles through a series of challenging simulated environments. DeepPHY integrates multiple physical reasoning environments of varying difficulty levels and incorporates fine-grained evaluation metrics. Our evaluation finds that even state-of-the-art VLMs struggle to translate descriptive physical knowledge into precise, predictive control.