BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities

Abstract
Embodied capabilities refer to a suite of fundamental abilities for an agent to perceive, comprehend, and interact with the physical world. While multimodal large language models (MLLMs) show promise as embodied agents, a thorough and systematic evaluation of their embodied capabilities remains underexplored, as existing benchmarks primarily focus on specific domains such as planning or spatial understanding. To bridge this gap, we introduce BEAR, a comprehensive and fine-grained benchmark that evaluates MLLMs on atomic embodied capabilities. BEAR comprises 4,469 interleaved image-video-text entries across 14 domains in 6 categories, with tasks ranging from low-level pointing, trajectory understanding, and spatial reasoning to high-level planning. Extensive evaluation of 20 representative MLLMs reveals their persistent limitations across all domains of embodied capabilities. To address this shortfall, we propose BEAR-Agent, a multimodal conversable agent that integrates pretrained vision models to strengthen MLLM perception, 3D understanding, and planning capabilities. It substantially enhances MLLM performance across diverse embodied capabilities on BEAR, yielding a 9.12% absolute gain and a 17.5% relative improvement on GPT-5. Furthermore, our experiments indicate that improving MLLM embodied capabilities can benefit embodied tasks in simulated environments. Project website: https://bear-official66.github.io/