BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities

Abstract
Embodied capabilities refer to a suite of fundamental abilities for an agent to perceive, comprehend, and interact with the physical world. While multimodal large language models (MLLMs) show promise as embodied agents, a thorough and systematic evaluation of their embodied capabilities remains underexplored, as existing benchmarks primarily focus on specific domains such as planning or spatial understanding. To bridge this gap, we introduce BEAR, a comprehensive and fine-grained benchmark that evaluates MLLMs on atomic embodied capabilities. BEAR comprises 4,469 interleaved image-video-text entries across 14 domains in 6 categories, with tasks ranging from low-level pointing, trajectory understanding, and spatial reasoning to high-level planning. Extensive evaluation of 20 representative MLLMs reveals their persistent limitations across all domains of embodied capabilities. To address this shortfall, we propose BEAR-Agent, a multimodal conversable agent that integrates pretrained vision models to strengthen MLLM perception, 3D understanding, and planning capabilities. It substantially enhances MLLM performance across diverse embodied capabilities on BEAR, yielding a 9.12% absolute gain and a 17.5% relative improvement on GPT-5. Furthermore, our experiments indicate that improving MLLM embodied capabilities can benefit embodied tasks in simulated environments. Project website: https://bear-official66.github.io/