21 days ago

BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities

Abstract

Embodied capabilities refer to a suite of fundamental abilities for an agent to perceive, comprehend, and interact with the physical world. While multimodal large language models (MLLMs) show promise as embodied agents, a thorough and systematic evaluation of their embodied capabilities remains underexplored, as existing benchmarks primarily focus on specific domains such as planning or spatial understanding. To bridge this gap, we introduce BEAR, a comprehensive and fine-grained benchmark that evaluates MLLMs on atomic embodied capabilities. BEAR comprises 4,469 interleaved image-video-text entries across 14 domains in 6 categories, including tasks from low-level pointing, trajectory understanding, spatial reasoning, to high-level planning. Extensive evaluation results of 20 representative MLLMs reveal their persistent limitations across all domains of embodied capabilities. To tackle the shortfall, we propose BEAR-Agent, a multimodal conversable agent that integrates pretrained vision models to strengthen MLLM perception, 3D understanding, and planning capabilities. It substantially enhances MLLM performance across diverse embodied capabilities on BEAR, yielding a 9.12% absolute gain and a relative improvement of 17.5% on GPT-5. Furthermore, our experiments indicate that improving MLLM embodied capabilities can benefit embodied tasks in simulated environments. Project website: https://bear-official66.github.io/
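The two GPT-5 figures in the abstract are related by simple arithmetic: a relative improvement is the absolute gain divided by the baseline score. The sketch below back-computes the baseline that those two numbers imply. The abstract does not state the baseline or the enhanced score, so both values here are inferences from rounded figures, not reported results, and the function name `implied_baseline` is purely illustrative.

```python
def implied_baseline(absolute_gain_pts: float, relative_gain: float) -> float:
    """Baseline score implied by an absolute gain (in percentage points)
    and a relative gain (as a fraction), using relative = absolute / baseline."""
    return absolute_gain_pts / relative_gain


# Figures quoted in the abstract for GPT-5 (both are rounded there).
absolute_gain_pts = 9.12   # absolute gain, percentage points
relative_gain = 0.175      # 17.5% relative improvement

# Inferred values, NOT reported in the abstract.
baseline = implied_baseline(absolute_gain_pts, relative_gain)
enhanced = baseline + absolute_gain_pts

print(f"implied baseline: {baseline:.1f}")
print(f"implied enhanced score: {enhanced:.1f}")
```

Under these assumptions the baseline comes out at roughly 52 and the enhanced score at roughly 61; because the inputs are rounded, the last digit should not be read as exact.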
