4 months ago

EmbRACE-3K: Embodied Reasoning and Action in Complex Environments

Mingxian Lin, Wei Huang, Yitang Li, Chengjie Jiang, Kui Wu, Fangwei Zhong, Shengju Qian, Xin Wang, Xiaojuan Qi

Abstract

Recent advanced vision-language models (VLMs) have demonstrated strong performance on passive, offline image and video understanding tasks. However, their effectiveness in embodied settings, which require online interaction and active scene understanding, remains limited. In such scenarios, an agent perceives the environment from a first-person perspective, with each action dynamically shaping subsequent observations. Even state-of-the-art models such as GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro struggle in open-environment interactions, exhibiting clear limitations in spatial reasoning and long-horizon planning. To address this gap, we introduce EmbRACE-3K, a dataset of over 3,000 language-guided tasks situated in diverse, photorealistic environments constructed using Unreal Engine and the UnrealCV-Zoo framework. The tasks encompass a wide range of embodied challenges, including navigation, object manipulation, and multi-stage goal execution. Each task unfolds as a multi-step trajectory, pairing first-person visual observations with high-level instructions, grounded actions, and natural language rationales that express the agent's intent at every step. Using EmbRACE-3K, we establish a benchmark to evaluate the embodied reasoning capabilities of VLMs across three key dimensions: Exploration, Dynamic Spatial-Semantic Reasoning, and Multi-stage Goal Execution. In zero-shot settings, all models achieve success rates below 20%, underscoring the challenge posed by our benchmark and the current limitations of VLMs in interactive environments. To demonstrate the utility of EmbRACE-3K, we further fine-tune Qwen2.5-VL-7B using supervised learning followed by reinforcement learning. This approach yields substantial improvements across all three challenge categories, highlighting the dataset's effectiveness in enabling the development of embodied reasoning capabilities.
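
As a rough illustration of the trajectory structure described in the abstract, the sketch below models a task as a high-level instruction plus a sequence of steps, each pairing an observation with an action and a rationale, and computes a success rate over a set of tasks. All class names, field names, and action labels here are hypothetical; the abstract does not specify the dataset's actual schema or file format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    # All field names are assumptions for illustration, not the dataset's schema.
    observation_path: str  # path to the first-person frame seen at this step
    action: str            # grounded action label, e.g. "move_forward"
    rationale: str         # natural-language statement of the agent's intent


@dataclass
class Trajectory:
    instruction: str                      # high-level language instruction
    steps: List[Step] = field(default_factory=list)
    success: bool = False                 # whether the task goal was reached


def success_rate(trajectories: List[Trajectory]) -> float:
    """Return the fraction of trajectories marked successful (0.0 if empty)."""
    if not trajectories:
        return 0.0
    return sum(t.success for t in trajectories) / len(trajectories)


# Two toy tasks: one completed, one not.
demo = [
    Trajectory(
        instruction="Walk to the red door.",
        steps=[Step("frame_000.png", "move_forward", "The door is straight ahead.")],
        success=True,
    ),
    Trajectory(
        instruction="Find the bench in the park.",
        steps=[Step("frame_000.png", "turn_left", "No bench is visible; look around.")],
        success=False,
    ),
]
rate = success_rate(demo)
```

Under this reading, the reported zero-shot result would correspond to `success_rate` returning a value below 0.2 for every evaluated model.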
