DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

Abstract
Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces a dynamic-region-guided world knowledge prediction, integrated with the spatial and semantic cues, which provide compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world by first forming abstract multimodal reasoning chains before acting. To mitigate interference among the dynamic, spatial and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments on both real-world and simulation environments demonstrate that DreamVLA achieves 76.7% success rate on real robot tasks and 4.44 average length on the CALVIN ABC-D benchmarks.
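The block-wise structured attention described in the abstract can be illustrated with a small mask-construction sketch. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_blockwise_mask`, the token layout (shared context tokens followed by dynamic, spatial and semantic query groups) and the group sizes are all hypothetical. The one property it takes from the abstract is that the three knowledge groups do not attend to one another.

```python
import numpy as np


def build_blockwise_mask(n_context, n_dynamic, n_spatial, n_semantic):
    """Build a boolean attention mask; mask[i, j] is True if token i may attend to token j.

    Assumed (hypothetical) token order: [context | dynamic | spatial | semantic].
    """
    sizes = [n_context, n_dynamic, n_spatial, n_semantic]
    # offsets[g] is the start index of group g; offsets[-1] is the total length.
    offsets = np.concatenate(([0], np.cumsum(sizes)))
    total = int(offsets[-1])

    mask = np.zeros((total, total), dtype=bool)

    # Assumption: every token may read the shared context tokens.
    mask[:, :n_context] = True

    # Each knowledge group attends within its own block only, so the
    # dynamic, spatial and semantic queries never see each other.
    for g in range(1, 4):
        start, end = int(offsets[g]), int(offsets[g + 1])
        mask[start:end, start:end] = True

    return mask


if __name__ == "__main__":
    # Tiny example: 2 context tokens and 1 query token per knowledge group.
    print(build_blockwise_mask(2, 1, 1, 1).astype(int))
```

In an attention layer, a mask like this would typically be applied by setting the logits at the `False` positions to a large negative value before the softmax, so that cross-group attention weights become zero.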