Spatial Forcing: Implicit Spatial Representation Alignment for Vision-Language-Action Models
Fuhao Li, Wenxuan Song, Han Zhao, Jingbo Wang, Pengxiang Ding, Donglin Wang, Long Zeng, Haoang Li

Abstract
Vision-language-action (VLA) models have recently shown strong potential in enabling robots to follow language instructions and execute precise actions. However, most VLAs are built upon vision-language models pretrained solely on 2D data, which lack accurate spatial awareness, hindering their ability to operate in the 3D physical world. Existing solutions attempt to incorporate explicit 3D sensor inputs such as depth maps or point clouds, but these approaches face challenges due to sensor noise, hardware heterogeneity, and incomplete depth coverage in existing datasets. Alternative methods that estimate 3D cues from 2D images also suffer from the limited performance of depth estimators. We propose Spatial Forcing (SF), a simple yet effective alignment strategy that implicitly forces VLA models to develop spatial comprehension capabilities without relying on explicit 3D inputs or depth estimators. SF aligns intermediate visual embeddings of VLAs with geometric representations produced by pretrained 3D foundation models. By enforcing alignment at intermediate layers, SF guides VLAs to encode richer spatial representations that enhance action precision. Extensive experiments in simulation and real-world environments demonstrate that SF achieves state-of-the-art results, surpassing both 2D- and 3D-based VLAs. SF further accelerates training by up to 3.8x and improves data efficiency across diverse robotic tasks. The project page is at https://spatial-forcing.github.io/
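To make the alignment mechanism concrete, below is a minimal PyTorch sketch of the kind of loss the abstract describes: intermediate VLA visual-token embeddings are projected into the feature space of a frozen 3D foundation model and pulled toward its geometric representations. All names here (`SpatialForcingLoss`, `proj`, `align_weight`, the cosine-similarity objective) are illustrative assumptions, not the paper's exact formulation; the layer choice, target model, and loss form may differ.

```python
# Hedged sketch of an implicit spatial-alignment loss for a VLA.
# Assumes access to intermediate visual-token hidden states from the VLA
# and per-token features from a frozen, pretrained 3D foundation model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialForcingLoss(nn.Module):
    """Aligns intermediate VLA visual embeddings with frozen 3D features."""

    def __init__(self, vla_dim: int, geo_dim: int):
        super().__init__()
        # Lightweight projector mapping VLA embeddings into the
        # geometric feature space of the 3D foundation model.
        self.proj = nn.Linear(vla_dim, geo_dim)

    def forward(self, vla_visual_tokens: torch.Tensor,
                geo_features: torch.Tensor) -> torch.Tensor:
        # vla_visual_tokens: (B, N, vla_dim), from an intermediate VLA layer
        # geo_features:      (B, N, geo_dim), from the frozen 3D model
        pred = F.normalize(self.proj(vla_visual_tokens), dim=-1)
        target = F.normalize(geo_features.detach(), dim=-1)  # no grad to 3D model
        # Negative cosine similarity, averaged over tokens: pushes the VLA's
        # visual embeddings toward the geometric representation.
        return 1.0 - (pred * target).sum(dim=-1).mean()

# Usage sketch: the alignment term is added to the usual action loss.
# align_weight is a hypothetical hyperparameter.
# sf_loss = SpatialForcingLoss(vla_dim=1024, geo_dim=768)
# loss = action_loss + align_weight * sf_loss(hidden_states, geo_feats)
```

Because the 3D model is frozen and only supplies targets, no depth sensors or depth estimators are needed at deployment time: the projector and alignment term are used only during training, consistent with the "implicit" framing in the abstract.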