villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models

Abstract
Visual-Language-Action (VLA) models have emerged as a popular paradigm for learning robot manipulation policies that can follow language instructions and generalize to novel scenarios. Recent work has begun to explore the incorporation of latent actions, an abstract representation of visual change between two frames, into VLA pre-training. In this paper, we introduce villa-X, a novel Visual-Language-Latent-Action (ViLLA) framework that advances latent action modeling for learning generalizable robot manipulation policies. Our approach improves both how latent actions are learned and how they are incorporated into VLA pre-training. Together, these contributions enable villa-X to achieve superior performance across simulated environments including SIMPLER and LIBERO, as well as on two real-world robot setups including gripper and dexterous hand manipulation. We believe the ViLLA paradigm holds significant promise, and that our villa-X provides a strong foundation for future research.
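The abstract describes latent actions as an abstract representation of the visual change between two frames. As a rough illustration of that general idea (not the villa-X architecture itself), the sketch below pairs an inverse-dynamics encoder with a forward-dynamics decoder over a small discrete codebook, in the style of prior latent-action work; all module names, layer sizes, and the VQ formulation are assumptions made for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionModel(nn.Module):
    """Hypothetical sketch: an inverse-dynamics encoder plus a
    forward-dynamics decoder over a small discrete codebook, so the
    quantized latent is forced to capture the change between two
    frames rather than the frame contents themselves."""

    def __init__(self, codebook_size=32, latent_dim=128):
        super().__init__()
        # IDM: encode the stacked frame pair (2 x RGB = 6 channels) to a vector.
        self.idm = nn.Sequential(
            nn.Conv2d(6, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )
        # Small codebook: the latent action vocabulary is deliberately tiny.
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        # FDM: reconstruct the next frame from the current frame + latent action.
        self.fdm = nn.Sequential(
            nn.Conv2d(3 + latent_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, obs_t, obs_t1):
        # Infer a continuous latent from the frame pair, then quantize it.
        z = self.idm(torch.cat([obs_t, obs_t1], dim=1))
        dists = torch.cdist(z, self.codebook.weight)   # (B, codebook_size)
        idx = dists.argmin(dim=-1)                     # discrete action ids
        z_q = self.codebook(idx)
        # Standard VQ losses plus a straight-through estimator for the encoder.
        vq_loss = F.mse_loss(z_q, z.detach()) + 0.25 * F.mse_loss(z, z_q.detach())
        z_st = z + (z_q - z).detach()
        # Broadcast the latent action over the spatial grid and decode o_{t+1}.
        z_map = z_st[:, :, None, None].expand(-1, -1, *obs_t.shape[-2:])
        recon_loss = F.mse_loss(self.fdm(torch.cat([obs_t, z_map], dim=1)), obs_t1)
        return recon_loss + vq_loss, idx

# Usage: train on unlabeled frame pairs from video; the discrete ids can then
# serve as pseudo action labels during VLA pre-training.
model = LatentActionModel()
o_t, o_t1 = torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64)
loss, action_ids = model(o_t, o_t1)
loss.backward()
```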