villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models

Abstract
Visual-Language-Action (VLA) models have emerged as a popular paradigm for learning robot manipulation policies that can follow language instructions and generalize to novel scenarios. Recent work has begun to explore the incorporation of latent actions, an abstract representation of visual change between two frames, into VLA pre-training. In this paper, we introduce villa-X, a novel Visual-Language-Latent-Action (ViLLA) framework that advances latent action modeling for learning generalizable robot manipulation policies. Our approach improves both how latent actions are learned and how they are incorporated into VLA pre-training. Together, these contributions enable villa-X to achieve superior performance across simulated environments including SIMPLER and LIBERO, as well as on two real-world robot setups including gripper and dexterous hand manipulation. We believe the ViLLA paradigm holds significant promise, and that our villa-X provides a strong foundation for future research.
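The abstract describes latent actions as an abstract representation of the visual change between two frames. As a rough illustration of that general idea (not the villa-X architecture itself), the sketch below pairs an inverse-dynamics encoder with a forward-dynamics decoder over a small discrete codebook, in the style of prior latent-action work; all module names, layer sizes, and the VQ formulation are assumptions made for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionModel(nn.Module):
    """Hypothetical sketch: an inverse-dynamics encoder plus a
    forward-dynamics decoder over a small discrete codebook, so the
    quantized latent is forced to capture the change between two
    frames rather than the frame contents themselves."""

    def __init__(self, codebook_size=32, latent_dim=128):
        super().__init__()
        # IDM: encode the stacked frame pair (2 x RGB = 6 channels) to a vector.
        self.idm = nn.Sequential(
            nn.Conv2d(6, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )
        # Small codebook: the latent action vocabulary is deliberately tiny.
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        # FDM: reconstruct the next frame from the current frame + latent action.
        self.fdm = nn.Sequential(
            nn.Conv2d(3 + latent_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, obs_t, obs_t1):
        # Infer a continuous latent from the frame pair, then quantize it.
        z = self.idm(torch.cat([obs_t, obs_t1], dim=1))
        dists = torch.cdist(z, self.codebook.weight)   # (B, codebook_size)
        idx = dists.argmin(dim=-1)                     # discrete action ids
        z_q = self.codebook(idx)
        # Standard VQ losses plus a straight-through estimator for the encoder.
        vq_loss = F.mse_loss(z_q, z.detach()) + 0.25 * F.mse_loss(z, z_q.detach())
        z_st = z + (z_q - z).detach()
        # Broadcast the latent action over the spatial grid and decode o_{t+1}.
        z_map = z_st[:, :, None, None].expand(-1, -1, *obs_t.shape[-2:])
        recon_loss = F.mse_loss(self.fdm(torch.cat([obs_t, z_map], dim=1)), obs_t1)
        return recon_loss + vq_loss, idx

# Usage: train on unlabeled frame pairs from video; the discrete ids can then
# serve as pseudo action labels during VLA pre-training.
model = LatentActionModel()
o_t, o_t1 = torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64)
loss, action_ids = model(o_t, o_t1)
loss.backward()
```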