

villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models


Abstract

Visual-Language-Action (VLA) models have emerged as a popular paradigm for learning robot manipulation policies that can follow language instructions and generalize to novel scenarios. Recent work has begun to explore the incorporation of latent actions, an abstract representation of visual change between two frames, into VLA pre-training. In this paper, we introduce villa-X, a novel Visual-Language-Latent-Action (ViLLA) framework that advances latent action modeling for learning generalizable robot manipulation policies. Our approach improves both how latent actions are learned and how they are incorporated into VLA pre-training. Together, these contributions enable villa-X to achieve superior performance across simulated environments including SIMPLER and LIBERO, as well as on two real-world robot setups including gripper and dexterous hand manipulation. We believe the ViLLA paradigm holds significant promise, and that our villa-X provides a strong foundation for future research.
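The abstract's key technical notion, a latent action as an abstract representation of the visual change between two frames, is commonly learned by pairing an inverse-dynamics encoder with a forward model and vector-quantizing the bottleneck. The sketch below is not the paper's architecture; it is a minimal PyTorch illustration of that general recipe, and every module name, layer size, codebook size, and the 64x64 input resolution are illustrative assumptions rather than villa-X's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Nearest-neighbour vector quantization with a straight-through estimator."""

    def __init__(self, num_codes: int = 16, dim: int = 128):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):
        # z: (B, dim). Snap each vector to its closest codebook entry.
        idx = torch.cdist(z, self.codebook.weight).argmin(dim=-1)  # (B,)
        z_q = self.codebook(idx)                                   # (B, dim)
        # Codebook + commitment losses, as in VQ-VAE.
        vq_loss = F.mse_loss(z_q, z.detach()) + 0.25 * F.mse_loss(z, z_q.detach())
        # Straight-through: gradients pass to the encoder unchanged.
        return z + (z_q - z).detach(), idx, vq_loss


def conv_encoder(in_ch: int, dim: int) -> nn.Sequential:
    """Small CNN mapping a (B, in_ch, 64, 64) image stack to a (B, dim) vector."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 4, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
    )


class LatentActionModel(nn.Module):
    """Inverse-dynamics encoder + quantizer + forward model (all sizes illustrative)."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.inverse = conv_encoder(6, dim)   # sees both frames stacked channel-wise
        self.context = conv_encoder(3, dim)   # sees only the current frame o_t
        self.quantizer = VectorQuantizer(dim=dim)
        # Forward model: reconstruct o_{t+k} from o_t's features plus the latent action.
        self.decoder = nn.Sequential(
            nn.Linear(dim, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )

    def forward(self, o_t, o_tk):
        z = self.inverse(torch.cat([o_t, o_tk], dim=1))  # encode the visual change
        z_q, idx, vq_loss = self.quantizer(z)            # discrete latent action
        o_pred = self.decoder(z_q + self.context(o_t))   # predict the future frame
        return F.mse_loss(o_pred, o_tk) + vq_loss, idx


# Smoke test on random 64x64 frames.
model = LatentActionModel()
o_t, o_tk = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
loss, latent_idx = model(o_t, o_tk)
loss.backward()
```

Because the quantizer squeezes the frame pair through a small discrete code while the forward model must still reconstruct o_{t+k}, the code is pushed to capture the action-relevant change between frames; such codes are what latent-action approaches like ViLLA use as additional supervision during VLA pre-training.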
