Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
Zhixuan Liang Yizhuo Li Tianshuo Yang Chengyue Wu Sitong Mao Liuao Pei Xiaokang Yang Jiangmiao Pang Yao Mu Ping Luo

Abstract
Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions to robot actions. However, prevailing VLA decoders either generate actions autoregressively in a fixed left-to-right order or attach continuous diffusion or flow matching heads outside the backbone, demanding specialized training and iterative sampling that hinder a unified, scalable architecture. We present Discrete Diffusion VLA, a single-transformer policy that models discretized action chunks with discrete diffusion and is trained with the same cross-entropy objective as the VLM backbone. The design retains diffusion's progressive refinement paradigm while remaining natively compatible with the discrete token interface of VLMs. Our method achieves an adaptive decoding order that resolves easy action elements before harder ones and uses secondary remasking to revisit uncertain predictions across refinement rounds, which improves consistency and enables robust error correction. This unified decoder preserves pretrained vision-language priors, supports parallel decoding, breaks the autoregressive bottleneck, and reduces the number of function evaluations. Discrete Diffusion VLA achieves 96.3% avg. SR on LIBERO, 71.2% visual matching on SimplerEnv Fractal and 49.3% overall on SimplerEnv Bridge, improving over both autoregressive and continuous diffusion baselines. These findings indicate that the discrete diffusion action decoder supports precise action modeling and consistent training, laying the groundwork for scaling VLA to larger models and datasets.
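To make the decoding scheme described above concrete, the following is a minimal sketch of confidence-ordered discrete diffusion decoding with secondary remasking, not the authors' exact algorithm. The model interface `model(obs_tokens, action_tokens)`, the `mask_id` token, and the `num_rounds` and `remask_ratio` parameters are illustrative assumptions; only the overall procedure (start fully masked, commit easy tokens first in parallel, re-mask low-confidence commitments for later rounds) follows the abstract.

```python
import torch

def discrete_diffusion_decode(model, obs_tokens, chunk_len, mask_id,
                              num_rounds=4, remask_ratio=0.1):
    """Parallel discrete-diffusion decoding of one action chunk (sketch).

    Assumes `model(obs_tokens, action_tokens)` returns per-position logits
    of shape (chunk_len, vocab_size); names and defaults are illustrative.
    """
    # Start from a fully masked action chunk.
    actions = torch.full((chunk_len,), mask_id, dtype=torch.long)

    for r in range(num_rounds):
        logits = model(obs_tokens, actions)            # (chunk_len, vocab_size)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)

        if r == num_rounds - 1:
            # Final round: commit predictions at every still-masked position.
            still_masked = actions == mask_id
            actions[still_masked] = pred[still_masked]
            break

        # Adaptive order: commit the most confident ("easy") tokens first.
        num_keep = max(1, int(chunk_len * (r + 1) / num_rounds))
        keep = conf.topk(num_keep).indices
        actions = torch.full_like(actions, mask_id)
        actions[keep] = pred[keep]

        # Secondary remasking: re-open the least confident committed tokens
        # so later refinement rounds can revisit and correct them.
        num_remask = max(1, int(remask_ratio * num_keep))
        weakest = conf[keep].topk(num_remask, largest=False).indices
        actions[keep[weakest]] = mask_id

    return actions  # discretized action token ids, ready for de-tokenization
```

Because every position is predicted in parallel within each round, the number of function evaluations is `num_rounds` rather than one per action token, which is the source of the speedup over left-to-right autoregressive decoding claimed in the abstract.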