Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
Zhixuan Liang Yizhuo Li Tianshuo Yang Chengyue Wu Sitong Mao Liuao Pei Xiaokang Yang Jiangmiao Pang Yao Mu Ping Luo

Abstract
Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions to robot actions. However, prevailing VLA decoders either generate actions autoregressively in a fixed left-to-right order or attach continuous diffusion or flow matching heads outside the backbone, demanding specialized training and iterative sampling that hinder a unified, scalable architecture. We present Discrete Diffusion VLA, a single-transformer policy that models discretized action chunks with discrete diffusion and is trained with the same cross-entropy objective as the VLM backbone. The design retains diffusion's progressive refinement paradigm while remaining natively compatible with the discrete token interface of VLMs. Our method achieves an adaptive decoding order that resolves easy action elements before harder ones and uses secondary remasking to revisit uncertain predictions across refinement rounds, which improves consistency and enables robust error correction. This unified decoder preserves pretrained vision-language priors, supports parallel decoding, breaks the autoregressive bottleneck, and reduces the number of function evaluations. Discrete Diffusion VLA achieves 96.3% avg. SR on LIBERO, 71.2% visual matching on SimplerEnv Fractal and 49.3% overall on SimplerEnv Bridge, improving over both autoregressive and continuous diffusion baselines. These findings indicate that the discrete diffusion action decoder supports precise action modeling and consistent training, laying the groundwork for scaling VLA to larger models and datasets.
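To make the decoding scheme described above concrete, the following is a minimal sketch of confidence-ordered discrete diffusion decoding with secondary remasking, not the authors' exact algorithm. The model interface `model(obs_tokens, action_tokens)`, the `mask_id` token, and the `num_rounds` and `remask_ratio` parameters are illustrative assumptions; only the overall procedure (start fully masked, commit easy tokens first in parallel, re-mask low-confidence commitments for later rounds) follows the abstract.

```python
import torch

def discrete_diffusion_decode(model, obs_tokens, chunk_len, mask_id,
                              num_rounds=4, remask_ratio=0.1):
    """Parallel discrete-diffusion decoding of one action chunk (sketch).

    Assumes `model(obs_tokens, action_tokens)` returns per-position logits
    of shape (chunk_len, vocab_size); names and defaults are illustrative.
    """
    # Start from a fully masked action chunk.
    actions = torch.full((chunk_len,), mask_id, dtype=torch.long)

    for r in range(num_rounds):
        logits = model(obs_tokens, actions)            # (chunk_len, vocab_size)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)

        if r == num_rounds - 1:
            # Final round: commit predictions at every still-masked position.
            still_masked = actions == mask_id
            actions[still_masked] = pred[still_masked]
            break

        # Adaptive order: commit the most confident ("easy") tokens first.
        num_keep = max(1, int(chunk_len * (r + 1) / num_rounds))
        keep = conf.topk(num_keep).indices
        actions = torch.full_like(actions, mask_id)
        actions[keep] = pred[keep]

        # Secondary remasking: re-open the least confident committed tokens
        # so later refinement rounds can revisit and correct them.
        num_remask = max(1, int(remask_ratio * num_keep))
        weakest = conf[keep].topk(num_remask, largest=False).indices
        actions[keep[weakest]] = mask_id

    return actions  # discretized action token ids, ready for de-tokenization
```

Because every position is predicted in parallel within each round, the number of function evaluations is `num_rounds` rather than one per action token, which is the source of the speedup over left-to-right autoregressive decoding claimed in the abstract.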