VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators

Abstract
Vision-Language-Action (VLA) models enable embodied decision-making but rely heavily on imitation learning, leading to compounding errors and poor robustness under distribution shift. Reinforcement learning (RL) can mitigate these issues yet typically demands costly real-world interactions or suffers from sim-to-real gaps. We introduce VLA-RFT, a reinforcement fine-tuning framework that leverages a data-driven world model as a controllable simulator. Trained from real interaction data, the simulator predicts future visual observations conditioned on actions, allowing policy rollouts with dense, trajectory-level rewards derived from goal-achieving references. This design delivers an efficient and action-aligned learning signal, drastically lowering sample requirements. With fewer than 400 fine-tuning steps, VLA-RFT surpasses strong supervised baselines and achieves greater efficiency than simulator-based RL. Moreover, it exhibits strong robustness under perturbed conditions, sustaining stable task execution. Our results establish world-model-based RFT as a practical post-training paradigm to enhance the generalization and robustness of VLA models. For more details, please refer to https://vla-rft.github.io/.
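
To make the described rollout-and-reward loop concrete, below is a minimal, illustrative sketch of how a policy could be rolled out inside a learned world model and scored against a goal-achieving reference trajectory. This is not the authors' implementation: the callable names (policy, world_model), the horizon, and the specific step reward (negative L2 distance to the reference frame) are all assumptions made for illustration.

```python
import torch

def rollout_with_dense_reward(policy, world_model, instruction, obs,
                              reference, horizon=16):
    """Roll the policy out inside a learned world model and score each step
    against a goal-achieving reference trajectory (dense, per-step reward)."""
    log_probs, rewards = [], []
    for t in range(horizon):
        # Policy proposes an action conditioned on the instruction and current observation.
        action, log_prob = policy(instruction, obs)
        # World model predicts the next visual observation given the action.
        obs = world_model(obs, action)
        # One possible dense reward: closeness to the reference frame at step t.
        rewards.append(-torch.norm(obs - reference[t]))
        log_probs.append(log_prob)
    # Trajectory-level return used as the reinforcement fine-tuning signal.
    return torch.stack(log_probs), torch.stack(rewards).sum()

# Toy usage with dummy stand-ins for the policy and world model.
obs0 = torch.zeros(3, 64, 64)
reference = [torch.randn(3, 64, 64) for _ in range(16)]
policy = lambda instr, o: (torch.randn(7), torch.tensor(0.0))
world_model = lambda o, a: o + 0.01 * torch.randn_like(o)
log_probs, ret = rollout_with_dense_reward(policy, world_model,
                                           "pick up the red block", obs0,
                                           reference)
print(ret)
```

Because every step of the rollout is scored, the policy receives a dense, action-aligned signal without real-world interaction, which is what allows the abstract's claim of effective fine-tuning in relatively few steps.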