VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators


Abstract

Vision-Language-Action (VLA) models enable embodied decision-making but rely heavily on imitation learning, leading to compounding errors and poor robustness under distribution shift. Reinforcement learning (RL) can mitigate these issues yet typically demands costly real-world interactions or suffers from sim-to-real gaps. We introduce VLA-RFT, a reinforcement fine-tuning framework that leverages a data-driven world model as a controllable simulator. Trained from real interaction data, the simulator predicts future visual observations conditioned on actions, allowing policy rollouts with dense, trajectory-level rewards derived from goal-achieving references. This design delivers an efficient and action-aligned learning signal, drastically lowering sample requirements. With fewer than 400 fine-tuning steps, VLA-RFT surpasses strong supervised baselines and achieves greater efficiency than simulator-based RL. Moreover, it exhibits strong robustness under perturbed conditions, sustaining stable task execution. Our results establish world-model-based RFT as a practical post-training paradigm to enhance the generalization and robustness of VLA models. For more details, please refer to https://vla-rft.github.io/.
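The abstract describes the training loop only at a high level: roll the policy out inside a learned world model, score the imagined trajectory against a goal-achieving reference, and update the policy from that reward. The sketch below illustrates such a loop under simplifying assumptions; the tiny MLP world model, Gaussian action head, distance-based trajectory reward, and REINFORCE-style update are all illustrative stand-ins, not the paper's actual architecture or objective.

```python
import torch
import torch.nn as nn

# Illustrative dimensions; the real VLA policy and world model are large
# pretrained networks operating on images and language, not these toy MLPs.
OBS_DIM, ACT_DIM, HORIZON = 32, 7, 8

class WorldModel(nn.Module):
    """Predicts the next (latent) observation from the current one and an action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM + ACT_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, OBS_DIM))
    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

class Policy(nn.Module):
    """Gaussian action head standing in for the VLA policy's action decoder."""
    def __init__(self):
        super().__init__()
        self.mean = nn.Linear(OBS_DIM, ACT_DIM)
        self.log_std = nn.Parameter(torch.zeros(ACT_DIM))
    def forward(self, obs):
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())

world_model, policy = WorldModel(), Policy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def trajectory_reward(rollout, reference):
    # Dense, per-step reward: negative squared distance between the imagined
    # rollout and a goal-achieving reference trajectory (an assumed reward
    # shape, used only to illustrate the verified-reward idea).
    return -(torch.stack(rollout) - reference).pow(2).mean(-1)

# One reinforcement fine-tuning step: roll the policy out inside the world
# model, score the imagined trajectory, and apply a REINFORCE-style update.
obs = torch.randn(1, OBS_DIM)                    # initial observation (dummy)
reference = torch.randn(HORIZON, 1, OBS_DIM)     # goal-achieving reference (dummy)

log_probs, rollout = [], []
for t in range(HORIZON):
    dist = policy(obs)
    action = dist.sample()
    log_probs.append(dist.log_prob(action).sum(-1))
    obs = world_model(obs, action)               # imagined next observation
    rollout.append(obs)

rewards = trajectory_reward(rollout, reference)  # shape: (HORIZON, 1)
returns = rewards.flip(0).cumsum(0).flip(0)      # reward-to-go per step
loss = -(torch.stack(log_probs) * returns.detach()).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because the rollout happens entirely inside the learned simulator, each update consumes no new real-world interactions, which is what allows the reported sample efficiency (fewer than 400 fine-tuning steps) in the paper's setting.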
