VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model

Abstract
Vision-Language-Action (VLA) models typically bridge the gap between perceptual and action spaces by pre-training a large-scale Vision-Language Model (VLM) on robotic data. While this approach greatly enhances performance, it also incurs significant training costs. In this paper, we investigate how to effectively bridge vision-language (VL) representations to action (A). We introduce VLA-Adapter, a novel paradigm designed to reduce the reliance of VLA models on large-scale VLMs and extensive pre-training. To this end, we first systematically analyze the effectiveness of various VL conditions and present key findings on which conditions are essential for bridging perception and action spaces. Based on these insights, we propose a lightweight Policy module with Bridge Attention, which autonomously injects the optimal condition into the action space. In this way, our method achieves high performance using only a 0.5B-parameter backbone, without any robotic data pre-training. Extensive experiments on both simulated and real-world robotic benchmarks demonstrate that VLA-Adapter not only achieves state-of-the-art-level performance, but also offers the fastest inference speed reported to date. Furthermore, thanks to the proposed advanced bridging paradigm, VLA-Adapter enables the training of a powerful VLA model in just 8 hours on a single consumer-grade GPU, greatly lowering the barrier to deploying the VLA model. Project page: https://vla-adapter.github.io/.
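To make the bridging idea concrete, the following is a minimal, hypothetical sketch of what a Bridge Attention block could look like: action queries cross-attend to VL features from the frozen backbone, and a learnable gate controls how strongly the VL condition is injected into the action space. The class name, tensor shapes, zero-initialized gating, and layer choices are illustrative assumptions based only on the abstract, not the authors' released implementation.

```python
# Hypothetical sketch of a "Bridge Attention" block (assumptions, not the
# paper's official code): action queries cross-attend to VL features and a
# learnable, zero-initialized gate decides how much of the VL condition is
# injected into the action space.
import torch
import torch.nn as nn

class BridgeAttention(nn.Module):
    def __init__(self, d_action=256, d_vl=896, n_heads=4):
        super().__init__()
        # Project VL features (e.g., tokens from a small VLM backbone) to the policy width.
        self.vl_proj = nn.Linear(d_vl, d_action)
        self.cross_attn = nn.MultiheadAttention(d_action, n_heads, batch_first=True)
        # Gate starts at zero so the policy initially relies on its own action
        # queries and gradually learns how much VL condition to admit.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, action_queries, vl_features):
        # action_queries: (B, N_act, d_action); vl_features: (B, N_vl, d_vl)
        kv = self.vl_proj(vl_features)
        attended, _ = self.cross_attn(action_queries, kv, kv)
        # Gated residual injection of the VL condition into the action space.
        return action_queries + torch.tanh(self.gate) * attended

# Usage with dummy tensors (shapes are placeholders).
bridge = BridgeAttention()
actions = torch.randn(2, 8, 256)    # 8 action-chunk queries
vl_feats = torch.randn(2, 64, 896)  # intermediate VLM tokens
print(bridge(actions, vl_feats).shape)  # torch.Size([2, 8, 256])
```

The gated-residual form is one plausible way for the policy to "autonomously" weight VL conditions per layer; the paper's actual mechanism for selecting the optimal condition may differ.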