TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies
Ruijie Zheng Yongyuan Liang Shuaiyi Huang Jianfeng Gao Hal Daumé III Andrey Kolobov Furong Huang Jianwei Yang

Abstract
Although large vision-language-action (VLA) models pretrained on extensive robot datasets offer promising generalist policies for robotic learning, they still struggle with spatial-temporal dynamics in interactive robotics, making them less effective in handling complex tasks, such as manipulation. In this work, we introduce visual trace prompting, a simple yet effective approach to facilitate VLA models' spatial-temporal awareness for action prediction by encoding state-action trajectories visually. We develop a new TraceVLA model by finetuning OpenVLA on our own collected dataset of 150K robot manipulation trajectories using visual trace prompting. Evaluations of TraceVLA across 137 configurations in SimplerEnv and 4 tasks on a physical WidowX robot demonstrate state-of-the-art performance, outperforming OpenVLA by 10% on SimplerEnv and 3.5x on real-robot tasks and exhibiting robust generalization across diverse embodiments and scenarios. To further validate the effectiveness and generality of our method, we present a compact VLA model based on 4B Phi-3-Vision which, pretrained on Open-X-Embodiment and finetuned on our dataset, rivals the 7B OpenVLA baseline while significantly improving inference efficiency.
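The abstract describes visual trace prompting only as encoding state-action trajectories visually and gives no rendering details. As a rough illustration of the general idea, the sketch below draws the past positions of one tracked point as a polyline on top of a camera frame. Everything in it is an assumption made for illustration and not the authors' implementation: the function names `draw_segment` and `overlay_trace`, the image representation (nested lists of RGB tuples), the trace format (pixel coordinates) and the color.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B)
Image = List[List[Pixel]]      # image[y][x]
Point = Tuple[int, int]        # (x, y) in pixel coordinates


def draw_segment(image: Image, start: Point, end: Point, color: Pixel) -> None:
    """Color the pixels along a straight segment, in place, by linear interpolation."""
    height, width = len(image), len(image[0])
    (x0, y0), (x1, y1) = start, end
    steps = max(abs(x1 - x0), abs(y1 - y0), 1)
    for i in range(steps + 1):
        x = round(x0 + (x1 - x0) * i / steps)
        y = round(y0 + (y1 - y0) * i / steps)
        if 0 <= x < width and 0 <= y < height:  # skip points outside the frame
            image[y][x] = color


def overlay_trace(image: Image, trace: Sequence[Point],
                  color: Pixel = (255, 0, 0)) -> Image:
    """Return a copy of `image` with the polyline through `trace` drawn on top."""
    out = [list(row) for row in image]  # leave the input frame untouched
    for start, end in zip(trace, trace[1:]):
        draw_segment(out, start, end, color)
    return out


# Hypothetical 8x8 black frame and a three-point trace of one tracked point.
frame = [[(0, 0, 0)] * 8 for _ in range(8)]
trace = [(1, 1), (4, 1), (4, 5)]
prompted = overlay_trace(frame, trace)
```

In a setup like this, `prompted` would be the image handed to the model in place of, or alongside, the raw frame. How TraceVLA actually extracts, selects and renders traces is not stated in the abstract.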
Benchmarks
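| Benchmark | Methodology | Metric | Value |
|---|---|---|---|
| robot-manipulation-on-simpler-env | TraceVLA | Variant Aggregation | 0.450 |
| robot-manipulation-on-simpler-env | TraceVLA | Variant Aggregation-Move Near | 0.564 |
| robot-manipulation-on-simpler-env | TraceVLA | Variant Aggregation-Open/Close Drawer | 0.310 |
| robot-manipulation-on-simpler-env | TraceVLA | Variant Aggregation-Pick Coke Can | 0.600 |
| robot-manipulation-on-simpler-env | TraceVLA | Visual Matching | 0.460 |
| robot-manipulation-on-simpler-env | TraceVLA | Visual Matching-Move Near | 0.600 |
| robot-manipulation-on-simpler-env | TraceVLA | Visual Matching-Open/Close Drawer | 0.240 |
| robot-manipulation-on-simpler-env | TraceVLA | Visual Matching-Pick Coke Can | 0.560 |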