TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling

Abstract
Recent advancements in aligning large language models via reinforcement learning have achieved remarkable gains in solving complex reasoning problems, but at the cost of expensive on-policy rollouts and limited exploration of diverse reasoning paths. In this work, we introduce TreePO, which incorporates a self-guided rollout algorithm that views sequence generation as a tree-structured search process. Composed of a dynamic tree sampling policy and fixed-length segment decoding, TreePO leverages local uncertainty to warrant additional branches. By amortizing computation across common prefixes and pruning low-value paths early, TreePO effectively reduces the per-update compute burden while preserving or enhancing exploration diversity. Key contributions include: (1) a segment-wise sampling algorithm that alleviates the KV-cache burden through contiguous segments and spawns new branches along with an early-stop mechanism; (2) a tree-based segment-level advantage estimation that considers both global and local proximal policy optimization; and (3) an analysis of the effectiveness of probability- and quality-driven dynamic divergence and the fallback strategy. We empirically validate the performance gains of TreePO on a set of reasoning benchmarks, showing that the sampling design saves 22% to 43% of GPU hours for the trained models, while reducing sampling compute by up to 40% at the trajectory level and 35% at the token level for existing models. While offering a free lunch in inference efficiency, TreePO reveals a practical path toward scaling RL-based post-training with fewer samples and less compute. The project home page is at https://m-a-p.ai/TreePO.
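To make the segment-wise, tree-structured rollout concrete, the following is a minimal sketch of the general idea described in the abstract: decode in fixed-length segments, spawn an extra branch from the shared prefix when local uncertainty is high, and prune low-value partial trajectories early. The hooks `decode_segment` and `path_score`, as well as the thresholds, are assumptions for illustration and not the paper's actual interface.

```python
# Hypothetical sketch of TreePO-style segment-wise tree sampling (illustrative only).
# `decode_segment(path)` is assumed to decode one fixed-length segment and return
# (segment_tokens, mean_token_entropy, finished); `path_score(path)` is assumed to
# score a partial trajectory. Both are placeholders, not the paper's API.

def tree_rollout(prompt_tokens, decode_segment, path_score,
                 max_segments=16, branch_entropy=2.0, max_live=8):
    """Grow a rollout tree segment by segment: branch where local uncertainty is high,
    prune low-scoring paths early, and reuse the shared prefix across sibling branches."""
    live = [list(prompt_tokens)]
    finished = []
    for _ in range(max_segments):
        children = []
        for path in live:
            segment, entropy, done = decode_segment(path)
            child = path + segment            # prefix `path` is shared across siblings (KV-cache reuse)
            (finished if done else children).append(child)
            if not done and entropy > branch_entropy:
                # High local uncertainty warrants an additional sibling branch from the same prefix.
                extra, _, extra_done = decode_segment(path)
                (finished if extra_done else children).append(path + extra)
        # Early stop: keep only the most promising partial trajectories.
        live = sorted(children, key=path_score, reverse=True)[:max_live]
        if not live:
            break
    return finished + live
```

In this sketch the compute amortization comes from expanding many sibling branches off one shared prefix, so prefix tokens are decoded (and cached) once rather than per trajectory; the actual TreePO rollout and advantage-estimation details are given in the paper itself.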