TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling

Abstract
Recent advancements in aligning large language models via reinforcement learning have achieved remarkable gains in solving complex reasoning problems, but at the cost of expensive on-policy rollouts and limited exploration of diverse reasoning paths. In this work, we introduce TreePO, which incorporates a self-guided rollout algorithm that views sequence generation as a tree-structured search process. Composed of a dynamic tree sampling policy and fixed-length segment decoding, TreePO leverages local uncertainty to warrant additional branches. By amortizing computation across common prefixes and pruning low-value paths early, TreePO effectively reduces the per-update compute burden while preserving or enhancing exploration diversity. Key contributions include: (1) a segment-wise sampling algorithm that alleviates the KV-cache burden through contiguous segments and spawns new branches along with an early-stop mechanism; (2) a tree-based segment-level advantage estimation that considers both global and local proximal policy optimization; and (3) an analysis of the effectiveness of probability- and quality-driven dynamic divergence and the fallback strategy. We empirically validate the performance gains of TreePO on a set of reasoning benchmarks, showing that the sampling design saves 22% to 43% of GPU hours for the trained models, while reducing sampling compute by up to 40% at the trajectory level and 35% at the token level for existing models. While offering a free lunch in inference efficiency, TreePO reveals a practical path toward scaling RL-based post-training with fewer samples and less compute. The project home page is at https://m-a-p.ai/TreePO.
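To make the segment-wise, tree-structured rollout concrete, the following is a minimal sketch of the general idea described in the abstract: decode in fixed-length segments, spawn an extra branch from the shared prefix when local uncertainty is high, and prune low-value partial trajectories early. The hooks `decode_segment` and `path_score`, as well as the thresholds, are assumptions for illustration and not the paper's actual interface.

```python
# Hypothetical sketch of TreePO-style segment-wise tree sampling (illustrative only).
# `decode_segment(path)` is assumed to decode one fixed-length segment and return
# (segment_tokens, mean_token_entropy, finished); `path_score(path)` is assumed to
# score a partial trajectory. Both are placeholders, not the paper's API.

def tree_rollout(prompt_tokens, decode_segment, path_score,
                 max_segments=16, branch_entropy=2.0, max_live=8):
    """Grow a rollout tree segment by segment: branch where local uncertainty is high,
    prune low-scoring paths early, and reuse the shared prefix across sibling branches."""
    live = [list(prompt_tokens)]
    finished = []
    for _ in range(max_segments):
        children = []
        for path in live:
            segment, entropy, done = decode_segment(path)
            child = path + segment            # prefix `path` is shared across siblings (KV-cache reuse)
            (finished if done else children).append(child)
            if not done and entropy > branch_entropy:
                # High local uncertainty warrants an additional sibling branch from the same prefix.
                extra, _, extra_done = decode_segment(path)
                (finished if extra_done else children).append(path + extra)
        # Early stop: keep only the most promising partial trajectories.
        live = sorted(children, key=path_score, reverse=True)[:max_live]
        if not live:
            break
    return finished + live
```

In this sketch the compute amortization comes from expanding many sibling branches off one shared prefix, so prefix tokens are decoded (and cached) once rather than per trajectory; the actual TreePO rollout and advantage-estimation details are given in the paper itself.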