Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, Liaoni Wu

Abstract
Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs). In long-term, multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from sparse supervision. To address this challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, where each tree node represents a complete agent interaction step. By sharing common prefixes, tree-search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, we find that the tree-structured trajectory naturally allows the construction of step-wise process supervision signals even when using only the outcome reward. Based on this, Tree-GRPO estimates grouped relative advantages at both the intra-tree and inter-tree levels. Through theoretical analysis, we show that the objective of intra-tree group relative policy optimization is equivalent to that of step-level direct preference learning. Experiments across 11 datasets and 3 types of QA tasks demonstrate the superiority of the proposed tree-based RL over chain-based RL methods.
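The abstract describes two levels of grouped relative advantage estimation: intra-tree (among sibling branches that share a common prefix) and inter-tree (among whole trees sampled for the same prompt). The sketch below illustrates one plausible reading of that scheme; the `Node` structure, the subtree-mean aggregation, and the mean/std normalization are assumptions made for illustration, not the paper's exact formulation.

```python
# Minimal sketch of grouped relative advantage estimation over tree-structured
# rollouts. The aggregation (mean of subtree leaf rewards) and GRPO-style
# normalization are assumptions, not the paper's definitions.
from dataclasses import dataclass, field
from statistics import fmean, pstdev
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One complete agent interaction step; leaves carry the outcome reward."""
    reward: Optional[float] = None            # outcome reward, set on leaves only
    children: List["Node"] = field(default_factory=list)


def leaf_rewards(node: Node) -> List[float]:
    """Collect outcome rewards of all leaves at or below this node."""
    if not node.children:
        return [node.reward if node.reward is not None else 0.0]
    rewards: List[float] = []
    for child in node.children:
        rewards.extend(leaf_rewards(child))
    return rewards


def normalize(values: List[float]) -> List[float]:
    """Group-relative normalization: subtract the mean, divide by the std."""
    mean, std = fmean(values), pstdev(values) or 1.0
    return [(v - mean) / std for v in values]


def intra_tree_advantages(root: Node) -> List[Tuple[Node, float]]:
    """Advantage of each child branch relative to its siblings.

    Sibling branches share the prefix ending at their parent, so comparing
    their subtree returns yields a step-wise preference signal derived only
    from outcome rewards.
    """
    out: List[Tuple[Node, float]] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if len(node.children) > 1:
            branch_returns = [fmean(leaf_rewards(c)) for c in node.children]
            out.extend(zip(node.children, normalize(branch_returns)))
        stack.extend(node.children)
    return out


def inter_tree_advantages(roots: List[Node]) -> List[float]:
    """Advantage of each whole tree relative to the other trees in the group."""
    return normalize([fmean(leaf_rewards(r)) for r in roots])
```

In this reading, the intra-tree term gives each branching point a local preference among its continuations, while the inter-tree term preserves the trajectory-level signal of standard group relative policy optimization.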