Tree Search for LLM Agent Reinforcement Learning
Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, Liaoni Wu
Abstract
Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs). In long-term and multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from sparse supervision. To address this challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, where each tree node represents a complete agent interaction step. By sharing common prefixes, tree-search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, we find that the tree-structured trajectory naturally allows the construction of step-wise process supervision signals using only the outcome reward. Based on this, Tree-GRPO estimates grouped relative advantages at both the intra-tree and inter-tree levels. Through theoretical analysis, we demonstrate that the objective of intra-tree group relative policy optimization is equivalent to that of step-level direct preference learning. Experiments across 11 datasets and 3 types of QA tasks demonstrate the superiority of the proposed tree-based RL over chain-based RL methods.
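To make the grouped advantage estimation concrete, the following is a minimal Python sketch of how intra-tree and inter-tree relative advantages could be derived from trajectory trees that carry only leaf-level outcome rewards. This is an illustrative reconstruction, not the paper's implementation: the names (`TreeNode`, `intra_tree_advantages`, `inter_tree_advantages`) are hypothetical, and the specific choices (valuing a shared prefix as the mean outcome reward of the leaves beneath it, and z-score normalization within sibling groups and across the tree group) are assumptions consistent with GRPO-style advantage estimation.

```python
import numpy as np

class TreeNode:
    """One complete agent interaction step; leaves carry outcome rewards."""

    def __init__(self, reward=None):
        self.children = []
        self.reward = reward  # outcome reward, set only on leaf nodes

    def leaf_rewards(self):
        if not self.children:
            return [self.reward]
        return [r for c in self.children for r in c.leaf_rewards()]

    def value(self):
        # Assumed prefix value: mean outcome reward of all trajectories
        # that share this node as a common prefix.
        return float(np.mean(self.leaf_rewards()))

def intra_tree_advantages(root):
    """Step-wise advantage of each branch relative to its siblings (assumed form)."""
    adv = {}
    stack = [root]
    while stack:
        node = stack.pop()
        if node.children:
            sib = [c.value() for c in node.children]
            mu, sd = np.mean(sib), np.std(sib) + 1e-8
            for c in node.children:
                adv[id(c)] = (c.value() - mu) / sd
            stack.extend(node.children)
    return adv

def inter_tree_advantages(roots):
    """Trajectory-level advantage normalized across the whole group of trees."""
    all_r = np.array([r for t in roots for r in t.leaf_rewards()])
    mu, sd = all_r.mean(), all_r.std() + 1e-8
    return [(np.array(t.leaf_rewards()) - mu) / sd for t in roots]

# Toy example: two branches sharing a common prefix, with differing outcomes.
root, good, bad = TreeNode(), TreeNode(), TreeNode()
good.children = [TreeNode(reward=1.0), TreeNode(reward=1.0)]
bad.children = [TreeNode(reward=0.0), TreeNode(reward=1.0)]
root.children = [good, bad]
adv = intra_tree_advantages(root)
print(adv[id(good)], adv[id(bad)])  # the better branch gets a positive advantage
```

In this toy example, the sibling comparison turns a single outcome reward into a step-level signal: the branch whose descendants succeed more often receives a positive advantage at that step, without any learned process reward model.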