Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, Liaoni Wu

Abstract
Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs). In long-term, multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from sparse supervision. To address this challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, where each tree node represents a complete agent interaction step. By sharing common prefixes, tree-search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, we find that the tree-structured trajectory naturally allows the construction of step-wise process supervision signals even when using only the outcome reward. Based on this, Tree-GRPO estimates grouped relative advantages at both the intra-tree and inter-tree levels. Through theoretical analysis, we show that the objective of intra-tree group relative policy optimization is equivalent to that of step-level direct preference learning. Experiments across 11 datasets and 3 types of QA tasks demonstrate the superiority of the proposed tree-based RL over chain-based RL methods.
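The abstract describes two levels of grouped relative advantage estimation: intra-tree (among sibling branches that share a common prefix) and inter-tree (among whole trees sampled for the same prompt). The sketch below illustrates one plausible reading of that scheme; the `Node` structure, the subtree-mean aggregation, and the mean/std normalization are assumptions made for illustration, not the paper's exact formulation.

```python
# Minimal sketch of grouped relative advantage estimation over tree-structured
# rollouts. The aggregation (mean of subtree leaf rewards) and GRPO-style
# normalization are assumptions, not the paper's definitions.
from dataclasses import dataclass, field
from statistics import fmean, pstdev
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One complete agent interaction step; leaves carry the outcome reward."""
    reward: Optional[float] = None            # outcome reward, set on leaves only
    children: List["Node"] = field(default_factory=list)


def leaf_rewards(node: Node) -> List[float]:
    """Collect outcome rewards of all leaves at or below this node."""
    if not node.children:
        return [node.reward if node.reward is not None else 0.0]
    rewards: List[float] = []
    for child in node.children:
        rewards.extend(leaf_rewards(child))
    return rewards


def normalize(values: List[float]) -> List[float]:
    """Group-relative normalization: subtract the mean, divide by the std."""
    mean, std = fmean(values), pstdev(values) or 1.0
    return [(v - mean) / std for v in values]


def intra_tree_advantages(root: Node) -> List[Tuple[Node, float]]:
    """Advantage of each child branch relative to its siblings.

    Sibling branches share the prefix ending at their parent, so comparing
    their subtree returns yields a step-wise preference signal derived only
    from outcome rewards.
    """
    out: List[Tuple[Node, float]] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if len(node.children) > 1:
            branch_returns = [fmean(leaf_rewards(c)) for c in node.children]
            out.extend(zip(node.children, normalize(branch_returns)))
        stack.extend(node.children)
    return out


def inter_tree_advantages(roots: List[Node]) -> List[float]:
    """Advantage of each whole tree relative to the other trees in the group."""
    return normalize([fmean(leaf_rewards(r)) for r in roots])
```

In this reading, the intra-tree term gives each branching point a local preference among its continuations, while the inter-tree term preserves the trajectory-level signal of standard group relative policy optimization.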