HyperAIHyperAI

Command Palette

Search for a command to run...

a month ago

ExGRPO: Learning to Reason from Experience

Runzhe Zhan Yafu Li Zhi Wang Xiaoye Qu Dongrui Liu Jing Shao Derek F. Wong Yu Cheng

ExGRPO: Learning to Reason from Experience

Abstract

Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigmfor improving the reasoning ability of large language models. However, standardon-policy training discards rollout experiences after a single update, leadingto computational inefficiency and instability. While prior work on RL hashighlighted the benefits of reusing past experience, the role of experiencecharacteristics in shaping learning dynamics of large reasoning models remainsunderexplored. In this paper, we are the first to investigate what makes areasoning experience valuable and identify rollout correctness and entropy aseffective indicators of experience value. Based on these insights, we proposeExGRPO (Experiential Group Relative Policy Optimization), a framework thatorganizes and prioritizes valuable experiences, and employs a mixed-policyobjective to balance exploration with experience exploitation. Experiments onfive backbone models (1.5B-8B parameters) show that ExGRPO consistentlyimproves reasoning performance on mathematical/general benchmarks, with anaverage gain of +3.5/7.6 points over on-policy RLVR. Moreover, ExGRPOstabilizes training on both stronger and weaker models where on-policy methodsfail. These results highlight principled experience management as a keyingredient for efficient and scalable RLVR.

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding
Ready-to-use GPUs
Best Pricing
Get Started

Hyper Newsletters

Subscribe to our latest updates
We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning
Powered by MailChimp
ExGRPO: Learning to Reason from Experience | Papers | HyperAI