Junkang Wu, Kexin Huang, Jiancan Wu, An Zhang, Xiang Wang, Xiangnan He

Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between entropy collapse and entropy explosion. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO and DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose Quantile Advantage Estimation (QAE), replacing the mean with a group-wise K-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries (p <= 1 - K) it reinforces rare successes, while on easy queries (p > 1 - K) it targets remaining failures. Under first-order softmax updates, we prove two-sided entropy safety, giving lower and upper bounds on one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned K, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME 2024/2025 and AMC 2023. These results identify baseline design, rather than token-level heuristics, as the primary mechanism for scaling RLVR.
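As a concrete illustration (a minimal sketch of the idea, not the authors' released code), the snippet below swaps the group mean for a group-wise K-quantile baseline on the binary verifiable rewards typical of RLVR. The function names and the inverted-CDF quantile convention are assumptions made for exposition; the paper's exact normalization may differ.

```python
import numpy as np

def k_quantile(rewards, K):
    """Empirical K-quantile under the inverted-CDF convention:
    the smallest reward value whose empirical CDF reaches K."""
    r = np.sort(np.asarray(rewards, dtype=float))
    idx = max(int(np.ceil(K * len(r))) - 1, 0)
    return r[idx]

def qae_advantages(rewards, K=0.8):
    """Group-wise advantages with a K-quantile baseline (QAE sketch).
    `rewards` holds the verifiable (e.g., 0/1) rewards of all responses
    sampled for one query."""
    r = np.asarray(rewards, dtype=float)
    return r - k_quantile(r, K)

# With binary rewards the baseline realizes the two-regime gate described in
# the abstract: if the success rate p <= 1 - K the baseline is 0, so rare
# successes get +1 and failures get 0; if p > 1 - K the baseline is 1, so
# remaining failures get -1 and successes get 0. Most responses therefore
# receive zero advantage, which is the sparsification effect noted above.
print(qae_advantages([0, 0, 0, 0, 1], K=0.8))  # hard query (p = 0.2): [0. 0. 0. 0. 1.]
print(qae_advantages([0, 1, 1, 1, 1], K=0.8))  # easier query (p = 0.8): [-1. 0. 0. 0. 0.]
```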