Quantile Advantage Estimation for Entropy-Safe Reasoning

Junkang Wu, Kexin Huang, Jiancan Wu, An Zhang, Xiang Wang, Xiangnan He

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between entropy collapse and entropy explosion. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO and DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose Quantile Advantage Estimation (QAE), replacing the mean with a group-wise K-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries (p <= 1 - K) it reinforces rare successes, while on easy queries (p > 1 - K) it targets remaining failures. Under first-order softmax updates, we prove two-sided entropy safety, giving lower and upper bounds on one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned K, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME 2024/2025 and AMC 2023. These results identify baseline design, rather than token-level heuristics, as the primary mechanism for scaling RLVR.
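
The two-regime gate described above falls out directly from taking a K-quantile instead of the mean over each group's rewards. The following is a minimal sketch of that computation, assuming binary verifiable rewards and NumPy's default quantile interpolation; the function name, shapes, and lack of normalization are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def quantile_advantages(rewards: np.ndarray, k: float = 0.8) -> np.ndarray:
    """Group-wise K-quantile baseline for value-free RL (illustrative sketch).

    rewards: shape (G,), verifiable 0/1 rewards for G sampled responses
             to the same query.
    k:       quantile level K in (0, 1); the abstract reports that with a
             tuned K roughly 80% of responses receive zero advantage.
    Returns advantages of shape (G,).
    """
    baseline = np.quantile(rewards, k)  # replaces the group-mean baseline
    return rewards - baseline

# Hard query (success rate p <= 1 - K): the K-quantile of mostly-zero
# rewards is 0, so rare successes get advantage +1 while the failing
# majority gets exactly 0; the update reinforces successes without
# penalizing failures.
hard = np.array([0, 0, 0, 0, 0, 0, 0, 1], dtype=float)  # p = 1/8
print(quantile_advantages(hard, k=0.8))  # [0. 0. 0. 0. 0. 0. 0. 1.]

# Easy query (p > 1 - K): the K-quantile is 1, so successes get 0 and
# only the remaining failures carry a nonzero (negative) advantage.
easy = np.array([1, 1, 1, 1, 1, 1, 0, 1], dtype=float)  # p = 7/8
print(quantile_advantages(easy, k=0.8))  # [0. 0. 0. 0. 0. 0. -1. 0.]
```

Note the sparsity: in both regimes most responses sit at the baseline and receive zero advantage, which is the credit-assignment sparsification the abstract reports.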
