Junkang Wu, Kexin Huang, Jiancan Wu, An Zhang, Xiang Wang, Xiangnan He

Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between entropy collapse and entropy explosion. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO and DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose Quantile Advantage Estimation (QAE), replacing the mean with a group-wise K-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries (p <= 1 - K) it reinforces rare successes, while on easy queries (p > 1 - K) it targets remaining failures. Under first-order softmax updates, we prove two-sided entropy safety, giving lower and upper bounds on one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned K, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME 2024/2025 and AMC 2023. These results identify baseline design, rather than token-level heuristics, as the primary mechanism for scaling RLVR.
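As a concrete illustration (a minimal sketch of the idea, not the authors' released code), the snippet below swaps the group mean for a group-wise K-quantile baseline on the binary verifiable rewards typical of RLVR. The function names and the inverted-CDF quantile convention are assumptions made for exposition; the paper's exact normalization may differ.

```python
import numpy as np

def k_quantile(rewards, K):
    """Empirical K-quantile under the inverted-CDF convention:
    the smallest reward value whose empirical CDF reaches K."""
    r = np.sort(np.asarray(rewards, dtype=float))
    idx = max(int(np.ceil(K * len(r))) - 1, 0)
    return r[idx]

def qae_advantages(rewards, K=0.8):
    """Group-wise advantages with a K-quantile baseline (QAE sketch).
    `rewards` holds the verifiable (e.g., 0/1) rewards of all responses
    sampled for one query."""
    r = np.asarray(rewards, dtype=float)
    return r - k_quantile(r, K)

# With binary rewards the baseline realizes the two-regime gate described in
# the abstract: if the success rate p <= 1 - K the baseline is 0, so rare
# successes get +1 and failures get 0; if p > 1 - K the baseline is 1, so
# remaining failures get -1 and successes get 0. Most responses therefore
# receive zero advantage, which is the sparsification effect noted above.
print(qae_advantages([0, 0, 0, 0, 1], K=0.8))  # hard query (p = 0.2): [0. 0. 0. 0. 1.]
print(qae_advantages([0, 1, 1, 1, 1], K=0.8))  # easier query (p = 0.8): [-1. 0. 0. 0. 0.]
```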