HyperAI

Abstract

RL with Verifiable Rewards (RLVR) has emerged as a promising paradigm for improving the reasoning abilities of large language models (LLMs). Current methods rely primarily on policy optimization frameworks like PPO and GRPO, which follow generalized policy iteration that alternates between evaluating the current policy's value and improving the policy based on evaluation. While effective, they often suffer from training instability and diversity collapse, requiring complex heuristic tricks and careful tuning. We observe that standard RLVR in math reasoning can be formalized as a specialized finite-horizon Markov Decision Process with deterministic state transitions, tree-structured dynamics, and binary terminal rewards. Though large in scale, the underlying structure is simpler than general-purpose control settings for which popular RL algorithms (e.g., PPO) were developed, suggesting that several sophisticated techniques in existing methods may be reduced or even omitted. Based on this insight, we prove a surprising result: the optimal action can be recovered from the Q-function of a fixed uniformly random policy, thereby bypassing the generalized policy iteration loop and its associated heuristics. We introduce Random Policy Valuation for Diverse Reasoning (ROVER) to translate this principle into a practical and scalable algorithm for LLM math reasoning, a minimalist yet highly effective RL method that samples actions from a softmax over these uniform-policy Q-values. ROVER preserves diversity throughout training, allowing sustained exploration of multiple valid pathways. Across multiple base models and standard math reasoning benchmarks, ROVER demonstrates superior performance in both quality (+8.2 on pass@1, +16.8 on pass@256) and diversity (+17.6%), despite its radical simplification compared to strong, complicated existing methods.
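The central claim of the abstract can be illustrated on a toy problem. The sketch below is not the ROVER implementation, which the abstract describes only at a high level; it is a minimal exact computation on a hand-written tree with deterministic transitions and binary terminal rewards. The tree `TREE`, the helper names, and the `temperature` parameter are all assumptions made for illustration. It computes the Q-values of a uniformly random policy by averaging over children, then shows that acting greedily on those values reaches a correct leaf, and that sampling from a softmax over them still spreads probability across several valid branches.

```python
import math
import random

# Hypothetical toy tree: an internal node is a list of children (one per action),
# a leaf is a binary terminal reward (1 = verified correct, 0 = incorrect).
TREE = [
    [0, [0, 1]],  # action 0: a correct leaf exists, but only deeper in the tree
    [1, 1],       # action 1: every continuation is correct
    [0, 0],       # action 2: no correct leaf is reachable
]

def uniform_value(node):
    """Expected terminal reward when every action is chosen uniformly at random."""
    if isinstance(node, int):
        return float(node)
    return sum(uniform_value(child) for child in node) / len(node)

def uniform_q(node):
    """Q-values of the uniform policy: one value per action available at `node`."""
    return [uniform_value(child) for child in node]

def greedy_rollout(node):
    """Follow the argmax of the uniform-policy Q-values down to a leaf."""
    while not isinstance(node, int):
        q = uniform_q(node)
        node = node[q.index(max(q))]
    return node

def softmax_rollout(node, temperature=1.0, rng=random):
    """Sample each action from a softmax over the uniform-policy Q-values."""
    while not isinstance(node, int):
        q = uniform_q(node)
        weights = [math.exp(v / temperature) for v in q]
        node = rng.choices(node, weights=weights, k=1)[0]
    return node

print(uniform_q(TREE))       # Q-values of the three root actions
print(greedy_rollout(TREE))  # reward reached by the greedy rollout
```

In this setting a uniform-policy Q-value is positive exactly when some correct leaf is reachable through that action, which is why the greedy rollout cannot walk into a dead end whenever a solution exists. The softmax variant trades some of that guarantee for diversity: a lower `temperature` concentrates on high-value branches, a higher one explores more of them.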

Haoran He Yuxiao Ye Qingpeng Cai Chen Hu Binxing Jiao Daxin Jiang Ling Pan

Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards
