4 months ago

RLPR: Extrapolating RLVR to General Domains without Verifiers

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) demonstrates promising potential in advancing the reasoning capabilities of LLMs. However, its success remains largely confined to mathematical and code domains. This primary limitation stems from the heavy reliance on domain-specific verifiers, which results in prohibitive complexity and limited scalability. To address the challenge, our key observation is that an LLM's intrinsic probability of generating a correct free-form answer directly indicates its own evaluation of the reasoning reward (i.e., how well the reasoning process leads to the correct answer). Building on this insight, we propose RLPR, a simple verifier-free framework that extrapolates RLVR to broader general domains. RLPR uses the LLM's own token probability scores for reference answers as the reward signal and maximizes the expected reward during training. We find that addressing the high variance of this noisy probability reward is crucial to making it work, and propose prob-to-reward and stabilizing methods to ensure a precise and stable reward from LLM intrinsic probabilities. Comprehensive experiments on four general-domain benchmarks and three mathematical benchmarks show that RLPR consistently improves reasoning capabilities in both areas for Gemma, Llama, and Qwen based models. Notably, RLPR outperforms the concurrent VeriFree by 7.6 points on TheoremQA and 7.5 points on Minerva, and even surpasses the strong verifier-model-dependent approach General-Reasoner by 1.6 average points across seven benchmarks.
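
To make the idea of a probability-based reward concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the policy model has already scored the reference answer tokens (conditioned on a sampled reasoning chain) and returned one log-probability per token. The function names, the choice of averaging per-token probabilities, and the example numbers are all assumptions for illustration; the abstract does not specify the exact prob-to-reward conversion or the stabilizing methods.

```python
import math
from typing import Sequence


def mean_token_probability(token_logprobs: Sequence[float]) -> float:
    """Average per-token probability of the reference answer.

    Assumption: averaging is one plausible way to aggregate token
    probabilities into a scalar reward in [0, 1].
    """
    if not token_logprobs:
        raise ValueError("reference answer must contain at least one token")
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)


def sequence_likelihood(token_logprobs: Sequence[float]) -> float:
    """Product of per-token probabilities, shown only for comparison.

    A single low-probability token drives this value toward zero, which
    is one intuition for why a raw likelihood can be a noisy reward.
    """
    return math.exp(sum(token_logprobs))


# Hypothetical log-probs of a 4-token reference answer, scored after
# two different sampled reasoning chains (numbers are made up).
good_reasoning = [math.log(p) for p in (0.9, 0.8, 0.95, 0.85)]
poor_reasoning = [math.log(p) for p in (0.3, 0.2, 0.6, 0.1)]

reward_good = mean_token_probability(good_reasoning)
reward_poor = mean_token_probability(poor_reasoning)

print(f"reward (good reasoning): {reward_good:.3f}")
print(f"reward (poor reasoning): {reward_poor:.3f}")
print(f"sequence likelihood (good reasoning): {sequence_likelihood(good_reasoning):.3f}")
```

In this sketch, a reasoning chain that makes the reference answer more likely under the model receives a higher reward, with no external verifier involved. In an actual training loop the log-probabilities would come from a forward pass of the policy model, and the resulting scalar would feed a policy-gradient update.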
