8 months ago

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) demonstrates promising potential in advancing the reasoning capabilities of LLMs. However, its success remains largely confined to mathematical and code domains. This primary limitation stems from the heavy reliance on domain-specific verifiers, which results in prohibitive complexity and limited scalability. To address this challenge, our key observation is that an LLM's intrinsic probability of generating a correct free-form answer directly indicates its own evaluation of the reasoning reward (i.e., how well the reasoning process leads to the correct answer). Building on this insight, we propose RLPR, a simple verifier-free framework that extrapolates RLVR to broader general domains. RLPR uses the LLM's own token probability scores for reference answers as the reward signal and maximizes the expected reward during training. We find that addressing the high variance of this noisy probability reward is crucial to making it work, and we propose prob-to-reward and stabilizing methods to ensure a precise and stable reward from LLM intrinsic probabilities. Comprehensive experiments on four general-domain benchmarks and three mathematical benchmarks show that RLPR consistently improves reasoning capabilities in both areas for Gemma-, Llama-, and Qwen-based models. Notably, RLPR outperforms the concurrent VeriFree by 7.6 points on TheoremQA and 7.5 points on Minerva, and even surpasses the strong verifier-model-dependent approach General-Reasoner by 1.6 average points across seven benchmarks.
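To make the idea of a probability-based reward concrete, here is a minimal sketch of how a reward could be derived from the log-probabilities that a model assigns to the tokens of a reference answer, conditioned on its own reasoning. The abstract does not say how token probabilities are aggregated, so the per-token mean used here is an assumption, and the function names and the log-probability values are hypothetical. The sketch also computes the product of token probabilities (the sequence likelihood) for comparison, since a single low-probability token changes the two quantities very differently.

```python
import math
from typing import Sequence


def mean_token_prob_reward(token_logprobs: Sequence[float]) -> float:
    """Average per-token probability of the reference answer.

    Assumption: the mean is one plausible aggregation; the abstract
    does not state which aggregation RLPR uses.
    """
    if not token_logprobs:
        raise ValueError("reference answer must contain at least one token")
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)


def sequence_likelihood_reward(token_logprobs: Sequence[float]) -> float:
    """Product of per-token probabilities, shown only for comparison."""
    if not token_logprobs:
        raise ValueError("reference answer must contain at least one token")
    return math.exp(sum(token_logprobs))


# Made-up log-probabilities of three reference-answer tokens, each
# conditioned on a different sampled reasoning trace (illustrative only).
after_good_reasoning = [math.log(0.90), math.log(0.80), math.log(0.95)]
after_poor_reasoning = [math.log(0.90), math.log(0.05), math.log(0.95)]

for name, logprobs in [("good", after_good_reasoning),
                       ("poor", after_poor_reasoning)]:
    mean_r = mean_token_prob_reward(logprobs)
    seq_r = sequence_likelihood_reward(logprobs)
    print(f"{name}: mean-prob reward={mean_r:.4f}, sequence likelihood={seq_r:.4f}")
```

In this toy setting both quantities rank the good reasoning trace above the poor one, but one unlikely token pulls the sequence likelihood close to zero while the mean changes more gradually. In an actual training loop the log-probabilities would come from a forward pass of the policy model over the reasoning followed by the reference answer; that step is omitted here.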

Tianyu Yu, Bo Ji, Shouli Wang, Shu Yao, Zefan Wang, Ganqu Cui, Lifan Yuan, Ning Ding, Yuan Yao, Zhiyuan Liu, and 2 more

RLPR: Extrapolating RLVR to General Domains without Verifiers
