ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs

Jiaru Zou Ling Yang Jingwen Gu Jiahao Qiu Ke Shen Jingrui He Mengdi Wang


Abstract

Process Reward Models (PRMs) have recently emerged as a powerful framework for supervising intermediate reasoning steps in large language models (LLMs). Previous PRMs are primarily trained on model final output responses and struggle to evaluate intermediate thinking trajectories robustly, especially in the emerging setting of trajectory-response outputs generated by frontier reasoning models like Deepseek-R1. In this work, we introduce ReasonFlux-PRM, a novel trajectory-aware PRM explicitly designed to evaluate the trajectory-response type of reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. We adapt ReasonFlux-PRM to support reward supervision under both offline and online settings, including (i) selecting high-quality model distillation data for downstream supervised fine-tuning of smaller models, (ii) providing dense process-level rewards for policy optimization during reinforcement learning, and (iii) enabling reward-guided Best-of-N test-time scaling. Empirical results on challenging downstream benchmarks such as AIME, MATH500, and GPQA-Diamond demonstrate that ReasonFlux-PRM-7B selects higher-quality data than strong PRMs (e.g., Qwen2.5-Math-PRM-72B) and human-curated baselines. Furthermore, our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling. We also release our efficient ReasonFlux-PRM-1.5B for resource-constrained applications and edge deployment.

Project: https://github.com/Gen-Verse/ReasonFlux
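As an illustration of the test-time scaling use case described in the abstract, the sketch below shows reward-guided Best-of-N selection with a trajectory-aware PRM. The `score_fn` callable, the `combined_reward` helper, and the `alpha` weighting are assumptions made for illustration only; the actual ReasonFlux-PRM loading and scoring interface is documented in the project repository and may differ.

```python
# Sketch: reward-guided Best-of-N test-time scaling with a trajectory-aware PRM.
# `score_fn` stands in for ReasonFlux-PRM scoring of a (problem, trajectory,
# response) triple; the real interface lives in the ReasonFlux repository.
from typing import Callable, List, Tuple

Candidate = Tuple[str, str]  # (thinking trajectory, final response)


def best_of_n(
    problem: str,
    candidates: List[Candidate],
    score_fn: Callable[[str, str, str], float],
) -> Candidate:
    """Return the trajectory-response pair that the PRM scores highest."""
    scores = [score_fn(problem, traj, resp) for traj, resp in candidates]
    best_idx = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_idx]


def combined_reward(
    step_scores: List[float], trajectory_score: float, alpha: float = 0.5
) -> float:
    """Blend step-level and trajectory-level supervision into one scalar.

    The paper supervises at both granularities; the simple weighted mean used
    here is an illustrative assumption, not the paper's exact formulation.
    """
    step_mean = sum(step_scores) / len(step_scores) if step_scores else 0.0
    return alpha * step_mean + (1.0 - alpha) * trajectory_score
```

The same scoring function can drive the offline use case: score each trajectory-response pair in a distillation corpus and keep only the top-scoring fraction as supervised fine-tuning data.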

Code Repositories

gen-verse/reasonflux (official, PyTorch): https://github.com/Gen-Verse/ReasonFlux