ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs

Jiaru Zou Ling Yang Jingwen Gu Jiahao Qiu Ke Shen Jingrui He Mengdi Wang


Abstract

Process Reward Models (PRMs) have recently emerged as a powerful framework for supervising intermediate reasoning steps in large language models (LLMs). Previous PRMs are primarily trained on model final output responses and struggle to evaluate intermediate thinking trajectories robustly, especially in the emerging setting of trajectory-response outputs generated by frontier reasoning models like Deepseek-R1. In this work, we introduce ReasonFlux-PRM, a novel trajectory-aware PRM explicitly designed to evaluate the trajectory-response type of reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. We adapt ReasonFlux-PRM to support reward supervision under both offline and online settings, including (i) selecting high-quality model distillation data for downstream supervised fine-tuning of smaller models, (ii) providing dense process-level rewards for policy optimization during reinforcement learning, and (iii) enabling reward-guided Best-of-N test-time scaling. Empirical results on challenging downstream benchmarks such as AIME, MATH500, and GPQA-Diamond demonstrate that ReasonFlux-PRM-7B selects higher-quality data than strong PRMs (e.g., Qwen2.5-Math-PRM-72B) and human-curated baselines. Furthermore, our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling. We also release our efficient ReasonFlux-PRM-1.5B for resource-constrained applications and edge deployment.

Project: https://github.com/Gen-Verse/ReasonFlux
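As an illustration of the test-time scaling use case described in the abstract, the sketch below shows reward-guided Best-of-N selection with a trajectory-aware PRM. The `score_fn` callable, the `combined_reward` helper, and the `alpha` weighting are assumptions made for illustration only; the actual ReasonFlux-PRM loading and scoring interface is documented in the project repository and may differ.

```python
# Sketch: reward-guided Best-of-N test-time scaling with a trajectory-aware PRM.
# `score_fn` stands in for ReasonFlux-PRM scoring of a (problem, trajectory,
# response) triple; the real interface lives in the ReasonFlux repository.
from typing import Callable, List, Tuple

Candidate = Tuple[str, str]  # (thinking trajectory, final response)


def best_of_n(
    problem: str,
    candidates: List[Candidate],
    score_fn: Callable[[str, str, str], float],
) -> Candidate:
    """Return the trajectory-response pair that the PRM scores highest."""
    scores = [score_fn(problem, traj, resp) for traj, resp in candidates]
    best_idx = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_idx]


def combined_reward(
    step_scores: List[float], trajectory_score: float, alpha: float = 0.5
) -> float:
    """Blend step-level and trajectory-level supervision into one scalar.

    The paper supervises at both granularities; the simple weighted mean used
    here is an illustrative assumption, not the paper's exact formulation.
    """
    step_mean = sum(step_scores) / len(step_scores) if step_scores else 0.0
    return alpha * step_mean + (1.0 - alpha) * trajectory_score
```

The same scoring function can drive the offline use case: score each trajectory-response pair in a distillation corpus and keep only the top-scoring fraction as supervised fine-tuning data.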

Code Repositories

gen-verse/reasonflux (official, PyTorch): https://github.com/Gen-Verse/ReasonFlux