Rethinking Reward Models for Multi-Domain Test-Time Scaling

Abstract

The reliability of large language models (LLMs) during test-time scaling is often assessed with external verifiers or reward models that distinguish correct reasoning from flawed logic. Prior work generally assumes that process reward models (PRMs), which score every intermediate reasoning step, outperform outcome reward models (ORMs), which assess only the final answer. This view is based mainly on evidence from narrow, math-adjacent domains. We present the first unified evaluation of four reward model variants, discriminative ORM and PRM (DisORM, DisPRM) and generative ORM and PRM (GenORM, GenPRM), across 14 diverse domains. Contrary to conventional wisdom, we find that (i) DisORM performs on par with DisPRM, (ii) GenPRM is not competitive, and (iii) overall, GenORM is the most robust, yielding significant and consistent gains across every tested domain. We attribute this to PRM-style stepwise scoring, which inherits label noise from LLM auto-labeling and has difficulty evaluating long reasoning trajectories, including those involving self-correcting reasoning. Our theoretical analysis shows that step-wise aggregation compounds errors as reasoning length grows, and our empirical observations confirm this effect. These findings challenge the prevailing assumption that fine-grained supervision is always better and support generative outcome verification for multi-domain deployment. We publicly release our code, datasets, and checkpoints at https://github.com/db-Lee/Multi-RM to facilitate future research in multi-domain settings.
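The abstract's core mechanism, that step-wise aggregation compounds label noise as reasoning length grows, can be illustrated with a toy simulation. This is a minimal sketch under assumed parameters (the noise rate, trial count, and product-style aggregation are illustrative choices, not details from the paper):

```python
import random

def prm_trajectory_score(n_steps, step_label_noise=0.05):
    """Score a fully correct trajectory with a noisy step-wise verifier.

    Every step is truly correct, but (assumed) auto-labeling noise flips
    the verifier's per-step judgment with probability `step_label_noise`.
    PRM-style aggregation multiplies per-step scores, so a single wrong
    step judgment zeroes out the whole trajectory.
    """
    score = 1.0
    for _ in range(n_steps):
        step_judged_correct = random.random() >= step_label_noise
        score *= 1.0 if step_judged_correct else 0.0
    return score

def acceptance_rate(n_steps, trials=10_000):
    """Fraction of correct trajectories the noisy PRM still accepts."""
    return sum(prm_trajectory_score(n_steps) for _ in range(trials)) / trials

# In expectation the acceptance of a correct trajectory is
# (1 - noise)^n_steps, i.e. it decays exponentially with length,
# while an outcome-level verifier pays the noise cost only once.
for n in (1, 10, 50):
    print(f"{n:>2} steps: ~{acceptance_rate(n):.3f}")
```

With 5% per-step noise, the expected acceptance falls from 0.95 at one step to roughly 0.95^50 ≈ 0.08 at fifty steps, matching the abstract's point that fine-grained supervision becomes a liability for long trajectories.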
