Rethinking Reward Models for Multi-Domain Test-Time Scaling

Abstract

The reliability of large language models (LLMs) during test-time scaling is often assessed with external verifiers or reward models that distinguish correct reasoning from flawed logic. Prior work generally assumes that process reward models (PRMs), which score every intermediate reasoning step, outperform outcome reward models (ORMs), which assess only the final answer. This view is based mainly on evidence from narrow, math-adjacent domains. We present the first unified evaluation of four reward model variants, discriminative ORM and PRM (DisORM, DisPRM) and generative ORM and PRM (GenORM, GenPRM), across 14 diverse domains. Contrary to conventional wisdom, we find that (i) DisORM performs on par with DisPRM, (ii) GenPRM is not competitive, and (iii) overall, GenORM is the most robust, yielding significant and consistent gains across every tested domain. We attribute this to PRM-style stepwise scoring, which inherits label noise from LLM auto-labeling and has difficulty evaluating long reasoning trajectories, including those involving self-correcting reasoning. Our theoretical analysis shows that step-wise aggregation compounds errors as reasoning length grows, and our empirical observations confirm this effect. These findings challenge the prevailing assumption that fine-grained supervision is always better and support generative outcome verification for multi-domain deployment. We publicly release our code, datasets, and checkpoints at https://github.com/db-Lee/Multi-RM to facilitate future research in multi-domain settings.
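The abstract's core mechanism, that step-wise aggregation compounds label noise as reasoning length grows, can be illustrated with a toy simulation. This is a minimal sketch under assumed parameters (the noise rate, trial count, and product-style aggregation are illustrative choices, not details from the paper):

```python
import random

def prm_trajectory_score(n_steps, step_label_noise=0.05):
    """Score a fully correct trajectory with a noisy step-wise verifier.

    Every step is truly correct, but (assumed) auto-labeling noise flips
    the verifier's per-step judgment with probability `step_label_noise`.
    PRM-style aggregation multiplies per-step scores, so a single wrong
    step judgment zeroes out the whole trajectory.
    """
    score = 1.0
    for _ in range(n_steps):
        step_judged_correct = random.random() >= step_label_noise
        score *= 1.0 if step_judged_correct else 0.0
    return score

def acceptance_rate(n_steps, trials=10_000):
    """Fraction of correct trajectories the noisy PRM still accepts."""
    return sum(prm_trajectory_score(n_steps) for _ in range(trials)) / trials

# In expectation the acceptance of a correct trajectory is
# (1 - noise)^n_steps, i.e. it decays exponentially with length,
# while an outcome-level verifier pays the noise cost only once.
for n in (1, 10, 50):
    print(f"{n:>2} steps: ~{acceptance_rate(n):.3f}")
```

With 5% per-step noise, the expected acceptance falls from 0.95 at one step to roughly 0.95^50 ≈ 0.08 at fifty steps, matching the abstract's point that fine-grained supervision becomes a liability for long trajectories.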
