
Abstract

Generative reward models (also known as LLMs-as-judges), which use large language models (LLMs) to evaluate answer quality, are increasingly adopted in reinforcement learning with verifiable rewards (RLVR). They are often preferred over rigid rule-based metrics, especially for complex reasoning tasks involving free-form outputs. In this paradigm, an LLM is typically prompted to compare a candidate answer against a ground-truth reference and assign a binary reward indicating correctness. Despite the seeming simplicity of this comparison task, we find that generative reward models exhibit surprising vulnerabilities to superficial manipulations: non-word symbols (e.g., ":" or ".") or reasoning openers like "Thought process:" and "Let's solve this problem step by step." can often lead to false positive rewards. We demonstrate that this weakness is widespread across LLMs, datasets, and prompt formats, posing a serious threat for core algorithmic paradigms that rely on generative reward models, such as rejection sampling, preference optimization, and RLVR. To mitigate this issue, we introduce a simple yet effective data augmentation strategy and train a new generative reward model with substantially improved robustness. Our findings highlight the urgent need for more reliable LLM-based evaluation methods. We release our robust, general-domain reward model and its synthetic training data at https://huggingface.co/sarosavo/Master-RM and https://huggingface.co/datasets/sarosavo/Master-RM.
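
As a rough illustration of the failure mode described above, the following Python sketch measures how often a judge assigns a positive reward to content-free responses. It is a minimal sketch under stated assumptions, not the evaluation code used in this work: the `Judge` signature, `false_positive_rates`, and the deliberately flawed `toy_judge` are hypothetical names introduced here, and only the probe strings come from the abstract. Because each probe carries no answer, any positive reward it earns counts as a false positive.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical judge signature (an assumption, not an API from the paper):
# (question, reference, response) -> True if the response is judged correct.
Judge = Callable[[str, str, str], bool]

# Content-free responses quoted in the abstract.
MASTER_KEYS: Tuple[str, ...] = (
    ":",
    ".",
    "Thought process:",
    "Let's solve this problem step by step.",
)


def false_positive_rates(
    judge: Judge,
    items: Sequence[Tuple[str, str]],
    probes: Sequence[str] = MASTER_KEYS,
) -> Dict[str, float]:
    """Return, per probe, the fraction of (question, reference) items
    for which the judge accepts the probe as a correct answer."""
    if not items:
        raise ValueError("items must be non-empty")
    rates: Dict[str, float] = {}
    for probe in probes:
        accepted = sum(1 for question, reference in items
                       if judge(question, reference, probe))
        rates[probe] = accepted / len(items)
    return rates


def toy_judge(question: str, reference: str, response: str) -> bool:
    """Deliberately flawed stand-in for an LLM judge (illustration only).

    It accepts a response that contains the reference, or one that ends
    with a colon and so looks like the start of a reasoning trace.
    """
    text = response.strip().lower()
    return reference.lower() in text or text.endswith(":")


if __name__ == "__main__":
    items = [("What is 2 + 2?", "4"), ("Capital of France?", "Paris")]
    for probe, rate in false_positive_rates(toy_judge, items).items():
        print(f"{probe!r}: {rate:.2f}")
```

In practice `toy_judge` would be replaced by a function that prompts a real generative reward model and parses its binary verdict; the harness itself makes no assumption about how that call is implemented.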
