HyperAIHyperAI

Command Palette

Search for a command to run...

4 months ago

One Token to Fool LLM-as-a-Judge

Yulai Zhao Haolin Liu Dian Yu S. Y. Kung Haitao Mi Dong Yu

One Token to Fool LLM-as-a-Judge

Abstract

Generative reward models (also known as LLMs-as-judges), which use largelanguage models (LLMs) to evaluate answer quality, are increasingly adopted inreinforcement learning with verifiable rewards (RLVR). They are often preferredover rigid rule-based metrics, especially for complex reasoning tasks involvingfree-form outputs. In this paradigm, an LLM is typically prompted to compare acandidate answer against a ground-truth reference and assign a binary rewardindicating correctness. Despite the seeming simplicity of this comparison task,we find that generative reward models exhibit surprising vulnerabilities tosuperficial manipulations: non-word symbols (e.g., ":" or ".") or reasoningopeners like "Thought process:" and "Let's solve this problem step by step."can often lead to false positive rewards. We demonstrate that this weakness iswidespread across LLMs, datasets, and prompt formats, posing a serious threatfor core algorithmic paradigms that rely on generative reward models, such asrejection sampling, preference optimization, and RLVR. To mitigate this issue,we introduce a simple yet effective data augmentation strategy and train a newgenerative reward model with substantially improved robustness. Our findingshighlight the urgent need for more reliable LLM-based evaluation methods. Werelease our robust, general-domain reward model and its synthetic training dataat https://huggingface.co/sarosavo/Master-RM andhttps://huggingface.co/datasets/sarosavo/Master-RM.

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding
Ready-to-use GPUs
Best Pricing
Get Started

Hyper Newsletters

Subscribe to our latest updates
We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning
Powered by MailChimp
One Token to Fool LLM-as-a-Judge | Papers | HyperAI