Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement

Víctor Gallego


Abstract

Language models (LMs) are susceptible to in-context reward hacking, where they exploit flaws in tainted or faulty written specifications or rubrics to achieve high scores without fulfilling the user's true intent. We introduce Specification Self-Correction (SSC), a novel, test-time framework that enables an LM to identify and correct flaws within its own guiding specification. SSC employs a multi-step inference process where the model first generates a response based on a potentially tainted specification, critiques its output, and then revises the specification itself to remove the exploitable loophole. A final, more robust response is then generated using this self-corrected specification. Across experiments spanning creative writing and agentic coding tasks with several LMs, we demonstrate that while models initially game tainted specifications in 50-70% of cases, the SSC process reduces this vulnerability by over 90%. This dynamic repair occurs at inference time, requires no weight modification, and leads to more robustly aligned model behavior. Code at https://github.com/vicgalle/specification-self-correction
