3 months ago

CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward

Abstract

Answer verification is crucial not only for evaluating large language models (LLMs) by matching their unstructured outputs against standard answers, but also for serving as the reward model that guides LLM optimization. Most evaluation frameworks rely on regularized matching or employ general LLMs for answer verification, which demands extensive, repetitive customization of regex rules or evaluation prompts. Two fundamental limitations persist in current methodologies: 1) the absence of comprehensive benchmarks that systematically evaluate verification capabilities across different LLMs; and 2) the nascent stage of verifier development, where existing approaches lack both the robustness to handle complex edge cases and the generalizability across different domains. In this work, we develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types, including multi-subproblems, formulas, and sequence answers, while effectively identifying abnormal/invalid responses. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta error patterns to enhance CompassVerifier. We anticipate that CompassVerifier and VerifierBench will facilitate answer verification, evaluation protocols, and reinforcement learning research. Code and dataset are available at https://github.com/open-compass/CompassVerifier.
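
To make the limitation of rule-based matching concrete, the following is a minimal Python sketch of the kind of regex verifier the abstract describes as needing repetitive customization. It is not the CompassVerifier method and not code from the repository; the patterns and the names `ANSWER_PATTERNS`, `extract_answer`, and `rule_based_verify` are hypothetical and chosen only for illustration.

```python
import re
from typing import Optional

# Hypothetical extraction rules (assumptions, not taken from the paper):
# a LaTeX \boxed{...} span, or a phrase of the form "answer is X".
ANSWER_PATTERNS = [
    re.compile(r"\\boxed\{([^{}]*)\}"),
    re.compile(r"answer is[:\s]*([^\s.,;]+)", re.IGNORECASE),
]


def extract_answer(response: str) -> Optional[str]:
    """Return the last span matched by the first rule that fires, else None."""
    for pattern in ANSWER_PATTERNS:
        matches = pattern.findall(response)
        if matches:
            return matches[-1].strip()
    return None


def rule_based_verify(response: str, gold: str) -> bool:
    """Exact string comparison between the extracted answer and the gold answer."""
    predicted = extract_answer(response)
    return predicted is not None and predicted == gold.strip()
```

Under these assumed rules, "So the answer is 42." is accepted against the gold answer "42", but a response ending in `\boxed{0.5}` is rejected against the gold answer "1/2" even though the two are mathematically equal, and "The answer is 0.5" is truncated to "0" because the second pattern stops at the period. Each such case would need another hand-written rule, which is the customization burden that motivates a learned verifier.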
