6 months ago

Shudong Liu Hongwei Liu Junnan Liu Linchen Xiao Songyang Gao Chengqi Lyu Yuzhe Gu Wenwei Zhang Derek F. Wong Songyang Zhang

Abstract

Answer verification is crucial both for evaluating large language models (LLMs), by matching their unstructured outputs against standard answers, and for serving as a reward model that guides LLM optimization. Most evaluation frameworks rely on regular-expression matching or employ general LLMs for answer verification, which demands extensive, repetitive customization of regex rules or evaluation prompts. Two fundamental limitations persist in current methodologies: 1) the absence of comprehensive benchmarks that systematically evaluate verification capabilities across different LLMs; and 2) the nascent stage of verifier development, where existing approaches lack both the robustness to handle complex edge cases and the generalizability across different domains. In this work, we develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types, including multi-subproblems, formulas, and sequence answers, while effectively identifying abnormal/invalid responses. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns, to enhance CompassVerifier. We anticipate that CompassVerifier and VerifierBench will facilitate answer verification, evaluation protocols, and reinforcement learning research. Code and dataset are available at https://github.com/open-compass/CompassVerifier.
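To make the customization burden of rule-based matching concrete, the following is a minimal sketch of a regex verifier of the kind the abstract contrasts against. It is not CompassVerifier and is not taken from the paper or its repository: the patterns, the function names (`extract_answer`, `normalize`, `rule_based_verify`), and the three labels (`correct`, `incorrect`, `invalid`) are all assumptions chosen for illustration.

```python
import re
from typing import Optional

# Hypothetical extraction rules: assume the response either wraps its final
# answer in \boxed{...} or ends a line with "answer is X".
ANSWER_PATTERNS = [
    re.compile(r"\\boxed\{([^{}]+)\}"),
    re.compile(r"answer is[:\s]*([^\n]+?)\s*\.?\s*$", re.IGNORECASE | re.MULTILINE),
]


def extract_answer(response: str) -> Optional[str]:
    """Return the last answer matched by the first rule that fires, or None."""
    for pattern in ANSWER_PATTERNS:
        matches = pattern.findall(response)
        if matches:
            return matches[-1].strip()
    return None


def normalize(text: str) -> str:
    """Remove all whitespace and lowercase; no mathematical equivalence is checked."""
    return re.sub(r"\s+", "", text).lower()


def rule_based_verify(response: str, gold: str) -> str:
    """Label a response as 'correct', 'incorrect', or 'invalid' (illustrative labels)."""
    predicted = extract_answer(response)
    if predicted is None:
        # No rule matched, so the response cannot be judged.
        return "invalid"
    return "correct" if normalize(predicted) == normalize(gold) else "incorrect"


if __name__ == "__main__":
    print(rule_based_verify("So the result is \\boxed{42}", "42"))
    print(rule_based_verify("The answer is 1/2.", "0.5"))
    print(rule_based_verify("I am not sure.", "42"))
```

In this sketch the second call is labeled `incorrect` even though 1/2 and 0.5 are equivalent, because the comparison is purely textual. Handling such cases with rules means adding a new pattern or normalization step for each answer format, which is the kind of repetitive customization the abstract describes.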

CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward
