Instruction-Following Evaluation for Large Language Models

Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, Le Hou

Abstract

One core capability of Large Language Models (LLMs) is to follow natural language instructions. However, the evaluation of such abilities is not standardized: Human evaluations are expensive, slow, and not objectively reproducible, while LLM-based auto-evaluation is potentially biased or limited by the ability of the evaluator LLM. To overcome these issues, we introduce Instruction-Following Eval (IFEval) for large language models. IFEval is a straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". We identified 25 types of those verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions. We show evaluation results of two widely available LLMs on the market. Our code and data can be found at https://github.com/google-research/google-research/tree/master/instruction_following_eval
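For illustration, here is a minimal sketch of how such verifiable instructions can be checked deterministically in Python. The function names and matching rules below are hypothetical simplifications for exposition, not the benchmark's actual implementation:

```python
import re

def check_min_word_count(response: str, min_words: int = 400) -> bool:
    """Check the instruction 'write in more than {min_words} words'."""
    return len(response.split()) > min_words

def check_keyword_count(response: str, keyword: str = "AI", min_count: int = 3) -> bool:
    """Check the instruction 'mention the keyword {keyword} at least {min_count} times'."""
    # Whole-word matching; a real checker may normalize case or punctuation differently.
    return len(re.findall(rf"\b{re.escape(keyword)}\b", response)) >= min_count

if __name__ == "__main__":
    response = "AI is everywhere. AI helps research, and AI writes code."
    print(check_min_word_count(response, min_words=5))  # True: 10 words > 5
    print(check_keyword_count(response))                # True: "AI" appears 3 times
```

Because each check is a deterministic function of the response text, the evaluation is objective and cheaply reproducible, unlike human or LLM-based judging.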

Benchmarks

Benchmark: instruction-following-on-ifeval (accuracy, %)

Metric                        | PaLM 2 S | GPT-4
------------------------------|----------|------
Prompt-level strict-accuracy  |   43.07  | 76.89
Prompt-level loose-accuracy   |   46.95  | 79.30
Inst-level strict-accuracy    |   55.76  | 83.57
Inst-level loose-accuracy     |   59.11  | 85.37
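The table reports two granularities: prompt-level accuracy counts a prompt as correct only if every verifiable instruction in it is followed, while instruction-level accuracy counts each instruction independently. "Strict" verifies the raw response, whereas "loose" applies simple response transformations (such as stripping markdown) before verification to reduce false negatives. A minimal sketch of the two aggregations, assuming per-instruction boolean pass/fail results (illustrative only, not the released evaluation code):

```python
from typing import List, Tuple

def aggregate_accuracies(results: List[List[bool]]) -> Tuple[float, float]:
    """Aggregate per-instruction pass/fail flags into IFEval's two views.

    results[i] holds one boolean per verifiable instruction in prompt i.
    Prompt-level accuracy: fraction of prompts whose instructions ALL pass.
    Inst-level accuracy: fraction of individual instructions that pass.
    """
    prompt_level = sum(all(r) for r in results) / len(results)
    total_instructions = sum(len(r) for r in results)
    inst_level = sum(sum(r) for r in results) / total_instructions
    return prompt_level, inst_level

# Example: 3 prompts containing 2, 1, and 3 verifiable instructions each.
results = [[True, False], [True], [True, True, True]]
print(aggregate_accuracies(results))  # (0.666..., 0.833...)
```

This is why instruction-level scores are always at least as high as prompt-level scores: a prompt with several instructions fails as a whole if any single one is violated.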
