Can LLMs Generate High-Quality Test Cases for Algorithm Problems? TestCase-Eval: A Systematic Evaluation of Fault Coverage and Exposure

Zheyuan Yang, Zexi Kuang, Xue Xia, Yilun Zhao


Abstract

We introduce TestCase-Eval, a new benchmark for the systematic evaluation of LLMs in test-case generation. TestCase-Eval includes 500 algorithm problems and 100,000 human-crafted solutions from the Codeforces platform. It focuses on two pivotal tasks: (1) Fault Coverage, which measures how well LLM-generated test sets probe diverse input scenarios and cover a wide range of potential failure modes, and (2) Fault Exposure, which evaluates whether LLMs can craft a tailored test input that reveals a specific incorrect code implementation. We provide a comprehensive assessment of 19 state-of-the-art open-source and proprietary LLMs on TestCase-Eval, offering insights into their strengths and limitations in generating effective test cases for algorithm problems.
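
To make the two tasks concrete, the sketch below shows one plausible way such metrics could be computed; it is not the paper's exact protocol. The helper names (run_solution, fault_coverage, fault_exposure) and the pass/fail criterion (a buggy solution's output differing from a reference solution's output) are assumptions for illustration only.

```python
import subprocess
from typing import List, Optional

def run_solution(cmd: List[str], test_input: str, timeout: float = 2.0) -> Optional[str]:
    # Execute one candidate solution on a single test input;
    # return its stdout, or None on a non-zero exit or timeout.
    try:
        proc = subprocess.run(
            cmd, input=test_input, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return None
    return proc.stdout if proc.returncode == 0 else None

def fault_coverage(test_set: List[str], buggy_cmds: List[List[str]],
                   reference_cmd: List[str]) -> float:
    # Fraction of buggy solutions failed by at least one test in the generated
    # test set (a solution "fails" when its output differs from the reference
    # solution's output -- an assumed criterion, not the paper's definition).
    exposed = 0
    for buggy in buggy_cmds:
        for test in test_set:
            expected = run_solution(reference_cmd, test)
            actual = run_solution(buggy, test)
            if expected is not None and actual != expected:
                exposed += 1
                break
    return exposed / len(buggy_cmds) if buggy_cmds else 0.0

def fault_exposure(targeted_test: str, buggy_cmd: List[str],
                   reference_cmd: List[str]) -> bool:
    # True when a single tailored test input distinguishes one specific buggy
    # implementation from the reference solution.
    expected = run_solution(reference_cmd, targeted_test)
    actual = run_solution(buggy_cmd, targeted_test)
    return expected is not None and actual != expected
```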
