Can LLMs Generate High-Quality Test Cases for Algorithm Problems? TestCase-Eval: A Systematic Evaluation of Fault Coverage and Exposure
Zheyuan Yang, Zexi Kuang, Xue Xia, Yilun Zhao

Abstract
We introduce TestCase-Eval, a new benchmark for the systematic evaluation of LLMs in test-case generation. TestCase-Eval includes 500 algorithm problems and 100,000 human-crafted solutions from the Codeforces platform. It focuses on two pivotal tasks: (1) Fault Coverage, which measures how well LLM-generated test sets probe diverse input scenarios and cover a wide range of potential failure modes, and (2) Fault Exposure, which evaluates whether LLMs can craft a tailored test input that reveals a specific incorrect code implementation. We provide a comprehensive assessment of 19 state-of-the-art open-source and proprietary LLMs on TestCase-Eval, offering insights into their strengths and limitations in generating effective test cases for algorithm problems.
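As a rough illustration of the two metrics (not the paper's actual evaluation code), the sketch below assumes each solution can be run as a callable mapping a test input to an output and compared against a correct reference solution; the helper names `fails`, `fault_coverage`, and `fault_exposure` are hypothetical.

```python
# Minimal sketch, assuming each solution is a callable str -> str and that a
# correct reference solution is available. This is not the authors' pipeline.
from typing import Callable, Iterable


def fails(solution: Callable[[str], str],
          reference: Callable[[str], str],
          test_input: str) -> bool:
    """A test input exposes a faulty solution if its output differs from the reference's."""
    return solution(test_input) != reference(test_input)


def fault_coverage(test_set: Iterable[str],
                   faulty_solutions: Iterable[Callable[[str], str]],
                   reference: Callable[[str], str]) -> float:
    """Fraction of faulty solutions exposed by at least one test in the generated set."""
    solutions = list(faulty_solutions)
    tests = list(test_set)
    exposed = sum(
        any(fails(sol, reference, t) for t in tests)
        for sol in solutions
    )
    return exposed / len(solutions) if solutions else 0.0


def fault_exposure(targeted_test: str,
                   target_solution: Callable[[str], str],
                   reference: Callable[[str], str]) -> bool:
    """Whether a single tailored test input reveals the specific incorrect solution."""
    return fails(target_solution, reference, targeted_test)
```

Under these assumptions, Fault Coverage rewards a test set that makes many distinct faulty solutions fail, while Fault Exposure is a per-solution check on a single targeted input.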