Patrick Haller, Jonas Golde, Alan Akbik

Abstract
Recent advancements in large language models (LLMs) have showcased their exceptional abilities across various tasks, such as code generation, problem-solving, and reasoning. Existing benchmarks evaluate these tasks in isolation, yet the extent to which LLMs can understand prose-style tasks, identify the underlying problems, and then generate appropriate code solutions remains unexplored. Addressing this gap, we introduce PECC, a novel benchmark derived from Advent of Code (AoC) challenges and Project Euler, comprising 2,396 problems. Unlike conventional benchmarks, PECC requires LLMs to interpret narrative-embedded problems, extract requirements, and generate executable code. A key feature of our dataset is the complexity added by natural language prompting in chat-based evaluations, mirroring real-world instruction ambiguities. Results show varying model performance between narrative and neutral problems, with particular difficulty on the math-oriented Euler subset: GPT-3.5-Turbo passes 50% of the AoC challenges but only 8% of the Euler problems. By probing the limits of LLMs' capabilities, our benchmark provides a framework to monitor and assess the subsequent progress of LLMs as universal problem solvers.
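Since PECC scores models by whether their generated code actually executes and produces the correct answer, a minimal execution-based check can illustrate the idea. The `passes` helper below is a hypothetical sketch, not the paper's actual harness, which may differ in details such as sandboxing, timeouts, and I/O conventions:

```python
import os
import subprocess
import tempfile

def passes(candidate_code: str, expected_output: str, timeout: float = 10.0) -> bool:
    """Run a generated Python solution and compare its stdout to the
    expected answer. Illustrative only; PECC's real pipeline may differ."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code)
        path = f.name
    try:
        result = subprocess.run(
            ["python", path], capture_output=True, text=True, timeout=timeout
        )
        return result.returncode == 0 and result.stdout.strip() == expected_output.strip()
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

# A trivial "solution" that prints the expected answer passes:
print(passes("print(42)", "42"))  # True
```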
Benchmarks
| Benchmark | Model | Pass@3 (%) |
|---|---|---|
| code-generation-on-pecc | Claude 3 Haiku | 27.67 |
| code-generation-on-pecc | GPT-3.5 Turbo | 23.75 |
| code-generation-on-pecc | codechat-bison | 11.39 |
| code-generation-on-pecc | chat-bison | 8.48 |
| code-generation-on-pecc | Mixtral-8x7B-Instruct | 8.35 |
| code-generation-on-pecc | Phi-3-mini-128k-instruct | 7.18 |
| code-generation-on-pecc | WizardLM-2-7B | 3.72 |
| code-generation-on-pecc | Llama-3-8B-Instruct | 3.10 |
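Pass@k metrics like the Pass@3 scores above are commonly computed with the unbiased estimator of Chen et al. (2021). Whether PECC uses exactly this estimator, and the sampling counts in the example, are assumptions here; the sketch below only illustrates the standard formula:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    1 - C(n - c, k) / C(n, k), where n is the number of samples
    drawn per problem and c the number that pass all tests."""
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example (illustrative counts): 10 samples per problem, 2 correct
print(round(pass_at_k(n=10, c=2, k=3), 4))  # 0.5333
```

Averaging this per-problem estimate over the benchmark yields the reported Pass@3 percentage.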