Insights from Benchmarking Frontier Language Models on Web App Code Generation
Yi Cui

Abstract
This paper presents insights from evaluating 16 frontier large language models (LLMs) on the WebApp1K benchmark, a test suite designed to assess the ability of LLMs to generate web application code. The results reveal that while all models possess similar underlying knowledge, their performance is differentiated by the frequency of mistakes they make. By analyzing lines of code (LOC) and failure distributions, we find that writing correct code is more complex than generating incorrect code. Furthermore, prompt engineering shows limited efficacy in reducing errors beyond specific cases. These findings suggest that further advancements in coding LLMs should emphasize model reliability and mistake minimization.
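The LOC analysis mentioned in the abstract could be reproduced roughly as follows. This is a minimal sketch: the `results` records, their field names, and the `loc` helper are hypothetical illustrations, not artifacts from the paper.

```python
from statistics import mean

# Hypothetical records: one per generated solution, holding the
# model's output and whether it passed the WebApp1K test suite.
results = [
    {"code": "export default function App() { return null; }", "passed": True},
    {"code": "function App() {", "passed": False},
]

def loc(code: str) -> int:
    """Count non-empty lines in a generated solution."""
    return sum(1 for line in code.splitlines() if line.strip())

# Compare LOC distributions of passing vs. failing solutions.
passing = [loc(r["code"]) for r in results if r["passed"]]
failing = [loc(r["code"]) for r in results if not r["passed"]]
print(f"mean LOC (pass): {mean(passing):.1f}")
print(f"mean LOC (fail): {mean(failing):.1f}")
```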
Benchmarks
| Benchmark | Model | pass@1 |
|---|---|---|
| code-generation-on-webapp1k-react | claude-3.5-sonnet | 0.8808 |
| code-generation-on-webapp1k-react | deepseek-coder-v2-instruct | 0.7002 |
| code-generation-on-webapp1k-react | gpt-4o-2024-08-06 | 0.885 |
| code-generation-on-webapp1k-react | mistral-large-2 | 0.7804 |
| code-generation-on-webapp1k-react | llama-v3p1-405b-instruct | 0.302 |
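The pass@1 scores above are presumably computed with the standard unbiased pass@k estimator of Chen et al. (2021), which with a single sample per task reduces to the fraction of tasks solved. A minimal sketch, assuming that estimator (the function below is illustrative, not code from the paper):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one
    of k completions sampled from n generated (c of them correct)
    passes the task's tests."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# For k=1 this is simply c / n: e.g., solving 885 of 1000 tasks
# on a single attempt yields pass@1 = 0.885, matching the scale
# of the scores reported in the table above.
print(pass_at_k(1000, 885, 1))  # 0.885
```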