3 months ago

A Case Study of Web App Coding with OpenAI Reasoning Models

Yi Cui

Abstract

This paper presents a case study of coding tasks by the latest OpenAI reasoning models, i.e. o1-preview and o1-mini, in comparison with other frontier models. The o1 models deliver SOTA results for WebApp1K, a single-task benchmark. To this end, we introduce WebApp1K-Duo, a harder benchmark that doubles the number of tasks and test cases. On the new benchmark, o1 model performance declines significantly and falls behind Claude 3.5. Moreover, the o1 models consistently fail when confronted with atypical yet correct test cases, a trap that non-reasoning models occasionally avoid. We hypothesize that the performance variability is due to instruction comprehension. Specifically, the reasoning mechanism boosts performance when all expectations are captured but exacerbates errors when key expectations are missed, an effect potentially influenced by input length. As such, we argue that the coding success of reasoning models hinges on a top-notch base model and SFT to ensure meticulous adherence to instructions.

Benchmarks

Benchmark                            Methodology         Metrics
code-generation-on-webapp1k-duo-react  claude-3-5-sonnet   pass@1: 0.679
code-generation-on-webapp1k-duo-react  mistral-large-2     pass@1: 0.449
code-generation-on-webapp1k-duo-react  deepseek-v2.5       pass@1: 0.49
code-generation-on-webapp1k-duo-react  o1-preview          pass@1: 0.652
code-generation-on-webapp1k-duo-react  o1-mini             pass@1: 0.667
code-generation-on-webapp1k-duo-react  gpt-4o-2024-08-06   pass@1: 0.531
code-generation-on-webapp1k-react      deepseek-v2.5       pass@1: 0.834
code-generation-on-webapp1k-react      o1-mini             pass@1: 0.939
code-generation-on-webapp1k-react      o1-preview          pass@1: 0.952
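
The decline described in the abstract can be read directly off the pass@1 figures listed above. The following is a minimal sketch, not the authors' evaluation code: the dictionaries simply copy the listed scores, and the helper names `score_drop` and `ranking` are hypothetical, introduced here only for illustration.

```python
# pass@1 scores copied from the benchmark listing above.
WEBAPP1K = {
    "o1-preview": 0.952,
    "o1-mini": 0.939,
    "deepseek-v2.5": 0.834,
}
WEBAPP1K_DUO = {
    "claude-3-5-sonnet": 0.679,
    "o1-mini": 0.667,
    "o1-preview": 0.652,
    "gpt-4o-2024-08-06": 0.531,
    "deepseek-v2.5": 0.49,
    "mistral-large-2": 0.449,
}


def score_drop(single, duo):
    """Absolute pass@1 drop for models that appear in both listings.

    Hypothetical helper: rounds to three decimals to hide float noise.
    """
    return {m: round(single[m] - duo[m], 3) for m in single if m in duo}


def ranking(scores):
    """Model names ordered from highest to lowest pass@1 (hypothetical helper)."""
    return sorted(scores, key=scores.get, reverse=True)


drops = score_drop(WEBAPP1K, WEBAPP1K_DUO)
duo_ranking = ranking(WEBAPP1K_DUO)

for model, drop in drops.items():
    print(f"{model}: pass@1 drops by {drop:.3f} on WebApp1K-Duo")
print("WebApp1K-Duo ranking:", ", ".join(duo_ranking))
```

Only the three models listed for both benchmarks are compared; the listing gives no WebApp1K score for the others, so no drop is computed for them.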
