4 months ago

The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements

Abstract

Rapid advancements in large language models (LLMs) have the potential to assist in scientific progress. A critical capability toward this endeavor is the ability to reproduce existing work. To evaluate the ability of AI agents to reproduce results in an active research area, we introduce the Automated LLM Speedrunning Benchmark, leveraging the research community's contributions on the NanoGPT speedrun, a competition to train a GPT-2 model in the shortest time. Each of the 19 speedrun tasks provides the agent with the previous record's training script, optionally paired with one of three hint formats, ranging from pseudocode to paper-like descriptions of the new record's improvements. Records execute quickly by design and speedrun improvements encompass diverse code-level changes, ranging from high-level algorithmic advancements to hardware-aware optimizations. These features make the benchmark both accessible and realistic for the frontier problem of improving LLM training. We find that recent reasoning LLMs combined with SoTA scaffolds struggle to reimplement already-known innovations in our benchmark, even when given detailed hints. Our benchmark thus provides a simple, non-saturated measure of an LLM's ability to automate scientific reproduction, a necessary (but not sufficient) skill for an autonomous research agent.
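
The task layout described in the abstract (19 tasks, each providing the previous record's training script and optionally one of three hint formats) could be modeled roughly as follows. This is only an illustrative sketch and not the structure used in the official repository: the names SpeedrunTask, HintFormat, build_tasks, the script file names, and the label for the middle hint format are all hypothetical, since the abstract states only that the formats range from pseudocode to paper-like descriptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional

NUM_TASKS = 19  # number of speedrun tasks stated in the abstract


class HintFormat(Enum):
    """Three hint formats; only the two endpoints are named in the abstract."""
    PSEUDOCODE = "pseudocode"
    INTERMEDIATE = "intermediate"  # hypothetical placeholder name
    PAPER_LIKE = "paper_like"


@dataclass(frozen=True)
class SpeedrunTask:
    task_id: int                        # 1..NUM_TASKS
    previous_record_script: str         # hypothetical file name
    hint: Optional[HintFormat] = None   # None means the agent gets no hint


def build_tasks(hint: Optional[HintFormat] = None) -> List[SpeedrunTask]:
    """Build one task per record transition, all sharing the same hint setting."""
    return [
        SpeedrunTask(
            task_id=i,
            previous_record_script=f"record_{i}.py",  # assumed naming scheme
            hint=hint,
        )
        for i in range(1, NUM_TASKS + 1)
    ]


# Every task can be run without a hint or with any one of the three formats.
HINT_SETTINGS = [None, *HintFormat]
```

Under these assumptions, each task has four possible settings (no hint plus three hint formats), so a full sweep would cover 19 x 4 task and hint combinations.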

Code Repositories

facebookresearch/llm-speedrunner
Official
pytorch
Mentioned in GitHub
