4 months ago

s1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto

Abstract

Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1 exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1 with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1.
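
The budget forcing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `StepFn` interface, the `END_OF_THINKING` delimiter, the `max_waits` cap, and the `toy_step` stand-in model are all assumptions made for illustration. The sketch only shows the two interventions the abstract names: cutting the thinking off once a maximum budget is reached, and appending "Wait" when the model tries to stop before a minimum budget is spent.

```python
from typing import Callable, Tuple

# Hypothetical decoding interface (an assumption, not from the paper):
# given the text so far and a token cap, return
# (continuation, tokens_used, stopped_naturally).
StepFn = Callable[[str, int], Tuple[str, int, bool]]

END_OF_THINKING = "<end_of_thinking>"  # assumed delimiter; model-specific in practice


def budget_forced_thinking(step: StepFn, prompt: str, min_tokens: int,
                           max_tokens: int, max_waits: int = 2,
                           wait_text: str = "Wait") -> str:
    """Return a thinking trace closed with END_OF_THINKING."""
    trace, used, waits = "", 0, 0
    while used < max_tokens:
        chunk, n, stopped = step(prompt + trace, max_tokens - used)
        trace += chunk
        used += n
        if not stopped:
            # Budget ran out mid-thought: terminate the thinking forcefully.
            break
        if used >= min_tokens or waits >= max_waits:
            # The model ended on its own and we accept the stop.
            break
        # The model tried to stop early: suppress the stop and append "Wait".
        # (For simplicity the appended text is not counted against the budget.)
        trace += wait_text
        waits += 1
    return trace + END_OF_THINKING


def toy_step(text: str, cap: int) -> Tuple[str, int, bool]:
    """Stand-in model: emits up to 3 tokens, stops naturally if it had room."""
    n = min(3, cap)
    return " step" * n, n, cap >= 3


# Lengthening: the toy model stops after 3 tokens, but 6 are required.
lengthened = budget_forced_thinking(toy_step, "Q:", min_tokens=6, max_tokens=100)
# Truncation: only 2 tokens are allowed, so thinking is cut off.
truncated = budget_forced_thinking(toy_step, "Q:", min_tokens=0, max_tokens=2)
```

In a real setup, `step` would wrap a language model's decoding call, and the answer would be generated after the closing delimiter. The cap on the number of "Wait" insertions is a design choice in this sketch that guarantees termination.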

Code Repositories

simplescaling/s1 (official; PyTorch; mentioned in GitHub)
huggingface/open-r1 (PyTorch; mentioned in GitHub)

Benchmarks

Benchmark | Methodology | Metrics
mathematical-reasoning-on-aime24 | s1-32B | Acc: 56.7
