8 months ago

Supervised Fine-Tuning

Method/Architecture

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto

Abstract

Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset, s1K, of 1,000 questions paired with reasoning traces, relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1 exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1 with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1.
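The budget forcing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `END_OF_THINKING` delimiter, the `next_token` callback, the function name, and the toy model are all hypothetical stand-ins chosen for illustration. The sketch assumes a decoder that produces one token at a time and signals the end of its reasoning with a special delimiter; it suppresses that delimiter and appends "Wait" a limited number of times, and it forcefully closes the thinking phase once a token budget is reached.

```python
from typing import Callable, List

# Hypothetical end-of-thinking delimiter; a real model defines its own.
END_OF_THINKING = "<end_think>"


def budget_forced_thinking(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    max_thinking_tokens: int,
    num_waits: int,
) -> List[str]:
    """Return a thinking trace shaped by budget forcing (illustrative only).

    next_token: hypothetical callback mapping the context so far to one token.
    max_thinking_tokens: upper budget; thinking is cut off at this length.
    num_waits: how many times an attempted stop is replaced by "Wait".
    """
    trace: List[str] = []
    waits_left = num_waits
    while len(trace) < max_thinking_tokens:
        token = next_token(prompt + trace)
        if token == END_OF_THINKING:
            if waits_left == 0:
                break  # the model is allowed to stop thinking
            # Lengthen: suppress the stop and nudge the model to continue.
            waits_left -= 1
            trace.append("Wait")
            continue
        trace.append(token)
    # Terminate: close the thinking phase, whether the model stopped on its
    # own or the budget ran out.
    trace.append(END_OF_THINKING)
    return trace


def toy_next_token(context: List[str]) -> str:
    """Toy stand-in for a model: tries to stop after two consecutive steps."""
    return END_OF_THINKING if context[-2:] == ["step", "step"] else "step"


if __name__ == "__main__":
    extended = budget_forced_thinking(toy_next_token, ["question"], 10, 1)
    print(extended)
    truncated = budget_forced_thinking(toy_next_token, ["question"], 1, 0)
    print(truncated)
```

In this toy setting, raising `num_waits` lengthens the trace and lowering `max_thinking_tokens` shortens it, which mirrors the two directions of control the abstract describes. A real setup would replace `toy_next_token` with calls to an actual language model and use that model's own delimiter.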

