8 months ago

Abstract

In this work, we investigate the synergy between supervised fine-tuning (SFT) and reinforcement learning (RL) in developing strong reasoning models. We begin by curating the SFT training data through two scaling strategies: increasing the number of collected prompts and the number of generated responses per prompt. Both approaches yield notable improvements in reasoning performance, with scaling the number of prompts resulting in more substantial gains. We then explore the following questions regarding the synergy between SFT and RL: (i) Does a stronger SFT model consistently lead to better final performance after large-scale RL training? (ii) How can we determine an appropriate sampling temperature during RL training to effectively balance exploration and exploitation for a given SFT initialization? Our findings suggest that (i) holds true, provided effective RL training is conducted, particularly when the sampling temperature is carefully chosen to maintain the temperature-adjusted entropy around 0.3, a setting that strikes a good balance between exploration and exploitation. Notably, the performance gap between initial SFT models narrows significantly throughout the RL process. Leveraging a strong SFT foundation and insights into the synergistic interplay between SFT and RL, our AceReason-Nemotron-1.1 7B model significantly outperforms AceReason-Nemotron-1.0 and achieves new state-of-the-art performance among Qwen2.5-7B-based reasoning models on challenging math and code benchmarks, thereby demonstrating the effectiveness of our post-training recipe. We release the model and data at: https://huggingface.co/nvidia/AceReason-Nemotron-1.1-7B
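The temperature-selection idea in the abstract can be illustrated with a small sketch. The abstract does not say how temperature-adjusted entropy is computed, so the code below assumes it is the mean per-token Shannon entropy (in nats) of the next-token distribution after dividing the logits by the sampling temperature. The function names, the candidate temperatures, and the toy logits are hypothetical and are not taken from the paper or its released code.

```python
import math

def token_entropy(logits, temperature):
    """Shannon entropy (nats) of softmax(logits / temperature) for one token position."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(batch_logits, temperature):
    """Average the per-token entropy over a batch of token positions."""
    return sum(token_entropy(l, temperature) for l in batch_logits) / len(batch_logits)

def pick_temperature(batch_logits, candidates, target=0.3):
    """Return the candidate temperature whose mean entropy is closest to the target.

    The default target of 0.3 follows the value quoted in the abstract; the
    selection rule itself (nearest candidate) is an assumption of this sketch.
    """
    return min(candidates, key=lambda t: abs(mean_entropy(batch_logits, t) - target))

# Toy logits standing in for an SFT model's next-token outputs (made-up values).
toy_logits = [
    [6.0, 2.0, 1.0, 0.5],
    [4.0, 3.5, 0.0, -1.0],
    [8.0, 1.0, 0.0, 0.0],
]
chosen = pick_temperature(toy_logits, candidates=[0.6, 0.8, 1.0, 1.2])
print(chosen, mean_entropy(toy_logits, chosen))
```

In practice the logits would come from rollouts of the SFT checkpoint on training prompts. Because entropy rises with temperature, a sharper (lower-entropy) SFT model would need a higher temperature to reach the same target.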

Zihan Liu Zhuolin Yang Yang Chen Chankyu Lee Mohammad Shoeybi Bryan Catanzaro Wei Ping

AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy
