5 months ago

AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy

Zihan Liu, Zhuolin Yang, Yang Chen, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping

Abstract

In this work, we investigate the synergy between supervised fine-tuning (SFT) and reinforcement learning (RL) in developing strong reasoning models. We begin by curating the SFT training data through two scaling strategies: increasing the number of collected prompts and the number of generated responses per prompt. Both approaches yield notable improvements in reasoning performance, with scaling the number of prompts resulting in more substantial gains. We then explore the following questions regarding the synergy between SFT and RL: (i) Does a stronger SFT model consistently lead to better final performance after large-scale RL training? (ii) How can we determine an appropriate sampling temperature during RL training to effectively balance exploration and exploitation for a given SFT initialization? Our findings suggest that (i) holds true, provided effective RL training is conducted, particularly when the sampling temperature is carefully chosen to maintain the temperature-adjusted entropy around 0.3, a setting that strikes a good balance between exploration and exploitation. Notably, the performance gap between initial SFT models narrows significantly throughout the RL process. Leveraging a strong SFT foundation and insights into the synergistic interplay between SFT and RL, our AceReason-Nemotron-1.1 7B model significantly outperforms AceReason-Nemotron-1.0 and achieves new state-of-the-art performance among Qwen2.5-7B-based reasoning models on challenging math and code benchmarks, thereby demonstrating the effectiveness of our post-training recipe. We release the model and data at: https://huggingface.co/nvidia/AceReason-Nemotron-1.1-7B
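
To make the temperature-selection idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that "temperature-adjusted entropy" means the mean Shannon entropy (in nats) of the token distribution softmax(logits / T); the abstract does not spell out the definition. The function names, the toy logits, and the bisection search are all illustrative choices. Since the entropy of a softmax distribution does not decrease as T grows, a simple bisection can locate a temperature whose entropy is close to the 0.3 target.

```python
import math

def softmax(logits, temperature):
    # Numerically stable softmax over temperature-scaled logits.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; zero-probability terms contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(batch_logits, temperature):
    # Assumed definition: average per-token entropy at the given temperature.
    values = [entropy(softmax(row, temperature)) for row in batch_logits]
    return sum(values) / len(values)

def find_temperature(batch_logits, target=0.3, lo=0.05, hi=5.0, iters=60):
    # Bisection on T, relying on entropy being non-decreasing in T.
    # The search bounds lo and hi are hypothetical and must bracket the target.
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if mean_entropy(batch_logits, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

if __name__ == "__main__":
    # Toy next-token logits for two positions over a 4-token vocabulary.
    toy_logits = [[2.0, 1.0, 0.5, -1.0], [3.0, 0.0, -1.0, -2.0]]
    t = find_temperature(toy_logits)
    print("temperature:", t, "entropy:", mean_entropy(toy_logits, t))
```

In a real RL setup the logits would come from the SFT model's rollouts rather than a toy list, and a sharper (lower-entropy) SFT initialization would generally need a higher temperature to reach the same target.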
