4 days ago

Junkeun Yi Damon Mosk-Aoyama Baihe Huang Ritu Gala Charles Wang Sugam Dipak Devare Khushi Bhardwaj Abhibha Gupta Oleksii Kuchaiev Jiantao Jiao

Abstract

Post-training for long-horizon agentic tasks faces a tension between compute efficiency and generalization. Supervised fine-tuning (SFT) is compute efficient but often suffers from out-of-domain (OOD) degradation. Conversely, end-to-end reinforcement learning (E2E RL) preserves OOD capabilities but incurs high compute costs because it requires many turns of on-policy rollout. We introduce PivotRL, a novel framework that operates on existing SFT trajectories to combine the compute efficiency of SFT with the OOD accuracy of E2E RL. PivotRL relies on two key mechanisms. First, it executes local, on-policy rollouts and filters for pivots: informative intermediate turns where sampled actions exhibit high variance in outcomes. Second, it rewards functionally equivalent actions rather than demanding strict string matching with the SFT demonstration. We show theoretically that these mechanisms incentivize strong learning signals with high natural gradient norm, while maximally preserving the policy's probability ordering on actions unrelated to the training tasks. Compared with standard SFT on identical data, PivotRL achieves +4.17% higher in-domain accuracy on average across four agentic domains and +10.04% higher OOD accuracy on non-agentic tasks. Notably, on agentic coding tasks, PivotRL achieves accuracy competitive with E2E RL while using 4x fewer rollout turns. PivotRL has been adopted by NVIDIA's Nemotron-3-Super-120B-A12B, where it acts as the workhorse in production-scale agentic post-training.
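The two mechanisms named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `equivalent_reward` and `select_pivots`, the whitespace and case normalization used as a stand-in for functional equivalence, the population-variance criterion, and the `num_samples` and `min_variance` values are all assumptions made for illustration. The abstract does not specify how equivalence is checked or how the variance threshold is set.

```python
from statistics import pvariance
from typing import Callable, List, Sequence


def equivalent_reward(action: str, reference: str) -> float:
    """Hypothetical stand-in for a functional-equivalence check.

    Two actions count as equivalent if they match after lowercasing and
    collapsing whitespace, which is looser than strict string matching.
    """
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    return 1.0 if normalize(action) == normalize(reference) else 0.0


def select_pivots(
    states: Sequence[str],
    sample_action: Callable[[str], str],
    reward: Callable[[str, str], float],
    num_samples: int = 8,
    min_variance: float = 0.05,
) -> List[int]:
    """Return indices of turns whose sampled-action rewards vary enough.

    For each intermediate turn (state) of an existing trajectory, sample
    several actions from the current policy, score each one, and keep the
    turn only if the rewards have variance >= min_variance (an assumed
    threshold).
    """
    pivots = []
    for index, state in enumerate(states):
        rewards = [reward(state, sample_action(state)) for _ in range(num_samples)]
        if pvariance(rewards) >= min_variance:
            pivots.append(index)
    return pivots


if __name__ == "__main__":
    import itertools

    # Toy data: reference (demonstration) action for each turn.
    references = {"turn0": "ls -la", "turn1": "cat a.txt", "turn2": "rm b"}
    alternating = itertools.cycle(["cat a.txt", "cat b.txt"])

    def toy_policy(state: str) -> str:
        if state == "turn0":
            return "LS  -la"          # always equivalent to the reference
        if state == "turn1":
            return next(alternating)  # right half of the time
        return "echo"                 # never equivalent

    chosen = select_pivots(
        list(references),
        toy_policy,
        lambda state, action: equivalent_reward(action, references[state]),
        num_samples=4,
    )
    print(chosen)
```

In this toy setup only the middle turn has mixed outcomes, so it is the only turn retained. Turns where the policy always succeeds or always fails have zero reward variance and are filtered out.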





PivotRL: High Accuracy Agentic Post-Training at Low Compute Cost
