
Abstract
Two major sources of training data exist for post-training modern language models: online (model-generated rollouts) data, and offline (human or other-model demonstrations) data. These two types of data are typically used by approaches like Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT), respectively. In this paper, we show that these approaches are not in contradiction, but are instances of a single optimization process. We derive a Unified Policy Gradient Estimator, and present the calculations of a wide spectrum of post-training approaches as the gradient of a common objective under different data distribution assumptions and various bias-variance tradeoffs. The gradient estimator is constructed with four interchangeable parts: stabilization mask, reference policy denominator, advantage estimate, and likelihood gradient. Motivated by our theoretical findings, we propose Hybrid Post-Training (HPT), an algorithm that dynamically selects different training signals. HPT is designed to yield both effective exploitation of demonstrations and stable exploration without sacrificing learned reasoning patterns. We provide extensive experiments and ablation studies to verify the effectiveness of our unified theoretical framework and HPT. Across six mathematical reasoning benchmarks and two out-of-distribution suites, HPT consistently surpasses strong baselines across models of varying scales and families.
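
A schematic reading of the four interchangeable parts (an illustration based only on the abstract's wording, not the paper's exact formula; $M_{\text{stab}}$, $\pi_{\text{ref}}$, and $\hat{A}$ are placeholder symbols for the stabilization mask, reference policy denominator, and advantage estimate):

$$\nabla_\theta J(\theta) \;\approx\; \mathbb{E}\!\left[\, M_{\text{stab}} \cdot \frac{\hat{A}}{\pi_{\text{ref}}(y \mid x)} \cdot \nabla_\theta \pi_\theta(y \mid x) \,\right]$$

Under this reading, SFT-style and RL-style updates differ in which data distribution the expectation is taken over and how each factor is instantiated, corresponding to the different bias-variance tradeoffs described above.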