On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification

Yongliang Wu Yizhou Zhou Zhou Ziheng Yingzhe Peng Xinyu Ye Xinting Hu Wenbo Zhu Lu Qi Ming-Hsuan Yang Xu Yang

Abstract

We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model. To rectify this, we propose Dynamic Fine-Tuning (DFT), which stabilizes the gradient update for each token by dynamically rescaling the objective function with that token's probability. Remarkably, this single-line code change significantly outperforms standard SFT across multiple challenging benchmarks and base models, demonstrating greatly improved generalization. Additionally, our approach shows competitive results in offline RL settings, offering an effective yet simpler alternative. This work bridges theoretical insight and practical solutions, substantially advancing SFT performance. The code will be available at https://github.com/yongliang-wu/DFT.
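The abstract describes the rescaling only at a high level, so the following is a minimal PyTorch sketch of one plausible reading: the per-token cross-entropy loss of standard SFT is weighted by the (detached) probability the model assigns to each target token. The function name `dft_loss`, the tensor shapes, and the use of a stop-gradient on the weight are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F


def dft_loss(logits: torch.Tensor, labels: torch.Tensor, ignore_index: int = -100) -> torch.Tensor:
    """Sketch of a dynamically rescaled SFT objective (assumed form).

    logits: (batch, seq_len, vocab_size) model outputs, already aligned with labels
    labels: (batch, seq_len) target token ids, with ignore_index marking padding
    """
    # Standard per-token negative log-likelihood -- the usual SFT objective.
    log_probs = F.log_softmax(logits, dim=-1)
    nll = F.nll_loss(
        log_probs.view(-1, log_probs.size(-1)),
        labels.view(-1),
        ignore_index=ignore_index,
        reduction="none",
    )

    # Probability the model assigns to each target token, detached so the
    # rescaling acts as a per-token weight rather than an extra gradient path.
    target_prob = torch.exp(-nll).detach()

    # Dynamic rescaling: weight each token's loss by its own probability.
    mask = labels.view(-1) != ignore_index
    weighted = target_prob * nll
    return weighted[mask].mean()
```

Under this reading, the "single-line change" amounts to replacing the plain token-level NLL with `target_prob * nll` inside an otherwise unchanged SFT training loop.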
