On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification

Yongliang Wu Yizhou Zhou Zhou Ziheng Yingzhe Peng Xinyu Ye Xinting Hu Wenbo Zhu Lu Qi Ming-Hsuan Yang Xu Yang

Abstract

We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model. To rectify this, we propose Dynamic Fine-Tuning (DFT), which stabilizes the gradient update for each token by dynamically rescaling the objective function with that token's probability. Remarkably, this single-line code change significantly outperforms standard SFT across multiple challenging benchmarks and base models, demonstrating greatly improved generalization. Additionally, our approach shows competitive results in offline RL settings, offering an effective yet simpler alternative. This work bridges theoretical insight and practical solutions, substantially advancing SFT performance. The code will be available at https://github.com/yongliang-wu/DFT.
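The abstract describes the rescaling only at a high level, so the following is a minimal PyTorch sketch of one plausible reading: the per-token cross-entropy loss of standard SFT is weighted by the (detached) probability the model assigns to each target token. The function name `dft_loss`, the tensor shapes, and the use of a stop-gradient on the weight are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F


def dft_loss(logits: torch.Tensor, labels: torch.Tensor, ignore_index: int = -100) -> torch.Tensor:
    """Sketch of a dynamically rescaled SFT objective (assumed form).

    logits: (batch, seq_len, vocab_size) model outputs, already aligned with labels
    labels: (batch, seq_len) target token ids, with ignore_index marking padding
    """
    # Standard per-token negative log-likelihood -- the usual SFT objective.
    log_probs = F.log_softmax(logits, dim=-1)
    nll = F.nll_loss(
        log_probs.view(-1, log_probs.size(-1)),
        labels.view(-1),
        ignore_index=ignore_index,
        reduction="none",
    )

    # Probability the model assigns to each target token, detached so the
    # rescaling acts as a per-token weight rather than an extra gradient path.
    target_prob = torch.exp(-nll).detach()

    # Dynamic rescaling: weight each token's loss by its own probability.
    mask = labels.view(-1) != ignore_index
    weighted = target_prob * nll
    return weighted[mask].mean()
```

Under this reading, the "single-line change" amounts to replacing the plain token-level NLL with `target_prob * nll` inside an otherwise unchanged SFT training loop.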
