BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping


Abstract

Reinforcement learning (RL) has recently become the core paradigm for aligning and strengthening large language models (LLMs). Yet, applying RL in off-policy settings--where stale data from past policies are used for training--improves sample efficiency but remains challenging: policy entropy declines sharply, and optimization often becomes unstable and may even collapse. Through theoretical and empirical analysis, we identify two key insights: (i) an imbalance in optimization, where negative-advantage samples dominate the policy gradient, suppressing useful behaviors and risking gradient explosions; and (ii) the derived Entropy-Clip Rule, which reveals that the fixed clipping mechanism in PPO-like objectives systematically blocks entropy-increasing updates, thereby driving the policy toward over-exploitation at the expense of exploration. Building on these insights, we propose BAlanced Policy Optimization with Adaptive Clipping (BAPO), a simple yet effective method that dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. Across diverse off-policy scenarios--including sample replay and partial rollout--BAPO achieves fast, stable, and data-efficient training. On the AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B, while our 32B BAPO model not only achieves state-of-the-art results among models of the same scale but also outperforms leading proprietary systems like o3-mini and Gemini-2.5-Flash-Thinking.
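To make the idea concrete, the sketch below shows a PPO-style clipped surrogate whose clip bounds adapt between steps to re-balance positive- and negative-advantage contributions. The balancing rule, constants, and function names here are illustrative assumptions for exposition, not BAPO's exact algorithm as published.

```python
import numpy as np

def balanced_clipped_surrogate(logp_new, logp_old, advantages,
                               clip_low=0.2, clip_high=0.2,
                               balance_target=0.5, step=0.05):
    """PPO-like clipped surrogate with adaptive, asymmetric clip bounds.

    Illustrative sketch: the adaptation rule and all constants are
    assumptions, not the paper's exact method. Returns the (negated)
    surrogate loss and the updated clip bounds for the next step.
    """
    ratio = np.exp(logp_new - logp_old)                  # importance ratios
    clipped = np.clip(ratio, 1.0 - clip_low, 1.0 + clip_high)
    surrogate = np.minimum(ratio * advantages, clipped * advantages)

    # Measure how much of the objective comes from positive-advantage samples.
    pos = surrogate[advantages > 0].sum()
    neg = -surrogate[advantages < 0].sum()
    pos_share = pos / (pos + neg + 1e-8)

    # Re-balance: if negative-advantage samples dominate, widen the upper
    # bound (letting more entropy-increasing, positive-advantage updates
    # through the clip) and tighten the lower bound to damp negatives.
    if pos_share < balance_target:
        clip_high += step
        clip_low = max(clip_low - step, 0.05)
    return -surrogate.mean(), clip_low, clip_high
```

With a fixed symmetric clip, the negative-advantage terms would keep pushing ratios down and entropy with them; letting the bounds drift apart is one simple way to restore the balance the abstract describes.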
