Bi-Mode Policy Optimization (BPO)

Date

2 months ago

Organization

Chinese Academy of Sciences
Tencent

Paper URL

2508.21113

Bi-Mode Policy Optimization (BPO) was jointly proposed by the Tencent Hunyuan Team and the Chinese Academy of Sciences in August 2025. The work is described in the paper "R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning".

BPO is a reinforcement learning (RL) algorithm designed for auto-thinking, in which a model decides for itself whether to reason step by step or to answer directly. Unlike existing RL methods that require complex reward functions, depend heavily on data, or are sensitive to hyperparameters, BPO uses a simple rule-based reward from the math domain. During RL training, the model is forced to generate responses in both thinking and non-thinking modes for each input query, which prevents the policy from becoming biased toward one mode.
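The idea of sampling both modes and scoring them with a rule-based reward can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `Answer:` output convention, the exact-match reward, and the GRPO-style normalization over the combined group are all assumptions made for illustration.

```python
import statistics
from typing import Callable, List, Tuple

MODES = ("thinking", "non_thinking")


def rule_based_reward(response: str, reference: str) -> float:
    """Hypothetical rule: 1.0 if the text after the last 'Answer:' equals the reference."""
    answer = response.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if answer == reference.strip() else 0.0


def bi_mode_rollouts(
    generate: Callable[[str, str], str], query: str, n_per_mode: int
) -> List[Tuple[str, str]]:
    """Sample the same number of responses in each mode for one query."""
    rollouts = []
    for mode in MODES:
        for _ in range(n_per_mode):
            rollouts.append((mode, generate(query, mode)))
    return rollouts


def group_advantages(rewards: List[float]) -> List[float]:
    """Normalize rewards within the combined bi-mode group (an assumption)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All rollouts scored the same, so there is no learning signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


def toy_generate(query: str, mode: str) -> str:
    """Stand-in for a policy model; a real system would prompt it per mode."""
    if mode == "thinking":
        return "<think>2 + 2 = 4</think> Answer: 4"
    return "Answer: 5"


rollouts = bi_mode_rollouts(toy_generate, "What is 2 + 2?", n_per_mode=2)
rewards = [rule_based_reward(resp, "4") for _, resp in rollouts]
advantages = group_advantages(rewards)
```

Because every query contributes rollouts from both modes, neither mode can vanish from the training signal, whichever one currently earns the higher reward.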
