Bi-mode Policy Optimization (BPO)
Bi-mode Policy Optimization (BPO) was jointly proposed by the Tencent Hunyuan team and the Chinese Academy of Sciences in August 2025. It was introduced in the paper "R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning".
BPO is a reinforcement learning (RL) algorithm designed for auto-thinking, where a model decides for itself whether to reason step by step before answering. Unlike existing RL methods that require complex reward functions, depend heavily on data, or are sensitive to hyperparameters, BPO uses simple, rule-based rewards on mathematical problems. It forces both thinking and non-thinking responses to be included during RL training, which prevents the model from becoming biased toward one mode.
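The idea described above can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names (`rule_based_reward`, `bi_mode_rollouts`, `group_advantages`, `toy_generate`), the exact-match reward rule, the `<think>` tag format, and the GRPO-style group normalization are all assumptions made for illustration. The sketch only shows the two ingredients named in the text: a simple rule-based reward, and a rollout group that always contains both modes.

```python
from statistics import mean, pstdev


def rule_based_reward(response: str, reference: str) -> float:
    """Return 1.0 if the last line of the response equals the reference answer.

    Assumption: an exact-match rule stands in for the paper's reward.
    """
    lines = response.strip().splitlines()
    final = lines[-1].strip() if lines else ""
    return 1.0 if final == reference.strip() else 0.0


def bi_mode_rollouts(generate, query: str, n_per_mode: int):
    """Sample the same number of responses in each mode for one query.

    Because both modes are always present in the group, neither mode can
    disappear from training. `generate` is a hypothetical callable.
    """
    rollouts = []
    for mode in ("thinking", "non_thinking"):
        for _ in range(n_per_mode):
            rollouts.append((mode, generate(query, mode)))
    return rollouts


def group_advantages(rewards):
    """Normalize rewards within the group (GRPO-style; an assumption)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def toy_generate(query: str, mode: str) -> str:
    """Stand-in for a policy model, used only to make the sketch runnable."""
    if mode == "thinking":
        return "<think>2 + 2 is 4</think>\n4"
    return "5"


rollouts = bi_mode_rollouts(toy_generate, "What is 2 + 2?", n_per_mode=2)
rewards = [rule_based_reward(text, "4") for _, text in rollouts]
advantages = group_advantages(rewards)
```

In this toy run the thinking responses receive positive advantages and the non-thinking ones negative advantages. On an easy query where both modes answer correctly, all rewards tie and no mode is preferred by this signal alone.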