Date

9 months ago

Organization

Paper URL

2508.21113

Tags

Artificial Intelligence

Machine Learning

Computer Vision

Bi-mode Policy Optimization (BPO) was jointly proposed by the Tencent Hunyuan Team and the Chinese Academy of Sciences in August 2025. It was published in the paper "R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning".

BPO is a reinforcement learning algorithm designed for auto-thinking. Existing reinforcement learning (RL) methods often require complex reward functions, depend heavily on data, or are sensitive to hyperparameters; BPO instead uses simple, rule-based mathematical rewards. It enforces the inclusion of both thinking and non-thinking responses during training, which prevents the model from becoming biased toward one mode during RL.
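
The idea of pairing both modes with a rule-based reward can be illustrated with a minimal sketch. This is not the paper's implementation: the `<think>` tag format, the `toy_sampler` stand-in for a policy model, the last-number answer check, and the group-normalized advantage are all assumptions made for illustration. It only shows how a rollout group can be built so that thinking and non-thinking responses are always present in equal numbers.

```python
import re
from statistics import mean, pstdev


def rule_based_reward(response: str, reference: str) -> float:
    """Assumed rule: reward 1.0 if the last number in the response matches the reference."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return 1.0 if numbers and numbers[-1] == reference else 0.0


def bi_mode_group(sample_fn, prompt: str, per_mode: int):
    """Collect the same number of rollouts in each mode for one prompt."""
    group = []
    for mode in ("thinking", "non_thinking"):
        for index in range(per_mode):
            group.append((mode, sample_fn(prompt, mode, index)))
    return group


def normalized_advantages(rewards):
    """Group-normalized advantages (an assumed choice, not taken from the source)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


def toy_sampler(prompt: str, mode: str, index: int) -> str:
    """Hypothetical stand-in for a policy model; returns canned responses."""
    if mode == "thinking":
        return "<think>6 * 7 = 42</think> The answer is 42"
    return "The answer is 42" if index == 0 else "The answer is 41"


group = bi_mode_group(toy_sampler, "What is 6 * 7?", per_mode=2)
rewards = [rule_based_reward(response, "42") for _, response in group]
advantages = normalized_advantages(rewards)

for (mode, _), reward, advantage in zip(group, rewards, advantages):
    print(f"{mode:>12}  reward={reward:.1f}  advantage={advantage:+.3f}")
```

Because every group contains both modes by construction, neither mode can disappear from the training signal, even if the policy on its own would favor one of them.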
