
MAPO: Mixed Advantage Policy Optimization

Abstract

Recent advances in reinforcement learning for foundation models, such as Group Relative Policy Optimization (GRPO), have significantly improved the performance of foundation models on reasoning tasks. Notably, the advantage function serves as a central mechanism in GRPO for ranking trajectory importance. However, existing explorations encounter both advantage reversion and advantage mirror problems, which hinder reasonable advantage allocation across different query samples. In this work, we propose a simple but effective GRPO strategy, Mixed Advantage Policy Optimization (MAPO). We reveal that trajectories appear with different certainty and propose the advantage percent deviation for samples with high-certainty trajectories. Furthermore, we dynamically reweight the advantage function for samples with varying trajectory certainty, thereby adaptively configuring the advantage function to account for sample-specific characteristics. Comparison with related state-of-the-art methods, along with ablation studies on different advantage variants, validates the effectiveness of our approach.
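
The abstract does not give the exact formulas, but the idea can be illustrated with a minimal sketch. The snippet below shows a standard group-relative (GRPO-style) advantage, a hypothetical "percent deviation" variant, and a certainty-based mixing weight; the percent-deviation formula, the certainty heuristic, and all function names are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' released code) of GRPO-style advantage
# computation plus a hypothetical mixed-advantage variant in the spirit of MAPO.
# The percent-deviation formula and the certainty heuristic are assumptions.

import numpy as np

def grpo_advantage(rewards, eps=1e-8):
    """Standard group-relative advantage: z-score of each trajectory's reward
    within the group of rollouts sampled for the same query."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def percent_deviation_advantage(rewards, eps=1e-8):
    """Hypothetical 'advantage percent deviation': deviation of each reward
    from the group mean, scaled by the mean rather than the standard
    deviation (an assumption about the paper's variant)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (np.abs(r.mean()) + eps)

def mixed_advantage(rewards, eps=1e-8):
    """Blend the two advantage forms with a weight driven by trajectory
    certainty. Certainty is approximated here by how concentrated the group's
    rewards are (low spread -> high certainty); this heuristic is illustrative."""
    r = np.asarray(rewards, dtype=np.float64)
    spread = r.std() / (np.abs(r.mean()) + eps)
    certainty = 1.0 / (1.0 + spread)          # in (0, 1], higher = more certain
    a_std = grpo_advantage(r, eps)
    a_pct = percent_deviation_advantage(r, eps)
    return certainty * a_pct + (1.0 - certainty) * a_std

if __name__ == "__main__":
    group_rewards = [0.9, 0.8, 0.85, 0.1]     # rewards of rollouts for one query
    print(mixed_advantage(group_rewards))
```

In this sketch, groups whose rollouts mostly agree (high certainty) lean on the percent-deviation term, while noisier groups fall back to the usual standardized advantage, mirroring the abstract's idea of adapting the advantage function to sample-specific characteristics.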
