FlowRL: Matching Reward Distributions for LLM Reasoning

Abstract

We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
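
The abstract describes turning scalar rewards into a normalized target distribution with a learnable partition function and matching the policy to it via flow-balanced optimization. Below is a minimal sketch of one way such a flow-balance-style loss could look, assuming a target of the form p*(y|x) ∝ exp(beta · r(x, y)); the names (LogPartitionHead, flow_balance_loss), the temperature beta, and the squared-residual form are illustrative assumptions, not the paper's exact objective.

```python
# Minimal sketch (not the official FlowRL implementation): a flow-balance-style
# loss that pushes the policy toward the reward-derived target distribution
# p*(y|x) proportional to exp(beta * r(x, y)), using a learnable log-partition
# estimate log Z_phi(x).

import torch
import torch.nn as nn


class LogPartitionHead(nn.Module):
    """Predicts log Z_phi(x) from a pooled prompt representation (assumed interface)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)

    def forward(self, prompt_repr: torch.Tensor) -> torch.Tensor:
        return self.proj(prompt_repr).squeeze(-1)  # shape: (batch,)


def flow_balance_loss(
    policy_logprobs: torch.Tensor,  # summed token log-probs per sampled trajectory, (batch,)
    rewards: torch.Tensor,          # scalar reward per trajectory, (batch,)
    log_z: torch.Tensor,            # learnable log-partition estimate per prompt, (batch,)
    beta: float = 1.0,              # reward temperature (assumed hyperparameter)
) -> torch.Tensor:
    # At the optimum, log_z + log pi(y|x) = beta * r(x, y) for every trajectory,
    # i.e. the policy matches the normalized target rather than only its mode.
    residual = log_z + policy_logprobs - beta * rewards
    return (residual ** 2).mean()


if __name__ == "__main__":
    batch, hidden = 4, 16
    head = LogPartitionHead(hidden)
    prompt_repr = torch.randn(batch, hidden)       # stand-in for a pooled prompt encoding
    policy_logprobs = torch.randn(batch)           # stand-in for summed log pi_theta(y|x)
    rewards = torch.rand(batch)                    # stand-in for verifier / rule-based rewards
    loss = flow_balance_loss(policy_logprobs, rewards, head(prompt_repr))
    loss.backward()
    print(f"flow-balance loss: {loss.item():.4f}")
```

When the residual is driven to zero, the policy assigns probability proportional to exp(beta · r(x, y)), which is what lets lower-probability but still valid reasoning paths retain mass instead of being collapsed onto the single highest-reward trajectory.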
