3 months ago

Group Sequence Policy Optimization

Chujie Zheng Shixuan Liu Mingze Li Xiong-Hui Chen Bowen Yu Chang Gao Kai Dang Yuqiong Liu Rui Men An Yang

Abstract

This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
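
To make the contrast between token-level and sequence-level importance ratios concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the length normalization of the sequence ratio (a geometric mean of the per-token ratios), and the PPO-style clipped surrogate with a hypothetical clip range `eps` are choices made for this example and are not specified in the abstract.

```python
import math


def token_ratios(new_logps, old_logps):
    """Token-level importance ratios: one ratio per generated token."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]


def sequence_ratio(new_logps, old_logps):
    """Sequence-level importance ratio built from the sequence likelihood.

    Assumption for this sketch: the ratio of sequence likelihoods is
    length-normalized, i.e. exp(mean of per-token log-ratios), so that
    long and short responses share a comparable numerical range.
    """
    if not new_logps or len(new_logps) != len(old_logps):
        raise ValueError("log-prob lists must be non-empty and equal length")
    log_ratio = sum(n - o for n, o in zip(new_logps, old_logps))
    return math.exp(log_ratio / len(new_logps))


def clipped_sequence_objective(ratio, advantage, eps):
    """PPO-style clipped surrogate applied once per response.

    The whole response is clipped (or not) as a unit, instead of
    clipping each token separately. `eps` is a hypothetical clip range.
    """
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    # Per-token log-probabilities of one sampled response under the
    # current policy and the old (sampling) policy. Illustrative values.
    new_logps = [-0.5, -1.0]
    old_logps = [-1.0, -1.5]

    print(token_ratios(new_logps, old_logps))    # one ratio per token
    s = sequence_ratio(new_logps, old_logps)     # one ratio per response
    print(s)
    print(clipped_sequence_objective(s, advantage=1.0, eps=0.2))
```

In this sketch every token of a response shares a single ratio and a single clipping decision, which is the sequence-level behaviour the abstract describes. The advantage passed in would come from the reward of the whole response, for example normalized within a group of responses to the same prompt, which is also an assumption here.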

