4 months ago

SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

Abstract

Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision. Through self-play, SPIRAL generates an infinite curriculum of progressively challenging problems, as models must constantly adapt to stronger opponents. To enable this self-play training at scale, we implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. Using SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. Training Qwen3-4B-Base on Kuhn Poker alone achieves an 8.6% improvement on math and 8.4% on general reasoning, outperforming SFT on 25,000 expert game trajectories. Analysis reveals that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis. Multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) further enhances performance, as each game develops distinct reasoning strengths. Applying SPIRAL to a strong reasoning model (DeepSeek-R1-Distill-Qwen-7B) still yields a 2.0% average improvement. These results demonstrate that zero-sum games naturally develop transferable reasoning capabilities, highlighting a promising direction for autonomous reasoning development.
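The abstract names role-conditioned advantage estimation (RAE) but does not spell out how it is computed. As a rough illustration only, the Python sketch below assumes RAE amounts to subtracting a running baseline that is tracked separately for each (game, role) pair, so that a structural edge for one seat (for example, moving first) does not dominate the learning signal. The class name `RoleConditionedBaseline`, the `decay` parameter, the helper `zero_sum_returns`, and the +1/-1 terminal rewards are all hypothetical choices made for this sketch, not details taken from the paper or the `spiral-rl/spiral` repository.

```python
from collections import defaultdict


class RoleConditionedBaseline:
    """Running per-(game, role) baseline for advantage estimation (illustrative)."""

    def __init__(self, decay: float = 0.95):
        # Hypothetical EMA decay; not a value reported in the abstract.
        self.decay = decay
        # Maps (game, role) -> running estimate of that role's typical return.
        self.baselines = defaultdict(float)

    def advantage(self, game: str, role: int, ret: float) -> float:
        key = (game, role)
        # Update the role-specific baseline with an exponential moving average.
        self.baselines[key] = (
            self.decay * self.baselines[key] + (1.0 - self.decay) * ret
        )
        # The advantage is the return relative to what this role usually earns.
        return ret - self.baselines[key]


def zero_sum_returns(winner: int) -> dict:
    """Terminal returns for a two-player zero-sum game (assumed +1 / -1)."""
    return {winner: 1.0, 1 - winner: -1.0}


if __name__ == "__main__":
    rae = RoleConditionedBaseline(decay=0.5)
    returns = zero_sum_returns(winner=0)
    for role, ret in returns.items():
        print(role, rae.advantage("KuhnPoker", role, ret))
```

In a self-play loop, one shared policy would generate the trajectories for both roles, and each trajectory's policy-gradient update would be weighted by the advantage of the role that produced it. Keeping the baselines separate per role is the design choice this sketch is meant to show; the actual update rule used by SPIRAL may differ.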

Code Repositories

spiral-rl/spiral
Official
Mentioned in GitHub
