Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing


Abstract

Post-training language models (LMs) with reinforcement learning (RL) can enhance their complex reasoning capabilities without supervised fine-tuning, as demonstrated by DeepSeek-R1-Zero. However, effectively utilizing RL for LMs requires significant parallelization to scale up inference, which introduces non-trivial technical challenges (e.g. latency, memory, and reliability) alongside ever-growing financial costs. We present Swarm sAmpling Policy Optimization (SAPO), a fully decentralized and asynchronous RL post-training algorithm. SAPO is designed for decentralized networks of heterogeneous compute nodes, where each node manages its own policy model(s) while "sharing" rollouts with others in the network; no explicit assumptions about latency, model homogeneity, or hardware are required, and nodes can operate in silo if desired. As a result, the algorithm avoids common bottlenecks in scaling RL post-training while also allowing (and even encouraging) new possibilities. By sampling rollouts "shared" across the network, it enables "Aha moments" to propagate, thereby bootstrapping the learning process. In this paper we show SAPO achieved cumulative reward gains of up to 94% in controlled experiments. We also share insights from tests on a network with thousands of nodes contributed by Gensyn community members running the algorithm on diverse hardware and models during an open-source demo.
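The core mechanism the abstract describes — each node generating rollouts with its own policy, publishing them, and training on a mix of local and swarm-sampled rollouts — can be sketched as follows. This is a minimal illustration, not the SAPO implementation: the class and method names (`SwarmNode`, `build_training_batch`), the 50/50 local/shared split, and the in-memory "shared pool" are all assumptions made for exposition.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Rollout:
    """A completed episode: prompt, the policy's completion, and its reward."""
    prompt: str
    completion: str
    reward: float

@dataclass
class SwarmNode:
    """One compute node in the swarm (illustrative sketch, not SAPO's API)."""
    node_id: int
    local_rollouts: list = field(default_factory=list)

    def generate_rollouts(self, prompts, policy):
        # Roll out this node's own policy on its local prompts.
        for p in prompts:
            completion, reward = policy(p)
            self.local_rollouts.append(Rollout(p, completion, reward))

    def share(self):
        # Publish local rollouts to the network; here we simply return
        # them, standing in for whatever transport the real system uses.
        return list(self.local_rollouts)

    def build_training_batch(self, shared_pool, batch_size, local_fraction=0.5):
        # Mix the node's own rollouts with rollouts sampled from the swarm,
        # so high-reward "Aha moments" found elsewhere can propagate into
        # this node's policy update.
        n_local = int(batch_size * local_fraction)
        n_shared = batch_size - n_local
        batch = random.sample(self.local_rollouts,
                              min(n_local, len(self.local_rollouts)))
        batch += random.sample(shared_pool,
                               min(n_shared, len(shared_pool)))
        return batch
```

Because each node only ever samples from a pool of shared rollouts, nothing here assumes the other nodes run the same model, hardware, or schedule — which is what lets the real algorithm stay asynchronous and heterogeneous.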
