Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing
Abstract
Swarm sAmpling Policy Optimization (SAPO) is a decentralized and asynchronous RL algorithm for post-training language models without supervised fine-tuning, achieving significant reward gains and scaling across diverse hardware.
Post-training language models (LMs) with reinforcement learning (RL) can enhance their complex reasoning capabilities without supervised fine-tuning, as demonstrated by DeepSeek-R1-Zero. However, effectively utilizing RL for LMs requires significant parallelization to scale up inference, which introduces non-trivial technical challenges (e.g., latency, memory, and reliability) alongside ever-growing financial costs. We present Swarm sAmpling Policy Optimization (SAPO), a fully decentralized and asynchronous RL post-training algorithm. SAPO is designed for decentralized networks of heterogeneous compute nodes, where each node manages its own policy model(s) while "sharing" rollouts with others in the network; no explicit assumptions about latency, model homogeneity, or hardware are required, and nodes can operate in isolation if desired. As a result, the algorithm avoids common bottlenecks in scaling RL post-training while also allowing (and even encouraging) new possibilities. By sampling rollouts "shared" across the network, it enables "Aha moments" to propagate, thereby bootstrapping the learning process. In this paper, we show that SAPO achieved cumulative reward gains of up to 94% in controlled experiments. We also share insights from tests on a network with thousands of nodes contributed by Gensyn community members, who ran the algorithm on diverse hardware and models during an open-source demo.
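To make the mechanism concrete, here is a minimal sketch of what a single SAPO-style round on one node could look like, based only on the description above. Everything in it (the SharedRollout record, swarm_pool, reward_fn, and the 4-local / 4-external split) is an illustrative assumption rather than the paper's actual implementation; the key point it tries to capture is that only decoded rollouts cross node boundaries, never weights or gradients.

```python
# Minimal, hypothetical sketch of one SAPO-style training round on a single node.
# All names (SharedRollout, swarm_pool, reward_fn, num_local/num_external) are
# illustrative assumptions, not the paper's actual implementation.
import random
from dataclasses import dataclass

@dataclass
class SharedRollout:
    question: str   # task prompt
    answer: str     # decoded completion produced by some node's policy
    node_id: str    # who generated it (models may differ across nodes)

def reward_fn(rollout: SharedRollout) -> float:
    """Placeholder verifiable reward; each node scores rollouts locally."""
    return float(len(rollout.answer) % 2)  # stand-in for a real checker

def generate_local_rollouts(question: str, n: int, node_id: str) -> list[SharedRollout]:
    """Stand-in for sampling n completions from this node's own policy model."""
    return [SharedRollout(question, f"candidate answer {i}", node_id) for i in range(n)]

def sapo_round(swarm_pool: list[SharedRollout], question: str,
               num_local: int = 4, num_external: int = 4) -> list[tuple[SharedRollout, float]]:
    # 1) Roll out locally and share the decoded text with the swarm.
    local = generate_local_rollouts(question, num_local, node_id="me")
    swarm_pool.extend(local)

    # 2) Sample decoded rollouts contributed by other nodes (no weight sync).
    external = [r for r in swarm_pool if r.node_id != "me"]
    sampled = random.sample(external, k=min(num_external, len(external)))

    # 3) Score everything locally and build the training batch; a policy-gradient
    #    update would consume these (rollout, reward) pairs.
    batch = local + sampled
    return [(r, reward_fn(r)) for r in batch]

if __name__ == "__main__":
    pool: list[SharedRollout] = []
    # Seed the pool with a rollout from a (hypothetical) peer node.
    pool.append(SharedRollout("2+2=?", "4", node_id="peer-1"))
    scored = sapo_round(pool, question="2+2=?")
    print(f"training batch of {len(scored)} scored rollouts")
```

In this sketch, a shared rollout is just task text plus a decoded completion, so any node can re-score it with its own reward function and fold it into its own policy update, regardless of which model or hardware the peer used to generate it.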
Community
We introduce SAPO (Swarm sAmpling Policy Optimization) - a decentralised RL post-training method where models share experiences to learn faster, together.
The problem: Scaling RL for LMs is costly and fragile.
Clusters must stay in sync, communication bottlenecks grow, and infrastructure overhead skyrockets.
SAPO flips the model - instead of syncing weights, nodes share decoded rollouts. Lightweight, async, and resilient.
Why it matters:
– No synchronisation overhead
– Works across heterogeneous devices (servers, laptops, anything)
– “Aha moments” on one node propagate through the swarm
– Opens RL post-training to maximum scale
Results:
– Controlled experiments saw up to 94% cumulative reward improvement over the baseline with balanced sharing (4 local / 4 external rollouts per round; see the sketch below this post)
– Thousands of community nodes validated SAPO in a live demo
– Collective training = faster, stronger learning
SAPO shows that sharing experience beats scaling alone.
Decentralised communities of models — and people — can push reasoning further than any single system.
Participate in future research by running an RL Swarm node on your own hardware: https://github.com/gensyn-ai/rl-swarm
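For concreteness, here is a minimal, hypothetical sketch of how the balanced 4-local / 4-external mix mentioned above could feed a group-relative (GRPO-style) advantage computation: rewards from a node's own rollouts and from swarm-shared rollouts are normalized together, so a high-reward "Aha" rollout received from a peer ends up with a large positive advantage. The reward values and the GRPO-style update are illustrative assumptions, not taken from the paper's exact implementation.

```python
# Hedged sketch: group-relative advantages over a mixed 4-local / 4-external batch.
# The reward values below are made up for illustration.
from statistics import mean, pstdev

def group_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: (r - mean) / std over the mixed batch."""
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

# Example: 4 locally generated rollouts plus 4 sampled from the swarm.
local_rewards = [0.0, 1.0, 0.0, 0.0]      # this node's own attempts
external_rewards = [1.0, 1.0, 0.0, 1.0]   # rollouts shared by other nodes
print(group_advantages(local_rewards + external_rewards))
```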
We work harder because we trust the team.
SAPO !
The idea of “sharing rollouts” seems like a simple way to scale without huge infrastructure costs.
we trusted
The possibility of spreading “Aha moments” between nodes is a cool idea that could speed up learning.
Achieving up to 94% reward improvement in controlled settings is impressive.
Innovative approach to decentralized RL post-training.
Gensyn has a solid team, crazy work, LFG
I don't understand too much, but it looks good, let's goo
SAPO is the way to goooo
Don't know at all what's happening here, but I definitely know the team is going to do something great
Solid work team! LFG
We have an enthusiastic + open community (who contributed to this research by helping us scale the experiments in a fully open + collaborative way) - likely not bots but participants.
Agree that unsubstantial comments drown out the interesting discussion though, sadly.
I believe the team
This looks high-tech!
Really exciting work. SAPO's decentralized approach tackles the scalability bottlenecks in RL post-training head-on. The idea of propagating 'aha moments' across heterogeneous nodes feels like a big step toward more open and efficient collective model improvement.
Great team.
Nice job.
Really enjoying the testnet so far! The idea of sharing rollouts across nodes feels very natural, and it’s exciting to see how well SAPO scales with different hardware setups.
It's very exciting to see AIs take off like this, and I'm very happy to see Gensyn achieve this
The team is working very well, keep going
We will find the heart of artificial intelligence with Gensyn. Great work. Thank you.
Great things are happening, the team worked really well.
LFG team
W paper
Wen AMA?
More work, more models, more AI, and always Gensyn
Just finished reading. Honestly, that chart showing Qwen2.5-0.5B with SAPO vs. isolated training is mind-blowing. I helped train it 😎
I believe the team too
Hardworking and successful team.
We support the team and their efforts
A bot army floods the comment section. The comment section is supposed to hold discussion of the paper itself, not anything else.
Commented elsewhere but just a note that this work was done as a huge open collaboration with many participants volunteering their time, effort, and devices.
Our community lives here and is open to anyone to join!
As an AI enthusiast, I find this approach really exciting. The idea of models improving by sharing their experiences across different machines feels like a big step toward more collaborative and scalable AI.
arXiv explained breakdown of this paper 👉 https://arxivexplained.com/papers/sharing-is-caring-efficient-lm-post-training-with-collective-rl-experience-sharing
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- URPO: A Unified Reward & Policy Optimization Framework for Large Language Models (2025)
- Group Expectation Policy Optimization for Heterogeneous Reinforcement Learning (2025)
- Omni-Think: Scaling Cross-Domain Generalization in LLMs via Multi-Task RL with Hybrid Rewards (2025)
- MEML-GRPO: Heterogeneous Multi-Expert Mutual Learning for RLVR Advancement (2025)
- Wisdom of the Crowd: Reinforcement Learning from Coevolutionary Collective Feedback (2025)
- COPO: Consistency-Aware Policy Optimization (2025)
- A Simple "Try Again" Can Elicit Multi-Turn LLM Reasoning (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
I went through this paper and it’s really interesting for Gensyn. The idea of sharing rollouts instead of syncing models fits perfectly with community-driven compute. It’s smart because it lets people with different hardware contribute without much coordination. The demo with thousands of nodes shows it’s working in real life, which is great.
However, trust and privacy are big challenges — the system needs ways to filter bad data and protect sensitive info. Also, it would be good to see more tests across tasks and models. Overall, this approach can help Gensyn scale better if safety and reward structures are handled well.
Really cool to see SAPO in action with real community hardware! Love that it works even when nodes are slow or go offline. Feels like a step toward truly open and resilient RL training. Big fan of the "Aha moments" idea - almost like the network is learning together, not just scaling up. LFG
sapo in action
SAPO is a game-changer for LLMs
In Gensyn's experiments, SAPO improved cumulative rewards by up to 94%. It's like a boost button for learning. They shared cool insights from a big open-source demo where everyone pitched in. This suggests SAPO can handle a huge set of machines, making AI training cheaper and faster.