Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing
Abstract
Swarm sAmpling Policy Optimization (SAPO) is a decentralized and asynchronous RL algorithm for post-training language models without supervised fine-tuning, achieving significant reward gains and scaling across diverse hardware.
Post-training language models (LMs) with reinforcement learning (RL) can enhance their complex reasoning capabilities without supervised fine-tuning, as demonstrated by DeepSeek-R1-Zero. However, effectively utilizing RL for LMs requires significant parallelization to scale up inference, which introduces non-trivial technical challenges (e.g., latency, memory, and reliability) alongside ever-growing financial costs. We present Swarm sAmpling Policy Optimization (SAPO), a fully decentralized and asynchronous RL post-training algorithm. SAPO is designed for decentralized networks of heterogeneous compute nodes, where each node manages its own policy model(s) while "sharing" rollouts with others in the network; no explicit assumptions about latency, model homogeneity, or hardware are required, and nodes can operate in isolation if desired. As a result, the algorithm avoids common bottlenecks in scaling RL post-training while also allowing (and even encouraging) new possibilities. By sampling rollouts "shared" across the network, it enables "Aha moments" to propagate, thereby bootstrapping the learning process. In this paper, we show that SAPO achieved cumulative reward gains of up to 94% in controlled experiments. We also share insights from tests on a network with thousands of nodes contributed by Gensyn community members, who ran the algorithm on diverse hardware and models during an open-source demo.
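To make the mechanism concrete, here is a minimal sketch of what a single SAPO-style round on one node could look like, based only on the description above. Everything in it (the SharedRollout record, swarm_pool, reward_fn, and the 4-local / 4-external split) is an illustrative assumption rather than the paper's actual implementation; the key point it tries to capture is that only decoded rollouts cross node boundaries, never weights or gradients.

```python
# Minimal, hypothetical sketch of one SAPO-style training round on a single node.
# All names (SharedRollout, swarm_pool, reward_fn, num_local/num_external) are
# illustrative assumptions, not the paper's actual implementation.
import random
from dataclasses import dataclass

@dataclass
class SharedRollout:
    question: str   # task prompt
    answer: str     # decoded completion produced by some node's policy
    node_id: str    # who generated it (models may differ across nodes)

def reward_fn(rollout: SharedRollout) -> float:
    """Placeholder verifiable reward; each node scores rollouts locally."""
    return float(len(rollout.answer) % 2)  # stand-in for a real checker

def generate_local_rollouts(question: str, n: int, node_id: str) -> list[SharedRollout]:
    """Stand-in for sampling n completions from this node's own policy model."""
    return [SharedRollout(question, f"candidate answer {i}", node_id) for i in range(n)]

def sapo_round(swarm_pool: list[SharedRollout], question: str,
               num_local: int = 4, num_external: int = 4) -> list[tuple[SharedRollout, float]]:
    # 1) Roll out locally and share the decoded text with the swarm.
    local = generate_local_rollouts(question, num_local, node_id="me")
    swarm_pool.extend(local)

    # 2) Sample decoded rollouts contributed by other nodes (no weight sync).
    external = [r for r in swarm_pool if r.node_id != "me"]
    sampled = random.sample(external, k=min(num_external, len(external)))

    # 3) Score everything locally and build the training batch; a policy-gradient
    #    update would consume these (rollout, reward) pairs.
    batch = local + sampled
    return [(r, reward_fn(r)) for r in batch]

if __name__ == "__main__":
    pool: list[SharedRollout] = []
    # Seed the pool with a rollout from a (hypothetical) peer node.
    pool.append(SharedRollout("2+2=?", "4", node_id="peer-1"))
    scored = sapo_round(pool, question="2+2=?")
    print(f"training batch of {len(scored)} scored rollouts")
```

In this sketch, a shared rollout is just task text plus a decoded completion, so any node can re-score it with its own reward function and fold it into its own policy update, regardless of which model or hardware the peer used to generate it.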
Community
We introduce SAPO (Swarm sAmpling Policy Optimization) - a decentralised RL post-training method where models share experiences to learn faster, together.
The problem: Scaling RL for LMs is costly and fragile.
Clusters must stay in sync, communication bottlenecks grow, and infrastructure overhead skyrockets.
SAPO flips the model - instead of syncing weights, nodes share decoded rollouts. Lightweight, async, and resilient.
Why it matters:
– No synchronisation overhead
– Works across heterogeneous devices (servers, laptops, anything)
– “Aha moments” on one node propagate through the swarm
– Opens RL post-training to maximum scale
Results:
– Controlled experiments saw up to 94% cumulative reward improvement over the baseline with balanced sharing (4 local / 4 external rollouts per round; see the sketch below this post)
– Thousands of community nodes validated SAPO in a live demo
– Collective training = faster, stronger learning
SAPO shows that sharing experience beats scaling alone.
Decentralised communities of models — and people — can push reasoning further than any single system.
Participate in future research by running an RL Swarm node on your own hardware: https://github.com/gensyn-ai/rl-swarm
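For concreteness, here is a minimal, hypothetical sketch of how the balanced 4-local / 4-external mix mentioned above could feed a group-relative (GRPO-style) advantage computation: rewards from a node's own rollouts and from swarm-shared rollouts are normalized together, so a high-reward "Aha" rollout received from a peer ends up with a large positive advantage. The reward values and the GRPO-style update are illustrative assumptions, not taken from the paper's exact implementation.

```python
# Hedged sketch: group-relative advantages over a mixed 4-local / 4-external batch.
# The reward values below are made up for illustration.
from statistics import mean, pstdev

def group_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: (r - mean) / std over the mixed batch."""
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

# Example: 4 locally generated rollouts plus 4 sampled from the swarm.
local_rewards = [0.0, 1.0, 0.0, 0.0]      # this node's own attempts
external_rewards = [1.0, 1.0, 0.0, 1.0]   # rollouts shared by other nodes
print(group_advantages(local_rewards + external_rewards))
```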
We work harder because we trust the team.
SAPO !
The idea of “sharing rollouts” seems like a simple way to scale without huge infrastructure costs.
we trusted
The possibility of spreading “Aha moments” between nodes is a cool idea that could speed up learning.
Achieving up to 94% reward improvement in controlled settings is impressive.
Innovative approach to decentralized RL post-training.
Gensyn has a solid team, crazy work, LFG
I don't understand too much, but it looks good, let's goo
SAPO is the way to goooo
Don't know at all what's happening here, but I definitely know the team is going to do something great
Solid work team! LFG
We have an enthusiastic + open community (who contributed to this research by helping us scale the experiments in a fully open + collaborative way) - likely not bots but participants.
Agree that unsubstantial comments drown out the interesting discussion though, sadly.
I believe the team
This looks high-tech!
Really exciting work. SAPO's decentralized approach tackles the scalability bottlenecks in RL post-training head-on. The idea of propagating 'aha moments' across heterogeneous nodes feels like a big step toward more open and efficient collective model improvement.
Great team.
Nice job.
Really enjoying the testnet so far! The idea of sharing rollouts across nodes feels very natural, and it’s exciting to see how well SAPO scales with different hardware setups.
It's very exciting to see AIs take off like this, and I'm very happy to see Gensyn achieve this
The team is working very well, keep going
We will find the heart of artificial intelligence with Gensyn. Great work. Thank you.
Great things are happening, the team worked really well.
LFG team
W paper
Wen AMA?
More work, more models, more AI, and always Gensyn
Just finished reading. Honestly, that chart showing Qwen2.5-0.5B with SAPO vs. isolated training is mind-blowing. I helped train it 😎
I believe the team too
Hardworking and successful team.
We support the team and their efforts
A bot army floods the comment section. The comment section is supposed to hold discussion of the paper itself, not anything else.
Commented elsewhere but just a note that this work was done as a huge open collaboration with many participants volunteering their time, effort, and devices.
Our community lives here and is open to anyone to join!
As an AI enthusiast, I find this approach really exciting. The idea of models improving by sharing their experiences across different machines feels like a big step toward more collaborative and scalable AI.
arXiv explained breakdown of this paper 👉 https://arxivexplained.com/papers/sharing-is-caring-efficient-lm-post-training-with-collective-rl-experience-sharing
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- URPO: A Unified Reward & Policy Optimization Framework for Large Language Models (2025)
- Group Expectation Policy Optimization for Heterogeneous Reinforcement Learning (2025)
- Omni-Think: Scaling Cross-Domain Generalization in LLMs via Multi-Task RL with Hybrid Rewards (2025)
- MEML-GRPO: Heterogeneous Multi-Expert Mutual Learning for RLVR Advancement (2025)
- Wisdom of the Crowd: Reinforcement Learning from Coevolutionary Collective Feedback (2025)
- COPO: Consistency-Aware Policy Optimization (2025)
- A Simple "Try Again" Can Elicit Multi-Turn LLM Reasoning (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
I went through this paper and it’s really interesting for Gensyn. The idea of sharing rollouts instead of syncing models fits perfectly with community-driven compute. It’s smart because it lets people with different hardware contribute without much coordination. The demo with thousands of nodes shows it’s working in real life, which is great.
However, trust and privacy are big challenges — the system needs ways to filter bad data and protect sensitive info. Also, it would be good to see more tests across tasks and models. Overall, this approach can help Gensyn scale better if safety and reward structures are handled well.
Really cool to see SAPO in action with real community hardware! Love that it works even when nodes are slow or go offline. Feels like a step toward truly open and resilient RL training. Big fan of the "Aha moments" idea - almost like the network is learning together, not just scaling up. LFG
sapo in action
SAPO is a game-changer for LLMs
In Gensyn's experiments, SAPO improved cumulative rewards by up to 94%. It's like a boost button for learning. They shared cool insights from a big open-source demo where everyone pitched in. This suggests SAPO can handle a huge set of machines, making AI training cheaper and faster.