# Paper Index

<Tip warning={true}>

Section under construction. Feel free to contribute!

</Tip>

## Group Sequence Policy Optimization

**📜 Paper**: https://huggingface.co/papers/2507.18071
GSPO is a GRPO variant that computes importance sampling weights at the sequence level rather than per token. To reproduce the paper's setting, use this configuration:
```python
from trl import GRPOConfig

training_args = GRPOConfig(
    importance_sampling_level="sequence",
    loss_type="grpo",
    beta=0.0,  # GSPO sets KL regularization to zero: https://github.com/volcengine/verl/pull/2775#issuecomment-3131807306
    epsilon=3e-4,  # GSPO paper (v2), section 5.1
    epsilon_high=4e-4,  # GSPO paper (v2), section 5.1
    gradient_accumulation_steps=1,
    steps_per_generation=4,  # partition rollout batch into 4 mini-batches. GSPO paper (v2), section 5.1. Must be 4 times gradient_accumulation_steps
)
```