About exploration collapse
#6, opened by fuyikun
In your introduction, you mentioned: “Furthermore, to prevent exploration collapse observed in RL training, we reshaped the advantage distribution based on pass rates: amplifying the advantage scale of highly exploratory groups while reducing that of low-exploration ones.” I’m very interested in this part and would like to learn more about how exactly you reshaped the advantage distribution based on pass rates. Could you provide more details about the underlying method or implementation?
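To make the question concrete, here is one possible reading of the quoted sentence, written as a minimal sketch. It is a guess and not the authors' method: it assumes GRPO-style group-normalized advantages over binary rewards, and it assumes that "exploration" can be proxied by the Bernoulli variance of the group's pass rate. The function names, the `4 * p * (1 - p)` weight, and the `min_scale`/`max_scale` bounds are all hypothetical.

```python
from statistics import mean, pstdev


def pass_rate_scale(p, min_scale=0.5, max_scale=1.5):
    """Hypothetical scale factor derived from a group's pass rate p.

    Assumption: groups with mixed outcomes (p near 0.5) count as
    "highly exploratory" and groups near p = 0 or p = 1 as "low exploration".
    """
    exploration = 4.0 * p * (1.0 - p)  # 0 at p in {0, 1}, 1 at p = 0.5
    return min_scale + (max_scale - min_scale) * exploration


def reshaped_advantages(rewards, min_scale=0.5, max_scale=1.5, eps=1e-6):
    """Group-normalized advantages rescaled by the pass-rate-based factor.

    rewards: per-rollout rewards for ONE prompt (e.g. 1 = pass, 0 = fail).
    """
    p = mean(rewards)        # pass rate of the group
    sigma = pstdev(rewards)  # population standard deviation of the rewards
    base = [(r - p) / (sigma + eps) for r in rewards]
    scale = pass_rate_scale(p, min_scale, max_scale)
    return [scale * a for a in base]


balanced = reshaped_advantages([1, 0, 1, 0])  # pass rate 0.5, amplified
skewed = reshaped_advantages([1, 1, 1, 0])    # pass rate 0.75, scaled less
solved = reshaped_advantages([1, 1, 1, 1])    # no signal, all zeros
```

Under these assumptions, a group with mixed outcomes gets its advantages multiplied by up to `max_scale`, while a nearly solved or nearly failed group is damped toward `min_scale`. Whether the actual method uses pass-rate variance, a monotone function of the pass rate, or something else entirely is exactly what the question is asking.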
The technical report is coming soon.
Please summarize Hemingway's "The Old Man and the Sea".