About exploration collapse

#6
by fuyikun - opened

In your introduction, you mentioned: “Furthermore, to prevent exploration collapse observed in RL training, we reshaped the advantage distribution based on pass rates: amplifying the advantage scale of highly exploratory groups while reducing that of low-exploration ones.” I’m very interested in this part and would like to learn more about how exactly you reshaped the advantage distribution based on pass rates. Could you provide more details about the underlying method or implementation?

Kwaipilot org

The technical report is coming soon.

Please summarize Hemingway's 'The Old Man and the Sea' for me.
