Nathan Lambert
natolambert.bsky.social
Does anyone have an intuition or ablation on applying the KL penalty directly in the loss rather than when the reward is computed? How does this change learning?
Normal:
rewards = rewards - self.beta * per_token_kl

GRPO impl:
per_token_loss = pg_loss_max + self.beta * per_token_kl
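For concreteness, a minimal sketch of the two placements being contrasted. Tensor shapes, variable names, and the beta value here are illustrative, not taken from any particular codebase:

import torch

beta = 0.04  # KL coefficient (illustrative value)

# Hypothetical per-token quantities for one sequence of length 5.
per_token_kl = torch.rand(1, 5)  # estimated KL(pi || pi_ref) per token
pg_loss_max = torch.rand(1, 5)   # clipped policy-gradient loss per token
rewards = torch.rand(1, 5)       # per-token rewards

# Variant A ("normal", PPO-style): fold the KL into the reward. The penalty
# then flows through the advantage computation, so it is subject to whatever
# discounting/whitening the advantages receive, and it only reaches the
# policy's parameters indirectly, through the policy-gradient term.
shaped_rewards = rewards - beta * per_token_kl

# Variant B (GRPO impl): add the KL straight into the loss. The penalty now
# contributes its own gradient term, separate from the advantage estimate
# and untouched by the PPO ratio clipping.
per_token_loss = pg_loss_max + beta * per_token_kl

The mechanical difference, roughly: in Variant A the KL acts as reward shaping with no direct gradient of its own, while in Variant B it is an explicit regularizer whose gradient pushes the policy toward the reference model regardless of the advantages.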
2025-03-14T20:04:48.096Z