I thought PPO would be daunting, since it’s used to train pretty much every LLM and shows up in frontier robotics research. It turns out it’s actually much simpler than TRPO, just like how GRPO is simpler than PPO.

The paper itself is easy to read. Beyond that, there’s OpenAI Spinning Up, which offers simpler formulas and explanations (why aren’t those in the paper? I derived them myself anyway since it was confusing), and the “A Primer on LLM Post-Training” post on the PyTorch blog. Note that the blog post makes some claims I think are wrong, e.g. when explaining why there’s the min in the PPO formula. I think the author confused themselves.

Anyway, let’s get started!


The following notes come from my conversation with Claude Sonnet 4.6.


PPO asks the same question as TRPO: how do we take the biggest possible improvement step without accidentally collapsing performance? Where TRPO answers with complex second-order machinery, PPO is a family of first-order methods that enforce the trust region through the loss function shape itself — no Fisher matrix, no conjugate gradient, just plain SGD/Adam.

There are two variants: PPO-Clip (primary) and PPO-Penalty (adaptive KL). This note focuses on PPO-Clip.

The Clipped Surrogate Objective

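Let $r_t(\theta) = \frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{k}}(a_{t}|s_{t})}$ be the probability ratio, and write $A_t$ as shorthand for the advantage $A^{\pi_{\theta_{k}}}(s_{t}, a_{t})$. The raw (unclipped) surrogate is just $r_t(\theta) A_t$: this is $L^{CPI}$, the policy gradient objective with importance sampling.

The PPO-Clip objective is:

$$L^{CLIP}(\theta) = \mathbb{E}_{t} \left[ \min \left( r_t(\theta) A_t, \; g(\epsilon, A_t) \right) \right]$$

where the clipping function is:

$$g(\epsilon, A) = \begin{cases} (1+\epsilon) A & A \geq 0 \\ (1-\epsilon) A & A < 0 \end{cases}$$

This simplified form (from SpinningUp) is equivalent to the original paper’s $\min \left( r_t(\theta) A_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t \right)$ but makes the intent clearer.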
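To convince myself the two forms really agree, here is a minimal sketch in plain Python that evaluates both per sample and compares them over a grid of ratios and advantages. The function names and the default `eps=0.2` are my own choices for illustration, not anything from the paper or SpinningUp code.

```python
import math

def ppo_clip_simplified(logp_new, logp_old, adv, eps=0.2):
    """SpinningUp-style form for one sample: min(r * A, g(eps, A))."""
    ratio = math.exp(logp_new - logp_old)   # r_t = pi_theta / pi_theta_k
    g = (1 + eps) * adv if adv >= 0 else (1 - eps) * adv
    return min(ratio * adv, g)

def ppo_clip_paper(logp_new, logp_old, adv, eps=0.2):
    """Paper form for one sample: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1 - eps, min(ratio, 1 + eps))
    return min(ratio * adv, clipped_ratio * adv)

# Compare the two forms over ratios 0.05 .. 2.0 and both advantage signs.
max_gap = 0.0
for i in range(1, 41):
    ratio = i * 0.05
    for adv in (-2.0, -0.5, 0.0, 0.5, 2.0):
        a = ppo_clip_simplified(math.log(ratio), 0.0, adv)
        b = ppo_clip_paper(math.log(ratio), 0.0, adv)
        max_gap = max(max_gap, abs(a - b))

print("largest difference between the two forms:", max_gap)
```

Working with log-probabilities and exponentiating the difference is the usual way to get the ratio, since policies typically expose log-probs rather than raw probabilities.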

Why min() on top of clipping?

Clipping alone just flatlines the objective at the boundary — it stops rewarding further movement, but doesn’t penalize overshooting. The min() makes it a pessimistic lower bound: it reintroduces the worse (unclipped) value whenever you’ve moved so far that it’s worse than the clipped version. The two cases make this concrete:

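When $A_t > 0$ (action was good, increase its probability):

  • Past $r_t = 1+\epsilon$, the clipped term flatlines at $(1+\epsilon) A_t$
  • The unclipped term $r_t A_t$ keeps growing, so min = clipped (flatline) ✓
  • No gradient incentive to keep pushing $r_t$ above $1+\epsilon$

When $A_t < 0$ (action was bad, decrease its probability):

  • Past $r_t = 1+\epsilon$ (the update went the wrong way and raised the probability), the clipped term flatlines at $(1+\epsilon) A_t$
  • The unclipped term $r_t A_t$ keeps getting more negative, so min = unclipped (penalty) ✓
  • Overshoot is still penalized even though you’re outside the clip band

The mirror images hold below the band: for $A_t < 0$ and $r_t < 1-\epsilon$ the min picks the clipped term $(1-\epsilon) A_t$ (flatline, no incentive to push the probability down further), and for $A_t > 0$ and $r_t < 1-\epsilon$ it picks the unclipped term (the wrong-way move is still penalized).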
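Plugging in numbers helped me here. The sketch below prints the unclipped term, the clipped term, and their min for a few ratios with an advantage of +1 and -1, using the paper's form of the objective. The helper name `terms` and the choice `eps=0.2` are just for illustration.

```python
def terms(ratio, adv, eps=0.2):
    """Return (unclipped, clipped, objective) for one sample, paper form."""
    unclipped = ratio * adv
    clipped = max(1 - eps, min(ratio, 1 + eps)) * adv
    return unclipped, clipped, min(unclipped, clipped)

for adv in (+1.0, -1.0):
    for ratio in (0.5, 1.0, 1.5):
        unclipped, clipped, objective = terms(ratio, adv)
        # When both terms coincide (inside the band) we report "unclipped".
        picked = "unclipped" if objective == unclipped else "clipped"
        print(f"A={adv:+.0f} r={ratio:.1f} unclipped={unclipped:+.2f} "
              f"clipped={clipped:+.2f} min={objective:+.2f} ({picked})")
```

With a positive advantage and a ratio of 1.5 the min lands on the clipped value 1.2, while with a negative advantage and the same ratio it lands on the unclipped value -1.5, which is the flatline-versus-penalty asymmetry described above.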

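Equivalently, you could write the per-sample objective (ratio $r_t$, advantage $A_t$) with a conditional that only clips the side that would let the objective keep improving:

$$L = \begin{cases} \min(r_t, 1+\epsilon) \, A_t & A_t \geq 0 \\ \max(r_t, 1-\epsilon) \, A_t & A_t < 0 \end{cases}$$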

The min() formulation is just a branchless way to express this.
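A small sketch of that claim: the branchy version and the min() version side by side, plus the derivative with respect to the ratio, which is zero exactly on the side that gets clipped. All three function names are mine, and `eps=0.2` is an assumed value.

```python
def objective_min(ratio, adv, eps=0.2):
    """Branchless form: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1 - eps, min(ratio, 1 + eps)) * adv
    return min(ratio * adv, clipped)

def objective_branch(ratio, adv, eps=0.2):
    """Conditional form: clip only the side that would keep improving."""
    if adv >= 0:
        return min(ratio, 1 + eps) * adv
    return max(ratio, 1 - eps) * adv

def grad_wrt_ratio(ratio, adv, eps=0.2):
    """d(objective)/d(ratio), valid away from the kinks at 1 - eps and 1 + eps."""
    if adv >= 0 and ratio > 1 + eps:
        return 0.0   # good action, already pushed far enough: flat
    if adv < 0 and ratio < 1 - eps:
        return 0.0   # bad action, already pushed far enough: flat
    return adv       # everywhere else the unclipped term is active

for adv in (+1.0, -1.0):
    for ratio in (0.5, 1.0, 1.5):
        print(adv, ratio, objective_min(ratio, adv),
              objective_branch(ratio, adv), grad_wrt_ratio(ratio, adv))
```

In an autodiff framework you never write the gradient mask yourself, since differentiating through the min does it for you, but seeing it spelled out makes it obvious which samples stop contributing to the update.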

Why This Works Better Than TRPO in Practice

But TRPO has monotonic improvement guarantees — how can PPO beat it?

The theoretical bound is extremely loose. What matters empirically is “does this update move in a good direction without destabilizing training?” Clipping is a robust heuristic for this. PPO also benefits from the multiple SGD epochs squeezing more signal from each batch of environment data, which TRPO simply can’t do. The complexity TRPO pays for its guarantee buys very little in practice. I pressed Sonnet 4.6 really hard and it admitted it’s all empirical.

PPO-Penalty (Adaptive KL)

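Instead of clipping, penalize KL directly:

$$L^{KLPEN}(\theta) = \mathbb{E}_{t} \left[ \frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{k}}(a_{t}|s_{t})} A^{\pi_{\theta_{k}}}(s_{t}, a_{t}) - \beta \, \text{KL}\left[ \pi_{\theta_{k}}(\cdot|s_{t}) \,\|\, \pi_{\theta}(\cdot|s_{t}) \right] \right]$$

$\beta$ is adjusted each update, based on the measured KL $d = \mathbb{E}_{t} \left[ \text{KL}\left[ \pi_{\theta_{k}}(\cdot|s_{t}) \,\|\, \pi_{\theta}(\cdot|s_{t}) \right] \right]$ and a target $d_{targ}$:

  • If $d > 1.5 \, d_{targ}$: increase $\beta$ (the paper doubles it)
  • If $d < d_{targ} / 1.5$: decrease $\beta$ (the paper halves it)
  • Otherwise: leave it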
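The adjustment rule is short enough to write down directly. This is a minimal sketch using the paper's heuristic constants (threshold factor 1.5, multiply or divide by 2); the function and parameter names are my own.

```python
def kl_penalized_objective(ratio, adv, kl, beta):
    """Per-sample PPO-Penalty objective: surrogate minus beta-weighted KL."""
    return ratio * adv - beta * kl

def update_beta(beta, kl, kl_target, tol=1.5, factor=2.0):
    """Adaptive-KL rule: switch beta only when KL leaves the deadband."""
    if kl > kl_target * tol:
        return beta * factor   # policy moved too far: penalize harder
    if kl < kl_target / tol:
        return beta / factor   # policy barely moved: relax the penalty
    return beta                # inside the deadband: leave it

beta = 1.0
for measured_kl in (0.05, 0.03, 0.012, 0.002):
    beta = update_beta(beta, measured_kl, kl_target=0.01)
    print(f"kl={measured_kl:.3f} -> beta={beta}")
```

The measured KL values in the loop are made up purely to exercise all three branches.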

Note

This is basically bang-bang control with a deadband: a discrete switching rule on threshold crossings. The paper admits $\beta$-tuning is fiddly, and PPO-Clip empirically outperforms this variant. The more principled version of adaptive $\beta$ is dual gradient descent (gradient ascent on the Lagrange multiplier), which has actual convergence theory and shows up in constrained RL (CMDPs) and RLHF KL tuning.
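For contrast with the switching rule, here is a rough sketch of what a dual-ascent update on $\beta$ could look like. This is not from the PPO paper: the update is the generic projected gradient-ascent step on a Lagrange multiplier for the constraint KL ≤ target, and `toy_kl_response` is a completely hypothetical stand-in for "how much KL the policy update produces at a given penalty", chosen only so the loop has something to converge on.

```python
def dual_ascent_step(beta, kl, kl_target, lr):
    """One projected gradient-ascent step on the multiplier of kl <= kl_target."""
    return max(0.0, beta + lr * (kl - kl_target))

def toy_kl_response(beta):
    """Hypothetical environment: a larger penalty yields a smaller policy step."""
    return 0.1 / (1.0 + beta)

beta = 1.0
for step in range(200):
    kl = toy_kl_response(beta)
    beta = dual_ascent_step(beta, kl, kl_target=0.02, lr=50.0)

print("beta:", beta, "kl:", toy_kl_response(beta))
```

In this toy setup the multiplier settles where the measured KL equals the target (here $\beta = 4$), moving by an amount proportional to the violation instead of jumping by a fixed factor.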

Entropy Bonus

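The full objective often adds an entropy term:

$$L(\theta) = L^{CLIP}(\theta) + c_{2} \, \mathbb{E}_{t} \left[ H[\pi_{\theta}](s_{t}) \right]$$

where $H[\pi_{\theta}](s) = -\sum_{a} \pi_{\theta}(a|s) \log \pi_{\theta}(a|s)$ is the entropy of the policy at state $s$ and $c_{2} > 0$ is a small coefficient.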

This prevents policy collapse — gradient descent naturally wants to concentrate probability on the best-so-far action, which kills exploration. High entropy = spread distribution (exploratory). The entropy bonus penalizes overconfident policies, keeping exploration alive longer.

This is the same regularization intuition as load balancing loss in Mixture of Experts: the primary objective has no incentive to maintain diversity, so you add a term that explicitly fights the degenerate low-entropy solution. The difference is what “collapse” means — MoE collapses across experts for a token; policy gradient collapses across actions for a state.
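To make the entropy term concrete, here is a minimal sketch for a discrete (categorical) policy. The probabilities, the clip-objective value, and the coefficient `c2 = 0.01` are made-up numbers for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy of a categorical distribution (0 * log 0 treated as 0)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def objective_with_bonus(clip_objective, probs, c2=0.01):
    """Clipped objective plus the entropy bonus for one state."""
    return clip_objective + c2 * entropy(probs)

uniform = [0.25, 0.25, 0.25, 0.25]   # maximally exploratory
peaked = [0.97, 0.01, 0.01, 0.01]    # nearly collapsed onto one action

print("uniform entropy:", entropy(uniform))
print("peaked entropy: ", entropy(peaked))
print("bonus difference at equal clip objective:",
      objective_with_bonus(1.0, uniform) - objective_with_bonus(1.0, peaked))
```

At the same clipped objective, the spread-out policy scores slightly higher, so the optimizer needs a real advantage signal before it is worth collapsing onto a single action.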

\begin{algorithm}
\begin{algorithmic}
\REQUIRE initial policy parameters $\theta_{0}$, initial value function parameters $\phi_{0}$
\FOR{$k = 0, 1, 2, \dots$}
    \STATE Collect set of trajectories $\mathcal{D}_{k} = \{\tau_{i}\}$ by running policy $\pi_{k} = \pi(\theta_{k})$ in the environment.
    \STATE Compute rewards-to-go $\hat{R}_{t}$.
    \STATE Compute advantage estimates, $\hat{A}_{t}$ (using any method of advantage estimation) based on the current value function $V_{\phi_{k}}$.
    \STATE Update the policy by maximizing the PPO-Clip objective:
    \STATE $$\theta_{k+1} = \arg \max_{\theta} \frac{1}{|\mathcal{D}_{k}|T} \sum_{\tau \in \mathcal{D}_{k}} \sum_{t=0}^{T} \min \left( \frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{k}}(a_{t}|s_{t})} A^{\pi_{\theta_{k}}}(s_{t}, a_{t}), g(\epsilon, A^{\pi_{\theta_{k}}}(s_{t}, a_{t})) \right)$$
    \STATE \textit{typically via stochastic gradient ascent with Adam.}
    \STATE Fit value function by regression on mean-squared error:
    \STATE $$\phi_{k+1} = \arg \min_{\phi} \frac{1}{|\mathcal{D}_{k}|T} \sum_{\tau \in \mathcal{D}_{k}} \sum_{t=0}^{T} \left( V_{\phi}(s_{t}) - \hat{R}_{t} \right)^{2}$$
    \STATE \textit{typically via some gradient descent algorithm.}
\ENDFOR
\end{algorithmic}
\end{algorithm}
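Here is a toy end-to-end sketch of the loop above, assuming the simplest setting I could think of: a 3-armed bandit with deterministic rewards and a softmax policy over logits. It is not a faithful implementation of the pseudocode: there are no states or trajectories, the value function is replaced by the batch-mean reward as a baseline, and the optimizer is plain gradient ascent instead of Adam. All names and hyperparameters are my own choices.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ppo_clip_bandit(rewards, iters=200, batch=256, epochs=4, lr=0.5,
                    eps=0.2, seed=0):
    rng = random.Random(seed)
    n = len(rewards)
    logits = [0.0] * n
    for _ in range(iters):
        old_probs = softmax(logits)                  # pi_{theta_k}, frozen
        actions = rng.choices(range(n), weights=old_probs, k=batch)
        returns = [rewards[a] for a in actions]
        baseline = sum(returns) / batch              # stand-in for V_phi
        advs = [r - baseline for r in returns]
        for _ in range(epochs):                      # reuse the same batch
            probs = softmax(logits)
            grad = [0.0] * n
            for a, adv in zip(actions, advs):
                ratio = probs[a] / old_probs[a]
                clipped_side = ((adv >= 0 and ratio > 1 + eps) or
                                (adv < 0 and ratio < 1 - eps))
                if clipped_side:
                    continue                         # objective is flat here
                # d(ratio * adv)/d(logit_j) = adv * ratio * (1[j == a] - probs[j])
                for j in range(n):
                    indicator = 1.0 if j == a else 0.0
                    grad[j] += adv * ratio * (indicator - probs[j]) / batch
            logits = [x + lr * g for x, g in zip(logits, grad)]  # ascent step
    return softmax(logits)

probs = ppo_clip_bandit([0.0, 0.5, 1.0])
print([round(p, 3) for p in probs])
```

The inner `epochs` loop is the part TRPO has no equivalent of: the same batch is reused for several gradient steps, and the clip mask quietly drops samples whose ratio has already moved far enough in the improving direction.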