The following stuff is from CS285, lecture 9.
One could also think about Policy Gradient as a soft form of Policy Iteration. Instead of directly changing policy according to the belief of advantage, we just adjust the policy a bit.
This idea would lead to TRPO, PPO and others.
As policy iteration update to be the new , policy iteration is basically optimizing this in the policy improvement part:
Why? This is “pick the best action for each state, based on previous belief of the value functions”.
Recall the real RL objective , we can show that (process omitted):
This means optimizing the RL objective IS the policy iteration objective. And if we swap that to for , we just have our off policy policy gradient back exactly. If we evaluate at , we got policy gradient back.
Wait, isn't policy gradient also directly optimizing RL objective?
You might be wondering why both are optimized directly on the RL objective, but they are different. The reason is that the policy gradient objective is really a first-order approximation. It’s an infinitesimal policy iteration. That’s why if you evaluate it at we get policy gradient objective, a surrogate objective.
Now, why would we want to swap that to for ? If you think about how we really train our model, what we have is current policy, not the “next policy” as we do optimization, and that’s how policy gradient work. And that makes things pretty simple.
Claim: is close to when is close to .
We can prove it the same way we get the error bound of Imitation Learning. The core idea is that suppose the policies are different by total variation distance , and we can show is close.
So we can now say: yes this can be swapped as long as is close.
This is now very similar to Natural Gradient, but hey we are not there yet.
Quick primer of constrained optimization
Suppose you want to solve:
This is a constrained optimization problem. Instead of optimizing inside a hard constraint region, we introduce a penalty variable (the Lagrange multiplier) and form:
Interpretation:
- If the constraint is satisfied, fine.
- If , the penalty term becomes active.
- adjusts how harshly we penalize constraint violation.
So instead of solving a constrained problem directly, we solve a saddle point problem:
This is called the primal–dual formulation. It’s only exact under certain convexity conditions. It always gives an upper bound on the primal optimum.
Now since we are dealing with NN, we can’t solve it nicely. We can use dual gradient descent for this.
Or we can just do second order Taylor expansion and approximate by , Fisher information, which leads to Natural Gradient. This is doable since now we convert to a quadratic function: , which makes it easy to do , solvable in closed form.
If we solve that we got
There’s more in TRPO on how to do efficient Fisher-vector products.
The following is from a conversation with Claude Sonnet 4.6, when reading PPO paper.
Efficient Fisher-Vector Products via Conjugate Gradient
is — impossible to store or invert for any real network. Instead, TRPO reformulates as solving the linear system using conjugate gradient (CG), which only needs matrix-vector products , never itself.
Computing for arbitrary is cheap. Since :
This requires two backward passes (or one forward-over-backward autodiff pass) — no storage. CG runs ~10 iterations, each needing one product, to approximate . Then is computed from the closed-form formula above.
Cost
~10x more expensive than a plain SGD step, plus requires custom autodiff infrastructure. This is exactly what PPO wanted to escape.
Limitations of TRPO
Incompatible with parameter sharing. The KL constraint is defined purely over policy outputs. When policy and value function share parameters, a gradient step satisfying the KL constraint may cause a large unconstrained update to the value head — or the value loss gradient may yank shared parameters in a way that violates the policy’s KL budget. There’s no clean way to enforce the constraint on the policy while jointly optimizing a value loss through shared weights.
Incompatible with dropout. CG requires multiple forward passes to compute Fisher-vector products. Dropout samples a different mask each pass, so each CG iteration is computing for a different network. CG’s convergence guarantee assumes repeated multiplication by the same matrix — dropout breaks this structurally, not just noisily. Freezing the mask is a workaround but adds yet more implementation complexity.
The approximation is imperfect anyway. The Fisher approximation (quadratic KL) and finite CG iterations mean TRPO doesn’t actually solve the constrained problem exactly. The complexity cost is real; the theoretical guarantee is approximate.
Where does the penalty form come from?
The theory (an error bound proof, analogous to Imitation Learning) shows there exists some such that optimizing surrogate guarantees monotonic improvement. The absorbs horizon, discount, and bound constants — it’s problem-dependent and changes over training. This is why TRPO uses the hard constraint form instead: is far more stable to tune than . See PPO for how this tension is resolved differently.