TRPO

The following is from CS285, Lecture 9.

One could also think about Policy Gradient as a soft form of Policy Iteration. Instead of switching the policy outright to whatever the current advantage estimates favor, we just adjust the policy a bit in that direction.

This idea would lead to TRPO, PPO and others.

When policy iteration updates $θ$ to the new $θ'$, the policy improvement step is basically optimizing this:

$$E_{τ \sim p_{θ'}(τ)} \left[ \sum_t γ^t A^{π_θ}(s_t, a_t) \right]$$

Why? This is “pick the best action for each state, based on previous belief of the value functions”.

Recall the real RL objective $J(θ)$. We can show that (derivation omitted):

$$J(θ) = E_{τ \sim p_θ(τ)} \left[ \sum_t γ^t r(s_t, a_t) \right]$$

$$J(θ') - J(θ) = E_{τ \sim p_{θ'}(τ)} \left[ \sum_t γ^t A^{π_θ}(s_t, a_t) \right] = \sum_t E_{s_t \sim p_{θ'}(s_t)} \left[ E_{a_t \sim π_θ(a_t ∣ s_t)} \left[ \frac{π_{θ'}(a_t ∣ s_t)}{π_θ(a_t ∣ s_t)} γ^t A^{π_θ}(s_t, a_t) \right] \right]$$

This means optimizing the RL objective IS the policy iteration objective. And if we swap $p_{θ'}$ for $p_θ$ in the distribution over $s_t$, we get our off-policy policy gradient objective back exactly. If we differentiate with respect to $θ'$ and evaluate at $θ' = θ$, we get the policy gradient back.
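The identity above can be checked numerically on a small tabular MDP. The sketch below is only an illustration under assumptions that are not in the lecture: a random MDP (`P`, `R`, `rho` are made-up stand-ins), an infinite-horizon discounted setting, and exact policy evaluation through a linear solve.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9

# Hypothetical random MDP: P[s, a, s'] are transition probabilities, R[s, a] rewards.
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A))
rho = np.full(S, 1.0 / S)  # initial state distribution

def random_policy():
    pi = rng.random((S, A))
    return pi / pi.sum(axis=1, keepdims=True)

def evaluate(pi):
    """Exact V, Q and J = E[sum_t gamma^t r] for a tabular policy."""
    P_pi = np.einsum("sa,sat->st", pi, P)      # state-to-state kernel under pi
    r_pi = (pi * R).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = R + gamma * P @ V
    return V, Q, rho @ V

def discounted_visitation(pi):
    """sum_t gamma^t p(s_t = s) under pi (unnormalised)."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    return np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)

pi_old, pi_new = random_policy(), random_policy()
V_old, Q_old, J_old = evaluate(pi_old)
_, _, J_new = evaluate(pi_new)
A_old = Q_old - V_old[:, None]                 # advantage of the OLD policy

# States from the NEW policy, actions from the OLD policy, reweighted by the ratio.
d_new = discounted_visitation(pi_new)
ratio = pi_new / pi_old
rhs = (d_new[:, None] * pi_old * ratio * A_old).sum()
lhs = J_new - J_old
print(lhs, rhs)  # the two numbers should agree up to floating-point error
```

Note that the state distribution must come from the new policy for the equality to hold exactly; replacing `d_new` with the old policy's visitation gives the surrogate discussed next.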

Wait, isn't policy gradient also directly optimizing RL objective?

You might be wondering why both optimize the RL objective directly and yet are different. The reason is that the policy gradient objective is really a first-order approximation. It’s an infinitesimal policy iteration. That’s why if you evaluate it at $θ' = θ$ you get the policy gradient objective, a surrogate objective.

Now, why would we want to swap $p_{θ'}$ for $p_θ$ in the distribution over $s_t$? If you think about how we really train our model, the samples we have come from the current policy, not from the “next policy” we are optimizing for, and that’s how policy gradient works. Sampling states from $p_θ$ makes things pretty simple.

Claim: $p_{θ} (s_{t})$ is close to $p_{θ^{'}} (s_{t})$ when $π_{θ}$ is close to $π_{θ^{'}}$ .

We can prove it the same way we get the error bound of Imitation Learning. The core idea is to suppose the policies differ by at most $ϵ$ in total variation distance at every state. Then the state marginals satisfy $|p_{θ'}(s_t) - p_θ(s_t)| \leq 2ϵt$, so $\sum_t E_{s_t \sim p_{θ'}(s_t)}[\cdot]$ is close to $\sum_t E_{s_t \sim p_θ(s_t)}[\cdot]$.

So we can now say: yes, this can be swapped as long as $π_{θ'}$ is close to $π_θ$.
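A quick numerical illustration of the claim, under assumptions of my own: a random tabular MDP, a new policy built by mixing the old one with a little of another random policy, $ϵ$ taken as the worst-case total variation distance over states, and the gap between state marginals measured as a sum of absolute differences.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, T = 5, 3, 20

# Hypothetical random MDP and uniform initial state distribution.
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
rho = np.full(S, 1.0 / S)

pi_old = rng.random((S, A))
pi_old /= pi_old.sum(axis=1, keepdims=True)
other = rng.random((S, A))
other /= other.sum(axis=1, keepdims=True)
pi_new = 0.95 * pi_old + 0.05 * other          # a policy close to pi_old

# Worst-case total variation distance between the two policies.
eps = 0.5 * np.abs(pi_new - pi_old).sum(axis=1).max()

def state_marginals(pi):
    """Row t of the result is the state distribution p(s_t) under pi."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    p, rows = rho.copy(), []
    for _ in range(T):
        rows.append(p)
        p = p @ P_pi
    return np.array(rows)

gap = np.abs(state_marginals(pi_new) - state_marginals(pi_old)).sum(axis=1)
bound = 2 * eps * np.arange(T)                 # the 2 * eps * t bound
print(np.all(gap <= bound + 1e-12))
```

The bound grows linearly in $t$, which is why it is only useful when $ϵ$ is small.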

$$θ' \leftarrow \arg\max_{θ'} \sum_t E_{s_t \sim p_θ(s_t)} \left[ E_{a_t \sim π_θ(a_t ∣ s_t)} \left[ \frac{π_{θ'}(a_t ∣ s_t)}{π_θ(a_t ∣ s_t)} γ^t A^{π_θ}(s_t, a_t) \right] \right]$$

$$\text{such that } D_{KL}(π_{θ'}(a_t ∣ s_t) ∥ π_θ(a_t ∣ s_t)) \leq ϵ$$
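In practice both pieces of this problem are estimated from samples collected with the current policy. Here is a minimal sketch for a categorical policy; the function name `surrogate_and_kl` and its arguments are hypothetical, and the $γ^t$ weight is assumed to be already folded into `advantages`.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def surrogate_and_kl(logits_new, logits_old, actions, advantages):
    """Sample estimates of the importance-weighted objective and the mean KL.

    logits_*: (N, num_actions) policy logits at the sampled states
    actions:  (N,) actions sampled from the OLD policy
    advantages: (N,) advantage estimates under the OLD policy
    """
    logp_new, logp_old = log_softmax(logits_new), log_softmax(logits_old)
    idx = np.arange(len(actions))
    ratio = np.exp(logp_new[idx, actions] - logp_old[idx, actions])
    surrogate = np.mean(ratio * advantages)
    # KL(pi_new || pi_old), the same direction as in the constraint above.
    kl = np.mean((np.exp(logp_new) * (logp_new - logp_old)).sum(axis=1))
    return surrogate, kl

rng = np.random.default_rng(0)
logits = rng.normal(size=(16, 4))
actions = rng.integers(0, 4, size=16)
adv = rng.normal(size=16)

# With identical policies the ratio is 1, so the surrogate is the mean
# advantage and the KL is 0.
print(surrogate_and_kl(logits, logits, actions, adv))
```

A candidate $θ'$ is acceptable when the returned KL is at most $ϵ$; among acceptable candidates we want the largest surrogate.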

This is now very similar to Natural Gradient, but hey we are not there yet.

Quick primer of constrained optimization

Suppose you want to solve:
$\max_{θ'} f(θ') \text{ s.t. } g(θ') \leq ε$
This is a constrained optimization problem. Instead of optimizing inside a hard constraint region, we introduce a penalty variable (the Lagrange multiplier) $λ \geq 0$ and form:
$L(θ', λ) = f(θ') - λ (g(θ') - ε)$
Interpretation:

If the constraint is satisfied, fine.

If $g (θ^{'}) > ε$ , the penalty term becomes active.

$λ$ adjusts how harshly we penalize constraint violation.

So instead of solving a constrained problem directly, we solve a saddle point problem:
$\max_{θ'} \min_{λ \geq 0} L(θ', λ)$
This is called the primal–dual formulation. Swapping the order to $\min_{λ \geq 0} \max_{θ'} L(θ', λ)$ gives the dual problem. The dual always gives an upper bound on the primal optimum, and it’s only exact under certain convexity conditions.

Now since we are dealing with neural networks, we can’t solve it nicely. We can use dual gradient descent for this: alternate between maximizing $L$ over $θ'$ and taking a gradient step on $λ$.
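To make dual gradient descent concrete, here is a toy one-dimensional sketch. The problem itself is an assumption chosen for illustration: maximize $f(x) = -(x - 2)^2$ subject to $g(x) = x^2 \leq 1$, whose constrained optimum is $x = 1$. The inner maximization happens to have a closed form here; with a neural network it would be a few gradient steps instead.

```python
def f(x):
    return -(x - 2.0) ** 2

def g(x):
    return x ** 2

eps = 1.0            # constraint level: g(x) <= eps
lam, lr = 0.0, 0.5   # Lagrange multiplier and its step size

for _ in range(200):
    # Inner step: maximise L(x, lam) = f(x) - lam * (g(x) - eps) over x.
    # Setting dL/dx = -2 (x - 2) - 2 lam x = 0 gives the closed form below.
    x = 2.0 / (1.0 + lam)
    # Outer step: gradient descent on lam. Since dL/dlam = -(g(x) - eps),
    # lam goes up when the constraint is violated and down when it is slack.
    lam = max(0.0, lam + lr * (g(x) - eps))

x_star = 2.0 / (1.0 + lam)
print(x_star, lam)   # both should approach 1
```

The multiplier settles at the value where the constraint is exactly tight, which is the behavior we want from the KL constraint as well.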

Or we can take a first-order Taylor expansion of the objective and a second-order Taylor expansion of the $KL$, whose Hessian at $θ' = θ$ is $F$, the Fisher information matrix. This leads to Natural Gradient. It is doable because the $KL$ becomes a quadratic function, $\frac{1}{2} Δθ^{T} F Δθ$ with $Δθ = θ' - θ$, so setting $\nabla_{θ'}$ of the Lagrangian to zero is solvable in closed form.

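If we solve that, we get

$$θ' = θ + α F^{-1} \nabla_θ J(θ), \qquad α = \sqrt{\frac{2ϵ}{\nabla_θ J(θ)^{T} F^{-1} \nabla_θ J(θ)}}$$

The step size $α$ follows from plugging $Δθ = α F^{-1} \nabla_θ J(θ)$ into $\frac{1}{2} Δθ^{T} F Δθ = ϵ$.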
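A small sanity check of the natural gradient step, assuming the step size is chosen so that the quadratic KL approximation $\frac{1}{2} Δθ^{T} F Δθ$ lands exactly on $ϵ$. The matrix `F` and vector `g` below are random stand-ins, not a real Fisher matrix or policy gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
d, eps = 6, 0.01

M = rng.normal(size=(d, d))
F = M @ M.T + 1e-3 * np.eye(d)   # stand-in symmetric positive-definite "Fisher"
g = rng.normal(size=d)           # stand-in for grad_theta J(theta)

nat_grad = np.linalg.solve(F, g)             # F^{-1} g without forming the inverse
alpha = np.sqrt(2 * eps / (g @ nat_grad))    # denominator is g^T F^{-1} g
delta = alpha * nat_grad                     # theta' - theta

kl_quadratic = 0.5 * delta @ F @ delta
print(kl_quadratic, eps)                     # should match
```

Because the update direction is $F^{-1} g$, the step has the same KL size no matter how the policy is parameterized, which is the point of the natural gradient.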

There’s more in TRPO on how to do efficient Fisher-vector products.

The following is from a conversation with Claude Sonnet 4.6, while reading the PPO paper.

Efficient Fisher-Vector Products via Conjugate Gradient

$F$ is $∣ θ ∣ \times ∣ θ ∣$ — impossible to store or invert for any real network. Instead, TRPO reformulates $F^{- 1} g$ as solving the linear system $F x = g$ using conjugate gradient (CG), which only needs matrix-vector products $F v$ , never $F$ itself.

Computing $F v$ for arbitrary $v$ is cheap. Since $F = E[\nabla \log π \, \nabla \log π^{T}]$:

$$F v = E[(\nabla \log π^{T} v) \nabla \log π]$$

This requires two backward passes (or one forward-over-backward autodiff pass) — no $∣ θ ∣^{2}$ storage. CG runs ~10 iterations, each needing one $F v$ product, to approximate $F^{- 1} g$ . Then $α$ is computed from the closed-form formula above.
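The CG procedure can be sketched in NumPy as follows. This is a toy illustration, not TRPO's actual implementation: `scores` is a random stand-in for the per-sample gradients $\nabla \log π$ (a real implementation gets $F v$ from autodiff and never materializes them either), and the `damping` term is an assumption added for numerical stability.

```python
import numpy as np

def fisher_vector_product(scores, v, damping=1e-3):
    """F v = E[(grad_logp^T v) grad_logp], using only matrix-vector products."""
    return scores.T @ (scores @ v) / len(scores) + damping * v

def conjugate_gradient(fvp, g, iters=10, tol=1e-12):
    """Approximately solve F x = g given only the map v -> F v."""
    x = np.zeros_like(g)
    r, p = g.copy(), g.copy()        # residual and search direction
    rs = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        step = rs / (p @ Fp)
        x += step * p
        r -= step * Fp
        rs_new = r @ r
        if rs_new < tol:             # residual is small enough, stop early
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(0)
n, d = 512, 8
scores = rng.normal(size=(n, d))     # stand-in rows of grad_theta log pi(a_i | s_i)
g = rng.normal(size=d)               # stand-in policy gradient

x_cg = conjugate_gradient(lambda v: fisher_vector_product(scores, v), g, iters=d)

# Only for checking on this tiny problem: build F explicitly and solve directly.
F_explicit = scores.T @ scores / n + 1e-3 * np.eye(d)
x_direct = np.linalg.solve(F_explicit, g)
print(np.max(np.abs(x_cg - x_direct)))
```

The explicit matrix is affordable here only because `d` is 8; for a real network the CG path is the only feasible one.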

Cost

~10x more expensive than a plain SGD step, plus requires custom autodiff infrastructure. This is exactly what PPO wanted to escape.

Limitations of TRPO

Incompatible with parameter sharing. The KL constraint is defined purely over policy outputs. When policy and value function share parameters, a gradient step satisfying the KL constraint may cause a large unconstrained update to the value head — or the value loss gradient may yank shared parameters in a way that violates the policy’s KL budget. There’s no clean way to enforce the constraint on the policy while jointly optimizing a value loss through shared weights.

Incompatible with dropout. CG requires multiple forward passes to compute Fisher-vector products. Dropout samples a different mask each pass, so each CG iteration is computing $F v$ for a different network. CG’s convergence guarantee assumes repeated multiplication by the same matrix $F$ — dropout breaks this structurally, not just noisily. Freezing the mask is a workaround but adds yet more implementation complexity.

The approximation is imperfect anyway. The Fisher approximation (quadratic KL) and finite CG iterations mean TRPO doesn’t actually solve the constrained problem exactly. The complexity cost is real; the theoretical guarantee is approximate.

Where does the penalty form $β \cdot KL$ come from?

The theory (an error bound proof, analogous to Imitation Learning) shows there exists some $β$ such that optimizing $\text{surrogate} - β \cdot KL$ guarantees monotonic improvement. The $β$ absorbs horizon, discount, and bound constants, so it’s problem-dependent and changes over training. This is why TRPO uses the hard constraint form instead: $ϵ$ is far more stable to tune than $β$. See PPO for how this tension is resolved differently.

Yanda's Random Notes
