The following notes are from CS285, Lecture 9.
One could also think of Policy Gradient as a soft form of Policy Iteration: instead of greedily committing to whatever action the advantage estimate currently favors, we only adjust the policy a little in that direction.
This idea would lead to TRPO, PPO and others.
As policy iteration updates $\pi \leftarrow \pi'$ to be the new (greedy) policy, policy iteration is basically optimizing this in the policy improvement step:

$$\pi'(a_t \mid s_t) = \begin{cases} 1 & \text{if } a_t = \arg\max_{a_t} A^{\pi}(s_t, a_t) \\ 0 & \text{otherwise} \end{cases}$$
Why? This is “pick the best action in each state, according to our previous belief about the value functions”.
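To make the improvement step concrete, here is a minimal tabular sketch; the transition tensor `P`, reward table `R`, and value estimate `V` below are made-up illustrative values:

```python
import numpy as np

# Toy 2-state, 2-action MDP (hypothetical numbers, for illustration only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']: transition probs
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                  # R[s, a]: expected reward
              [0.0, 2.0]])
gamma = 0.9
V = np.array([0.5, 1.0])                   # previous belief of the values

# Advantage under the old value estimate: A(s, a) = Q(s, a) - V(s)
Q = R + gamma * (P @ V)                    # one-step lookahead per (s, a)
A = Q - V[:, None]

# Policy improvement: pick the best action in each state.
pi_new = np.argmax(A, axis=1)              # -> action 0 in s=0, action 1 in s=1
```

Policy iteration would then re-evaluate the value function under `pi_new` and repeat.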
Recall the real RL objective $J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t \gamma^t r(s_t, a_t)\right]$; we can show that (derivation omitted):

$$J(\theta') - J(\theta) = \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[\sum_t \gamma^t A^{\pi_\theta}(s_t, a_t)\right]$$
This means optimizing the RL objective IS the policy iteration objective: maximizing $J(\theta')$ is the same as maximizing the expected advantage under the new policy. If we importance-sample the actions and then swap the state distribution $p_{\theta'}(s_t)$ for $p_\theta(s_t)$, we just have our off-policy policy gradient surrogate back exactly:

$$\bar{A}(\theta') = \sum_t \mathbb{E}_{s_t \sim p_\theta(s_t)}\left[\mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\left[\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)} \, \gamma^t A^{\pi_\theta}(s_t, a_t)\right]\right]$$

If we evaluate its gradient at $\theta' = \theta$, we get the policy gradient back.
Wait, isn't policy gradient also directly optimizing RL objective?
You might be wondering: both seem to optimize the RL objective directly, so why are they different? The reason is that the policy gradient objective is really a first-order approximation; it's an infinitesimal policy iteration. That's why evaluating its gradient at $\theta' = \theta$ recovers the policy gradient: it is a surrogate objective.
Now, why would we want to swap $p_{\theta'}(s_t)$ for $p_\theta(s_t)$? If you think about how we actually train the model, what we have are samples from the current policy, not from the “next policy” we are optimizing over, and that is exactly how policy gradient works. And that makes things pretty simple.
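A minimal sampling sketch of the surrogate objective, ignoring the time index and discount for brevity; the two-action softmax policy and the advantage samples below are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

def pi(theta, s):
    """Hypothetical softmax policy over 2 actions, one logit per state."""
    logits = np.array([theta[s], 0.0])
    e = np.exp(logits - logits.max())
    return e / e.sum()

theta = np.array([0.3, -0.2])             # current policy parameters

# Samples drawn under the CURRENT policy: states, actions, advantage estimates.
states = rng.integers(0, 2, size=1000)
actions = np.array([rng.choice(2, p=pi(theta, s)) for s in states])
adv = rng.normal(size=1000)               # stand-in advantage estimates

def surrogate(theta_new):
    # A_bar(theta') = E_{p_theta}[ (pi_{theta'} / pi_theta) * A^{pi_theta} ]
    ratio = np.array([pi(theta_new, s)[a] / pi(theta, s)[a]
                      for s, a in zip(states, actions)])
    return np.mean(ratio * adv)

# At theta' = theta every importance ratio is exactly 1,
# so the surrogate reduces to the plain on-policy estimate.
assert np.isclose(surrogate(theta), adv.mean())
```

The importance ratios all being 1 at $\theta' = \theta$ is exactly the sense in which the gradient there reduces to the on-policy policy gradient.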
Claim: $p_\theta(s_t)$ is close to $p_{\theta'}(s_t)$ when $\pi_\theta$ is close to $\pi_{\theta'}$.
We can prove it the same way we derived the error bound for Imitation Learning. The core idea: suppose the policies differ by total variation distance at most $\epsilon$ in every state; then we can show the state marginals stay close, $|p_{\theta'}(s_t) - p_\theta(s_t)| \le 2\epsilon t$.
So we can now say: yes, the swap is valid as long as $\pi_{\theta'}$ stays close to $\pi_\theta$.
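A toy numeric check of the claim: two nearby Markov chains stand in for the state dynamics induced by $\pi_\theta$ and $\pi_{\theta'}$ (the transition matrices are made up), and the total variation between the state marginals indeed stays under $2\epsilon t$:

```python
import numpy as np

# Two slightly different induced dynamics; T[s_next, s], columns sum to 1.
T1 = np.array([[0.90, 0.20],
               [0.10, 0.80]])
T2 = np.array([[0.85, 0.25],
               [0.15, 0.75]])

# Worst-case per-state total variation between the induced next-state
# distributions; it plays the role of epsilon here.
eps = 0.5 * np.abs(T1 - T2).sum(axis=0).max()   # = 0.05

p1 = p2 = np.array([1.0, 0.0])                  # same start state
for t in range(1, 21):
    p1, p2 = T1 @ p1, T2 @ p2
    tv = 0.5 * np.abs(p1 - p2).sum()
    assert tv <= 2 * eps * t                    # marginals stay close
```

The bound is loose: in this example the chains mix quickly, so the actual gap plateaus while $2\epsilon t$ keeps growing linearly.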
This is now very similar to Natural Gradient, but hey we are not there yet.
Quick primer on constrained optimization
Suppose you want to solve:

$$\max_x f(x) \quad \text{s.t.} \quad g(x) \le 0$$

This is a constrained optimization problem. Instead of optimizing inside a hard constraint region, we introduce a penalty variable $\lambda \ge 0$ (the Lagrange multiplier) and form the Lagrangian:

$$\mathcal{L}(x, \lambda) = f(x) - \lambda \, g(x)$$
Interpretation:
- If the constraint is satisfied ($g(x) \le 0$), the penalty term is inactive; the optimal $\lambda$ shrinks toward $0$.
- If $g(x) > 0$, the penalty term becomes active and pushes $x$ back toward the feasible region.
- $\lambda$ adjusts how harshly we penalize constraint violation.
So instead of solving a constrained problem directly, we solve a saddle point problem:

$$\min_{\lambda \ge 0} \max_x \mathcal{L}(x, \lambda)$$
This is called the primal–dual formulation. It's only exact (zero duality gap) under certain convexity conditions; by weak duality, it always gives an upper bound on the primal optimum for this maximization form.
Now, since we are dealing with neural networks, we can't solve this nicely in closed form. We can use dual gradient descent: alternately (approximately) maximize $\mathcal{L}$ over the primal variable, then take a gradient step on $\lambda$ in the direction of the constraint violation.
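A sketch of dual gradient descent on a toy scalar problem; the function and constraint are made up so the inner maximization has a closed form (with a network you would take a few gradient steps on $x$ instead):

```python
# Toy problem: maximize f(x) = -(x - 3)^2  subject to  x^2 <= 1.
# Lagrangian: L(x, lam) = f(x) - lam * (x^2 - 1).
lam, alpha = 0.0, 0.1
for _ in range(2000):
    # 1) Maximize the Lagrangian over x (closed form for this toy f):
    #    dL/dx = -2(x - 3) - 2*lam*x = 0  ->  x = 3 / (1 + lam)
    x = 3.0 / (1.0 + lam)
    # 2) Dual gradient step: raise lam if the constraint is violated,
    #    lower it (clipped at 0) if the constraint is slack.
    lam = max(0.0, lam + alpha * (x**2 - 1.0))

# The iterates settle on the constrained optimum x = 1 (with lam = 2).
```

Note how $\lambda$ automatically finds the penalty weight that makes the unconstrained maximizer land exactly on the constraint boundary.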
Or we can just do a second-order Taylor expansion of the KL constraint and approximate $D_{KL}(\pi_{\theta'} \| \pi_\theta)$ by $\frac{1}{2}(\theta' - \theta)^T F (\theta' - \theta)$, where $F$ is the Fisher information matrix; this leads to Natural Gradient. This is doable since we have now converted the constraint to a quadratic function and can linearize the objective, $\bar{A}(\theta') \approx \nabla_\theta J(\theta)^T (\theta' - \theta)$, which makes it easy to do

$$\max_{\theta'} \; \nabla_\theta J(\theta)^T (\theta' - \theta) \quad \text{s.t.} \quad \frac{1}{2}(\theta' - \theta)^T F (\theta' - \theta) \le \epsilon,$$

solvable in closed form.
If we solve that, we get:

$$\theta' = \theta + \alpha F^{-1} \nabla_\theta J(\theta), \qquad \alpha = \sqrt{\frac{2\epsilon}{\nabla_\theta J(\theta)^T F^{-1} \nabla_\theta J(\theta)}}$$
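A quick numeric sketch of this update, with an illustrative gradient and Fisher matrix; note we solve $F x = \nabla J$ rather than ever inverting $F$:

```python
import numpy as np

# Illustrative values: gradient g and Fisher matrix F for a 2-parameter policy.
g = np.array([1.0, -0.5])
F = np.array([[2.0, 0.3],
              [0.3, 1.0]])
eps = 0.01                       # KL trust-region radius

Fg = np.linalg.solve(F, g)       # F^{-1} g: the natural gradient direction
alpha = np.sqrt(2 * eps / (g @ Fg))
step = alpha * Fg                # theta' = theta + step

# By construction, the quadratic KL estimate of the step is exactly eps:
# the update moves to the edge of the trust region.
assert np.isclose(0.5 * step @ F @ step, eps)
```

The natural gradient direction $F^{-1} \nabla J$ rescales the raw gradient so that the step is measured in policy space (KL) rather than parameter space.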
There’s more in the TRPO paper on how to compute Fisher-vector products efficiently, so that $F$ is never formed explicitly.