The source is the same as for Policy Gradient.

If you think about how we normally optimize with the gradient, it actually doesn’t make much sense: some parameters change the output probabilities a lot more than others, yet vanilla SGD scales every parameter’s step by the same learning rate.
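A toy sketch of that mismatch (my own example, not from the source): a Bernoulli “policy” with p = sigmoid(theta). The exact same step in parameter space moves the output distribution a lot in one region and almost not at all in another, which is the sense in which treating all parameters uniformly is the wrong geometry.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bernoulli_kl(p, q):
    """KL( Bernoulli(p) || Bernoulli(q) )."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Same parameter step, very different change in the output distribution,
# depending on where in parameter space we start.
step = 0.5
for theta in [0.0, 5.0]:
    p_before = sigmoid(theta)
    p_after = sigmoid(theta + step)
    print(f"theta={theta:.1f}: p {p_before:.4f} -> {p_after:.4f}, "
          f"KL = {bernoulli_kl(p_before, p_after):.6f}")
# Around theta = 0 the step moves p substantially; around theta = 5 it barely does,
# even though the step in parameter space is identical.
```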

The optimization can be seen as a constrained optimization problem: in the 2D case, that’s basically asking how far we should step along the x-axis (the parameter) so that the first-order approximation, the gradient times the step (which gives us the change in y, the loss), descends the most. We do not want to go too far, because the first-order approximation is only valid within a small neighborhood.
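Written out (my own notation: loss L, step Δθ, trust radius ε), vanilla steepest descent is exactly this constrained problem, and the Euclidean ball is where the “all parameters are equal” assumption gets baked in:

```latex
% Steepest descent as a constrained problem
% (notation is mine: loss L, step \Delta\theta, trust radius \epsilon).
\[
\Delta\theta^{\ast}
  = \arg\min_{\Delta\theta} \; \nabla_\theta L(\theta)^{\top} \Delta\theta
  \quad \text{subject to} \quad \lVert \Delta\theta \rVert_2^2 \le \epsilon .
\]
% With this Euclidean ball the minimizer is the plain gradient step,
% \Delta\theta^{\ast} \propto -\nabla_\theta L(\theta):
% every parameter direction is treated as equally "expensive".
```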

Really, the constraint should be “the output distribution does not change too much”, not “the parameters in all dimensions do not change too much”. So here comes the KL divergence: if we Taylor expand it around the current parameters, the zeroth- and first-order terms are both 0, so it can be approximated by just the second-order term, ½ ΔθᵀFΔθ, where F is the Fisher information matrix, which is also the Riemannian metric on the manifold of probability distributions. We’ll see later how TRPO is based on this idea.
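A quick numerical check of that approximation (a sketch I wrote, not from the source), for a categorical distribution parameterized by softmax logits: the exact KL between the distribution before and after a small parameter step closely matches ½ ΔθᵀFΔθ with F the Fisher information matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    return np.sum(p * np.log(p / q))

# Categorical distribution with logits theta: p = softmax(theta).
theta = rng.normal(size=4)
p = softmax(theta)

# Fisher information F = E_{x~p}[ grad log p(x) grad log p(x)^T ].
# For softmax logits, grad_theta log p(x) = e_x - p (one-hot minus probabilities),
# which gives F = diag(p) - p p^T.
eye = np.eye(len(p))
F = sum(p[i] * np.outer(eye[i] - p, eye[i] - p) for i in range(len(p)))

# Compare the exact KL against the second-order approximation 0.5 * d^T F d
# for a small parameter perturbation d.
d = 0.01 * rng.normal(size=4)
true_kl = kl(p, softmax(theta + d))
approx_kl = 0.5 * d @ F @ d
print(f"KL(p_theta || p_theta+d)  = {true_kl:.8f}")
print(f"0.5 * d^T F d (2nd order) = {approx_kl:.8f}")
```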