The source is the same as for Policy Gradient.

If you think about how we normally optimize with the gradient, it actually doesn’t make much sense: some parameters change the output probabilities a lot more than others, yet vanilla SGD scales every parameter’s step by the same learning rate.
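A toy sketch of that mismatch (my own example, not from the source): a Bernoulli “policy” with p = sigmoid(theta). The exact same step in parameter space moves the output distribution a lot in one region and almost not at all in another, which is the sense in which treating all parameters uniformly is the wrong geometry.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bernoulli_kl(p, q):
    """KL( Bernoulli(p) || Bernoulli(q) )."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Same parameter step, very different change in the output distribution,
# depending on where in parameter space we start.
step = 0.5
for theta in [0.0, 5.0]:
    p_before = sigmoid(theta)
    p_after = sigmoid(theta + step)
    print(f"theta={theta:.1f}: p {p_before:.4f} -> {p_after:.4f}, "
          f"KL = {bernoulli_kl(p_before, p_after):.6f}")
# Around theta = 0 the step moves p substantially; around theta = 5 it barely does,
# even though the step in parameter space is identical.
```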

The optimization can be seen as a constrained optimization problem: in the 2D case, that’s basically asking how far we should step along the x-axis (the parameter) so that the first-order approximation, the gradient times the step (which gives us the change in y, the loss), descends the most. We do not want to go too far, because the first-order approximation is only valid within a small neighborhood.
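Written out (my own notation: loss L, step Δθ, trust radius ε), vanilla steepest descent is exactly this constrained problem, and the Euclidean ball is where the “all parameters are equal” assumption gets baked in:

```latex
% Steepest descent as a constrained problem
% (notation is mine: loss L, step \Delta\theta, trust radius \epsilon).
\[
\Delta\theta^{\ast}
  = \arg\min_{\Delta\theta} \; \nabla_\theta L(\theta)^{\top} \Delta\theta
  \quad \text{subject to} \quad \lVert \Delta\theta \rVert_2^2 \le \epsilon .
\]
% With this Euclidean ball the minimizer is the plain gradient step,
% \Delta\theta^{\ast} \propto -\nabla_\theta L(\theta):
% every parameter direction is treated as equally "expensive".
```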

Really, the constraint should be “the output distribution does not change too much”, not “the parameters in all dimensions do not change too much”. So here comes the KL divergence: if we Taylor expand it around the current parameters, the zeroth- and first-order terms are both 0, so it can be approximated by just the second-order term, ½ ΔθᵀFΔθ, where F is the Fisher information matrix, which is also the Riemannian metric on the manifold of probability distributions. We’ll see later how TRPO is based on this idea.
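A quick numerical check of that approximation (a sketch I wrote, not from the source), for a categorical distribution parameterized by softmax logits: the exact KL between the distribution before and after a small parameter step closely matches ½ ΔθᵀFΔθ with F the Fisher information matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    return np.sum(p * np.log(p / q))

# Categorical distribution with logits theta: p = softmax(theta).
theta = rng.normal(size=4)
p = softmax(theta)

# Fisher information F = E_{x~p}[ grad log p(x) grad log p(x)^T ].
# For softmax logits, grad_theta log p(x) = e_x - p (one-hot minus probabilities),
# which gives F = diag(p) - p p^T.
eye = np.eye(len(p))
F = sum(p[i] * np.outer(eye[i] - p, eye[i] - p) for i in range(len(p)))

# Compare the exact KL against the second-order approximation 0.5 * d^T F d
# for a small parameter perturbation d.
d = 0.01 * rng.normal(size=4)
true_kl = kl(p, softmax(theta + d))
approx_kl = 0.5 * d @ F @ d
print(f"KL(p_theta || p_theta+d)  = {true_kl:.8f}")
print(f"0.5 * d^T F d (2nd order) = {approx_kl:.8f}")
```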