Source: The 2023 version of CS285, lecture 5.
REINFORCE
Policy gradient is direct optimization: with Function approximation, we are just going to find the $\theta$ such that:

$$\theta^\star = \arg\max_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right] = \arg\max_\theta J(\theta)$$
So here $\tau$ is basically the whole trajectory. We are just trying to compute the derivative of $J(\theta)$ w.r.t. $\theta$. Now, there is this expectation here, and it's hard for us to get the gradient on that. Writing it out as an integral:

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[r(\tau)] = \int p_\theta(\tau)\, r(\tau)\, d\tau, \qquad \nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau$$
This is super nice, since we basically want to have a $p_\theta(\tau)$ inside that integral so it becomes an expectation again. This is because there's a convenient property (the log-derivative trick):

$$p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau) = p_\theta(\tau)\, \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} = \nabla_\theta p_\theta(\tau)$$

which gives

$$\nabla_\theta J(\theta) = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, r(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\right]$$
Let’s then look at that $\log p_\theta(\tau)$. What’s that exactly?

$$p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$

$$\log p_\theta(\tau) = \log p(s_1) + \sum_{t=1}^{T} \left[ \log \pi_\theta(a_t \mid s_t) + \log p(s_{t+1} \mid s_t, a_t) \right]$$
The $\log p(s_1)$ and $\log p(s_{t+1} \mid s_t, a_t)$ parts are canceled since they are not related to $\theta$. Finally, we get

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)\left(\sum_{t=1}^{T} r(s_t, a_t)\right)\right]$$
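To see the convenient property in isolation, here is a quick numerical check of the score-function estimator; a minimal sketch assuming a 1-D Gaussian $p_\theta(x) = \mathcal{N}(x;\theta,1)$ and $f(x) = x^2$ standing in for $p_\theta(\tau)$ and $r(\tau)$ (toy choices, not the lecture's RL setup):

```python
# Score-function (log-derivative) check on a toy problem:
# grad_theta E_{x ~ N(theta,1)}[x^2] estimated via E[f(x) * d/dtheta log p_theta(x)],
# compared with the closed form d/dtheta (theta^2 + 1) = 2*theta.
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.5, 1_000_000
x = rng.normal(theta, 1.0, size=n)
f = x ** 2
score = x - theta                      # d/dtheta log N(x; theta, 1)
print("score-function estimate:", np.mean(f * score))   # ~ 3.0
print("closed-form gradient:   ", 2 * theta)             # 3.0
```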
REINFORCE algorithm:
- sample $\{\tau^i\}$ from $\pi_\theta(a_t \mid s_t)$ (run the policy)
- estimate the gradient: $\nabla_\theta J(\theta) \approx \sum_i \left(\sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\right)\left(\sum_t r(s_t^i, a_t^i)\right)$
- take a step: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
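A minimal sketch of those three steps in numpy, with a made-up two-state MDP, a tabular softmax policy, and hand-picked step size and horizon (all assumptions for illustration):

```python
# Minimal REINFORCE sketch (numpy, softmax policy on a toy 2-state MDP).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, horizon = 2, 2, 10

# Toy dynamics/reward: action 1 usually moves toward state 1, which pays reward 1.
def step(s, a):
    s_next = a if rng.random() < 0.9 else 1 - a
    return s_next, float(s_next == 1)

theta = np.zeros((n_states, n_actions))    # logits of a tabular softmax policy

def policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

for _ in range(200):
    grad = np.zeros_like(theta)
    N = 10
    for _ in range(N):                     # 1. sample trajectories by running the policy
        s, glog, R = 0, np.zeros_like(theta), 0.0
        for t in range(horizon):
            p = policy(s)
            a = rng.choice(n_actions, p=p)
            glog[s] += np.eye(n_actions)[a] - p   # grad of log softmax: one-hot(a) - p
            s, r = step(s, a)
            R += r
        grad += glog * R                   # 2. (sum_t grad log pi) * (sum_t r)
    theta += 0.01 * grad / N               # 3. gradient ascent step
```

With these arbitrary settings the logits should drift toward action 1 in both states, since action 1 leads to the rewarding state.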
We are seeing $\nabla_\theta \log \pi_\theta$ a lot recently! See Score function note for more.
More interpretation
Now look at Maximum likelihood, say we are doing Imitation Learning:

$$\nabla_\theta J_{\mathrm{ML}}(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)$$
Well, that’s very much just our REINFORCE formula, but without the reward weighting $\sum_t r(s_t^i, a_t^i)$ bit. You can say policy gradient is weighting the experience with the reward received: trial and error.
Note that the Markov property is not exploited in the derivation.
Problems
Note that we are rating the whole trajectory via its total reward, which makes the estimator high-variance. If we have three sampled rewards, [-2, 1, 2], that should be equivalent to [1, 4, 5], since shifting the rewards by a constant does not change the expected gradient (see baseline later); but variance-wise they are different. With finite samples, these do mean different things to the estimation.
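As a small illustration of that variance point, here is a made-up one-step example (a Bernoulli policy with two rewards rather than the three samples above): shifting all rewards by a constant leaves the mean of the gradient estimate unchanged but changes its variance.

```python
# Tiny Monte Carlo check (assumed setup): one-step policy with pi(a=1) = 0.6,
# reward either [-2, 2] or the shifted version [1, 5].
import numpy as np

rng = np.random.default_rng(0)
p = 0.6                                    # pi_theta(a=1), theta is the logit
samples = 100_000

a = (rng.random(samples) < p).astype(float)
score = a - p                              # d/dtheta log pi_theta(a) for a Bernoulli-logit policy
for rewards in ([-2.0, 2.0], [1.0, 5.0]):  # same rewards, shifted by a constant (+3)
    r = np.where(a == 1, rewards[1], rewards[0])
    g = score * r
    print(f"rewards={rewards}: mean grad {g.mean():+.3f}, variance {g.var():.3f}")
```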
Causality
Causality: policy at time $t'$ cannot affect reward at time $t$ when $t < t'$. Exploiting this:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\left(\sum_{t'=t}^{T} r(s_{t'}^i, a_{t'}^i)\right) = \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, \hat{Q}_{i,t}$$
We are writing out the expectation because then we can clearly see that one trajectory really runs from $t = 1$ to time $T$; using the distributive law on the product of the two sums, and dropping the rewards from before time $t$ by causality, we get the reward-to-go $\hat{Q}_{i,t}$ above. That is a hint of the value function.
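A tiny helper for the reward-to-go term $\hat{Q}_{i,t}$ (the function name and example rewards are just for illustration):

```python
# Reward-to-go: given per-step rewards of one trajectory,
# return Q_hat[t] = sum_{t'=t}^{T} r[t'].
import numpy as np

def reward_to_go(rewards):
    # reverse cumulative sum: Q_hat[t] = r[t] + Q_hat[t+1]
    return np.cumsum(rewards[::-1])[::-1]

print(reward_to_go(np.array([1.0, 0.0, 2.0, 1.0])))  # -> [4. 3. 3. 1.]
```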
Baseline
Note that $\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, b\right] = \int \nabla_\theta p_\theta(\tau)\, b\, d\tau = b\, \nabla_\theta \int p_\theta(\tau)\, d\tau = b\, \nabla_\theta 1 = 0$, so we can just subtract a baseline $b$ from the rewards without biasing the gradient:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \log p_\theta(\tau_i)\,\left[r(\tau_i) - b\right]$$
We can get $b$ from the average, $b = \frac{1}{N}\sum_i r(\tau_i)$, or in other ways; for example, just find out what the best $b$ is by solving $\frac{d \operatorname{Var}}{d b} = 0$. We get

$$b = \frac{\mathbb{E}\left[\left(\nabla_\theta \log p_\theta(\tau)\right)^2 r(\tau)\right]}{\mathbb{E}\left[\left(\nabla_\theta \log p_\theta(\tau)\right)^2\right]},$$

the expected reward weighted by gradient magnitudes.
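For reference, a quick sketch of that variance calculation, writing $g(\tau) = \nabla_\theta \log p_\theta(\tau)$ (treated per gradient dimension):

$$\begin{aligned}
\operatorname{Var} &= \mathbb{E}\!\left[g(\tau)^2\,(r(\tau)-b)^2\right] - \big(\mathbb{E}\!\left[g(\tau)\,(r(\tau)-b)\right]\big)^2 \qquad \text{(second term equals } \mathbb{E}[g(\tau)\,r(\tau)]\text{, independent of } b\text{)} \\
\frac{d\operatorname{Var}}{db} &= -2\,\mathbb{E}\!\left[g(\tau)^2\, r(\tau)\right] + 2b\,\mathbb{E}\!\left[g(\tau)^2\right] = 0
\;\;\Longrightarrow\;\;
b = \frac{\mathbb{E}\!\left[g(\tau)^2\, r(\tau)\right]}{\mathbb{E}\!\left[g(\tau)^2\right]}
\end{aligned}$$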
Off Policy
Refer to Off-Policy for some background; one way we know to make an on-policy method work off-policy is Importance sampling corrections:

$$J(\theta') = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\frac{p_{\theta'}(\tau)}{p_\theta(\tau)}\, r(\tau)\right], \qquad \frac{p_{\theta'}(\tau)}{p_\theta(\tau)} = \prod_{t=1}^{T} \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}$$

Differentiating w.r.t. the new parameters $\theta'$:

$$\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\frac{p_{\theta'}(\tau)}{p_\theta(\tau)}\, \nabla_{\theta'} \log p_{\theta'}(\tau)\, r(\tau)\right]$$

and expanding while applying causality:

$$\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^{T} \nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t)\left(\prod_{t'=1}^{t} \frac{\pi_{\theta'}(a_{t'} \mid s_{t'})}{\pi_\theta(a_{t'} \mid s_{t'})}\right)\left(\sum_{t'=t}^{T} r(s_{t'}, a_{t'})\prod_{t''=t}^{t'} \frac{\pi_{\theta'}(a_{t''} \mid s_{t''})}{\pi_\theta(a_{t''} \mid s_{t''})}\right)\right]$$
The last one is taking causality into account. The last product $\prod_{t''=t}^{t'} \frac{\pi_{\theta'}(a_{t''} \mid s_{t''})}{\pi_\theta(a_{t''} \mid s_{t''})}$ (the future importance weights on the reward-to-go) can be crossed out for simplicity, which makes it more like Policy & value iteration.
In practice the remaining big product $\prod_{t'=1}^{t} \frac{\pi_{\theta'}(a_{t'} \mid s_{t'})}{\pi_\theta(a_{t'} \mid s_{t'})}$ is not great: each factor tends to be smaller than 1 (we are not selecting actions according to the policy we are evaluating), so the product gets exponentially small in $t$ as $t$ grows.
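A quick numerical illustration of that collapse (the log-normal ratios are entirely made up; they just mimic per-step ratios that average to 1 but typically sit below 1):

```python
# Running product of per-step importance ratios pi_theta'(a|s)/pi_theta(a|s).
import numpy as np

rng = np.random.default_rng(0)
T = 100
# log-normal with E[w] = 1 but median exp(-0.125) < 1
w = rng.lognormal(mean=-0.125, sigma=0.5, size=T)
cumprod = np.cumprod(w)
for t in (1, 10, 50, 100):
    print(f"t={t:3d}: product of ratios = {cumprod[t-1]:.3e}")
```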
Now we can write the objective a bit differently to solve this: let’s expand that expectation into its sampled form.
on-policy policy gradient, sampling form:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\, \hat{Q}_{i,t}$$
This can be understood as sampling an action at each step given the state at that step. In other words, we take the on-policy gradient expressed as an expectation over the marginal state-action distribution at each timestep, and then apply importance sampling on each marginal separately, correcting from $p_\theta(s_{i,t}, a_{i,t})$ to $p_{\theta'}(s_{i,t}, a_{i,t})$.
off-policy policy gradient (dropping the state-marginal ratio $p_{\theta'}(s_{i,t}) / p_\theta(s_{i,t})$, which we choose to ignore):

$$\nabla_{\theta'} J(\theta') \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \frac{\pi_{\theta'}(a_{i,t} \mid s_{i,t})}{\pi_\theta(a_{i,t} \mid s_{i,t})}\, \nabla_{\theta'} \log \pi_{\theta'}(a_{i,t} \mid s_{i,t})\, \hat{Q}_{i,t}$$
Implementing
Well, when doing this in an autodiff framework we’ll work directly with a loss $\tilde{J}(\theta)$ (and let the framework compute $\nabla_\theta$), not with $\nabla_\theta J(\theta)$ itself, but our modifications are all on $\nabla_\theta J(\theta)$, so now we’ll need to have a “pseudo-loss” whose gradient is the policy gradient:

$$\tilde{J}(\theta) = \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \log \pi_\theta(a_{i,t} \mid s_{i,t})\, \hat{Q}_{i,t}$$
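A minimal PyTorch sketch of that pseudo-loss, assuming a discrete-action `policy_net` and batch tensors `obs`, `actions`, `q_hat` (all hypothetical names; `q_hat` holds $\hat{Q}_{i,t}$, possibly baseline-subtracted):

```python
# Pseudo-loss: autograd of this scalar reproduces the policy gradient.
import torch

def pseudo_loss(policy_net, obs, actions, q_hat):
    logits = policy_net(obs)                                   # [batch, n_actions]
    log_probs = torch.log_softmax(logits, dim=-1)
    logp_a = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    # negate because optimizers minimize while we want to ascend J~(theta);
    # q_hat is detached so gradients only flow through log pi_theta
    return -(logp_a * q_hat.detach()).mean()

# usage sketch:
# loss = pseudo_loss(policy_net, obs, actions, q_hat)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```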
- Use much larger batches
- Tweaking the learning rate is very hard
See also Natural Policy Gradient