Source: The 2023 version of CS285, lecture 5.
REINFORCE
Policy gradient is direct optimization: with Function approximation, we are just going to find the $\theta$ such that:

$$\theta^\star = \arg\max_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right] = \arg\max_\theta J(\theta)$$
So here $\tau$ is basically the whole trajectory. We are just trying to compute the derivative of $J(\theta)$ w.r.t. $\theta$. Now, there is this expectation here, and it's hard for us to get the gradient on that. Writing it out as an integral:

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[r(\tau)] = \int p_\theta(\tau)\, r(\tau)\, d\tau, \qquad \nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau$$
This is super nice, since we basically want to have a $p_\theta(\tau)$ inside that integral so it becomes an expectation again. This is because there's a convenient property (the log-derivative trick):

$$p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau) = p_\theta(\tau)\, \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} = \nabla_\theta p_\theta(\tau)$$

which gives

$$\nabla_\theta J(\theta) = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, r(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\right]$$
Let’s then look at that $\log p_\theta(\tau)$. What’s that exactly?

$$p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$

$$\log p_\theta(\tau) = \log p(s_1) + \sum_{t=1}^{T} \left[ \log \pi_\theta(a_t \mid s_t) + \log p(s_{t+1} \mid s_t, a_t) \right]$$
The $\log p(s_1)$ and $\log p(s_{t+1} \mid s_t, a_t)$ parts are canceled since they are not related to $\theta$. Finally, we get

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)\left(\sum_{t=1}^{T} r(s_t, a_t)\right)\right]$$
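To see the convenient property in isolation, here is a quick numerical check of the score-function estimator; a minimal sketch assuming a 1-D Gaussian $p_\theta(x) = \mathcal{N}(x;\theta,1)$ and $f(x) = x^2$ standing in for $p_\theta(\tau)$ and $r(\tau)$ (toy choices, not the lecture's RL setup):

```python
# Score-function (log-derivative) check on a toy problem:
# grad_theta E_{x ~ N(theta,1)}[x^2] estimated via E[f(x) * d/dtheta log p_theta(x)],
# compared with the closed form d/dtheta (theta^2 + 1) = 2*theta.
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.5, 1_000_000
x = rng.normal(theta, 1.0, size=n)
f = x ** 2
score = x - theta                      # d/dtheta log N(x; theta, 1)
print("score-function estimate:", np.mean(f * score))   # ~ 3.0
print("closed-form gradient:   ", 2 * theta)             # 3.0
```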
REINFORCE algorithm:
- sample $\{\tau^i\}$ from $\pi_\theta(a_t \mid s_t)$ (run the policy)
- estimate the gradient: $\nabla_\theta J(\theta) \approx \sum_i \left(\sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\right)\left(\sum_t r(s_t^i, a_t^i)\right)$
- take a step: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
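A minimal sketch of those three steps in numpy, with a made-up two-state MDP, a tabular softmax policy, and hand-picked step size and horizon (all assumptions for illustration):

```python
# Minimal REINFORCE sketch (numpy, softmax policy on a toy 2-state MDP).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, horizon = 2, 2, 10

# Toy dynamics/reward: action 1 usually moves toward state 1, which pays reward 1.
def step(s, a):
    s_next = a if rng.random() < 0.9 else 1 - a
    return s_next, float(s_next == 1)

theta = np.zeros((n_states, n_actions))    # logits of a tabular softmax policy

def policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

for _ in range(200):
    grad = np.zeros_like(theta)
    N = 10
    for _ in range(N):                     # 1. sample trajectories by running the policy
        s, glog, R = 0, np.zeros_like(theta), 0.0
        for t in range(horizon):
            p = policy(s)
            a = rng.choice(n_actions, p=p)
            glog[s] += np.eye(n_actions)[a] - p   # grad of log softmax: one-hot(a) - p
            s, r = step(s, a)
            R += r
        grad += glog * R                   # 2. (sum_t grad log pi) * (sum_t r)
    theta += 0.01 * grad / N               # 3. gradient ascent step
```

With these arbitrary settings the logits should drift toward action 1 in both states, since action 1 leads to the rewarding state.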
We are seeing $\nabla_\theta \log \pi_\theta$ a lot recently! See Score function note for more.
More interpretation
Now look at Maximum likelihood, say we are doing Imitation Learning:

$$\nabla_\theta J_{\mathrm{ML}}(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)$$
Well, that’s very much just our REINFORCE formula, but without the reward weighting $\sum_t r(s_t^i, a_t^i)$ bit. You can say policy gradient is weighting the experience with the reward received: trial and error.
Note that the Markov property is not exploited in the derivation.
Problems
Note that we are rating the whole trajectory via its total reward, which makes the estimator high-variance. If we have three sampled rewards, [-2, 1, 2], that should be equivalent to [1, 4, 5], since shifting the rewards by a constant does not change the expected gradient (see baseline later); but variance-wise they are different. With finite samples, these do mean different things to the estimation.
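As a small illustration of that variance point, here is a made-up one-step example (a Bernoulli policy with two rewards rather than the three samples above): shifting all rewards by a constant leaves the mean of the gradient estimate unchanged but changes its variance.

```python
# Tiny Monte Carlo check (assumed setup): one-step policy with pi(a=1) = 0.6,
# reward either [-2, 2] or the shifted version [1, 5].
import numpy as np

rng = np.random.default_rng(0)
p = 0.6                                    # pi_theta(a=1), theta is the logit
samples = 100_000

a = (rng.random(samples) < p).astype(float)
score = a - p                              # d/dtheta log pi_theta(a) for a Bernoulli-logit policy
for rewards in ([-2.0, 2.0], [1.0, 5.0]):  # same rewards, shifted by a constant (+3)
    r = np.where(a == 1, rewards[1], rewards[0])
    g = score * r
    print(f"rewards={rewards}: mean grad {g.mean():+.3f}, variance {g.var():.3f}")
```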
Causality
Causality: policy at time $t'$ cannot affect reward at time $t$ when $t < t'$. Exploiting this:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\left(\sum_{t'=t}^{T} r(s_{t'}^i, a_{t'}^i)\right) = \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, \hat{Q}_{i,t}$$
We are writing out the expectation because then we can clearly see that one trajectory really runs from $t = 1$ to time $T$; using the distributive law on the product of the two sums, and dropping the rewards from before time $t$ by causality, we get the reward-to-go $\hat{Q}_{i,t}$ above. That is a hint of the value function.
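A tiny helper for the reward-to-go term $\hat{Q}_{i,t}$ (the function name and example rewards are just for illustration):

```python
# Reward-to-go: given per-step rewards of one trajectory,
# return Q_hat[t] = sum_{t'=t}^{T} r[t'].
import numpy as np

def reward_to_go(rewards):
    # reverse cumulative sum: Q_hat[t] = r[t] + Q_hat[t+1]
    return np.cumsum(rewards[::-1])[::-1]

print(reward_to_go(np.array([1.0, 0.0, 2.0, 1.0])))  # -> [4. 3. 3. 1.]
```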
Baseline
Note that $\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, b\right] = \int \nabla_\theta p_\theta(\tau)\, b\, d\tau = b\, \nabla_\theta \int p_\theta(\tau)\, d\tau = b\, \nabla_\theta 1 = 0$, so we can just subtract a baseline $b$ from the rewards without biasing the gradient:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \log p_\theta(\tau_i)\,\left[r(\tau_i) - b\right]$$
We can get $b$ from the average, $b = \frac{1}{N}\sum_i r(\tau_i)$, or in other ways; for example, just find out what the best $b$ is by solving $\frac{d \operatorname{Var}}{d b} = 0$. We get

$$b = \frac{\mathbb{E}\left[\left(\nabla_\theta \log p_\theta(\tau)\right)^2 r(\tau)\right]}{\mathbb{E}\left[\left(\nabla_\theta \log p_\theta(\tau)\right)^2\right]},$$

the expected reward weighted by gradient magnitudes.
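For reference, a quick sketch of that variance calculation, writing $g(\tau) = \nabla_\theta \log p_\theta(\tau)$ (treated per gradient dimension):

$$\begin{aligned}
\operatorname{Var} &= \mathbb{E}\!\left[g(\tau)^2\,(r(\tau)-b)^2\right] - \big(\mathbb{E}\!\left[g(\tau)\,(r(\tau)-b)\right]\big)^2 \qquad \text{(second term equals } \mathbb{E}[g(\tau)\,r(\tau)]\text{, independent of } b\text{)} \\
\frac{d\operatorname{Var}}{db} &= -2\,\mathbb{E}\!\left[g(\tau)^2\, r(\tau)\right] + 2b\,\mathbb{E}\!\left[g(\tau)^2\right] = 0
\;\;\Longrightarrow\;\;
b = \frac{\mathbb{E}\!\left[g(\tau)^2\, r(\tau)\right]}{\mathbb{E}\!\left[g(\tau)^2\right]}
\end{aligned}$$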
Off Policy
Refer to Off-Policy for some background; one way we know to make an on-policy method work off-policy is Importance sampling corrections:

$$J(\theta') = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\frac{p_{\theta'}(\tau)}{p_\theta(\tau)}\, r(\tau)\right], \qquad \frac{p_{\theta'}(\tau)}{p_\theta(\tau)} = \prod_{t=1}^{T} \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}$$

Differentiating w.r.t. the new parameters $\theta'$:

$$\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\frac{p_{\theta'}(\tau)}{p_\theta(\tau)}\, \nabla_{\theta'} \log p_{\theta'}(\tau)\, r(\tau)\right]$$

and expanding while applying causality:

$$\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^{T} \nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t)\left(\prod_{t'=1}^{t} \frac{\pi_{\theta'}(a_{t'} \mid s_{t'})}{\pi_\theta(a_{t'} \mid s_{t'})}\right)\left(\sum_{t'=t}^{T} r(s_{t'}, a_{t'})\prod_{t''=t}^{t'} \frac{\pi_{\theta'}(a_{t''} \mid s_{t''})}{\pi_\theta(a_{t''} \mid s_{t''})}\right)\right]$$
The last one is taking causality into account. The last product $\prod_{t''=t}^{t'} \frac{\pi_{\theta'}(a_{t''} \mid s_{t''})}{\pi_\theta(a_{t''} \mid s_{t''})}$ (the future importance weights on the reward-to-go) can be crossed out for simplicity, which makes it more like Policy & value iteration.
In practice the remaining big product $\prod_{t'=1}^{t} \frac{\pi_{\theta'}(a_{t'} \mid s_{t'})}{\pi_\theta(a_{t'} \mid s_{t'})}$ is not great: each factor tends to be smaller than 1 (we are not selecting actions according to the policy we are evaluating), so the product gets exponentially small in $t$ as $t$ grows.
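A quick numerical illustration of that collapse (the log-normal ratios are entirely made up; they just mimic per-step ratios that average to 1 but typically sit below 1):

```python
# Running product of per-step importance ratios pi_theta'(a|s)/pi_theta(a|s).
import numpy as np

rng = np.random.default_rng(0)
T = 100
# log-normal with E[w] = 1 but median exp(-0.125) < 1
w = rng.lognormal(mean=-0.125, sigma=0.5, size=T)
cumprod = np.cumprod(w)
for t in (1, 10, 50, 100):
    print(f"t={t:3d}: product of ratios = {cumprod[t-1]:.3e}")
```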
Now we can write the objective a bit differently to solve this: let’s expand that expectation into its sampled form.
on-policy policy gradient, sampling form:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\, \hat{Q}_{i,t}$$
This can be understood as sampling an action at each step given the state at that step. In other words, we take the on-policy gradient expressed as an expectation over the marginal state-action distribution at each timestep, and then apply importance sampling on each marginal separately, correcting from $p_\theta(s_{i,t}, a_{i,t})$ to $p_{\theta'}(s_{i,t}, a_{i,t})$.
off-policy policy gradient (dropping the state-marginal ratio $p_{\theta'}(s_{i,t}) / p_\theta(s_{i,t})$, which we choose to ignore):

$$\nabla_{\theta'} J(\theta') \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \frac{\pi_{\theta'}(a_{i,t} \mid s_{i,t})}{\pi_\theta(a_{i,t} \mid s_{i,t})}\, \nabla_{\theta'} \log \pi_{\theta'}(a_{i,t} \mid s_{i,t})\, \hat{Q}_{i,t}$$
Implementing
Well, when doing this in an autodiff framework we’ll work directly with a loss $\tilde{J}(\theta)$ (and let the framework compute $\nabla_\theta$), not with $\nabla_\theta J(\theta)$ itself, but our modifications are all on $\nabla_\theta J(\theta)$, so now we’ll need to have a “pseudo-loss” whose gradient is the policy gradient:

$$\tilde{J}(\theta) = \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \log \pi_\theta(a_{i,t} \mid s_{i,t})\, \hat{Q}_{i,t}$$
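A minimal PyTorch sketch of that pseudo-loss, assuming a discrete-action `policy_net` and batch tensors `obs`, `actions`, `q_hat` (all hypothetical names; `q_hat` holds $\hat{Q}_{i,t}$, possibly baseline-subtracted):

```python
# Pseudo-loss: autograd of this scalar reproduces the policy gradient.
import torch

def pseudo_loss(policy_net, obs, actions, q_hat):
    logits = policy_net(obs)                                   # [batch, n_actions]
    log_probs = torch.log_softmax(logits, dim=-1)
    logp_a = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    # negate because optimizers minimize while we want to ascend J~(theta);
    # q_hat is detached so gradients only flow through log pi_theta
    return -(logp_a * q_hat.detach()).mean()

# usage sketch:
# loss = pseudo_loss(policy_net, obs, actions, q_hat)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```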
- Use much larger batches
- Tweaking the learning rate is very hard
See also Natural Policy Gradient