We can approximate the true value of a state by using function approximation, and we can find the best approximation using classic optimization methods.
Approximate state value function
Like if we use the very basic SGD, we just optimize this,

$$J(\mathbf{w}) = \mathbb{E}_\pi\!\left[\big(v_\pi(S) - \hat{v}(S, \mathbf{w})\big)^2\right],$$

and we can sample the gradient by

$$\Delta \mathbf{w} = \alpha\,\big(v_\pi(S) - \hat{v}(S, \mathbf{w})\big)\,\nabla_{\mathbf{w}}\hat{v}(S, \mathbf{w}).$$
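A minimal sketch of this sampled update, assuming a linear value function $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$ and using the Monte Carlo return $G_t$ as the sample of $v_\pi(S_t)$ (the function name and step size are placeholders):

```python
import numpy as np

def sgd_value_update(w, x_s, target, alpha=0.01):
    """One sampled SGD step on the squared value error.

    w      : weight vector, shape (d,)
    x_s    : feature vector x(S_t), shape (d,)
    target : sample of v_pi(S_t), e.g. the Monte Carlo return G_t
    """
    v_hat = w @ x_s                             # v_hat(S_t, w) in the linear case
    return w + alpha * (target - v_hat) * x_s   # grad_w v_hat = x(S_t)
```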
For encoding the state, we can use something like coarse coding (for a simple linear approximation).
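A rough sketch of what coarse coding could look like for a 1-D state, using overlapping intervals as receptive fields (the number of fields, their width, and the state range below are made-up values):

```python
import numpy as np

def coarse_code(s, low=0.0, high=1.0, n_fields=10, width=0.25):
    """Binary coarse coding of a scalar state s.

    Feature i is 1 if s lies inside the i-th overlapping interval
    (receptive field), else 0.
    """
    centers = np.linspace(low, high, n_fields)
    return (np.abs(s - centers) <= width / 2).astype(float)

# coarse_code(0.33) -> sparse binary feature vector x(s) for the linear model above
```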
The $v_\pi(S)$ here can be replaced by the TD target $R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w})$.
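And the same update with the TD target plugged in (a sketch, reusing the hypothetical linear setup above; note we treat the target as a constant and do not differentiate through $\hat{v}(S_{t+1}, \mathbf{w})$):

```python
def td0_value_update(w, x_s, reward, x_s_next, gamma=0.99, alpha=0.01):
    """One semi-gradient TD(0) step for a linear value function."""
    td_target = reward + gamma * (w @ x_s_next)   # bootstrapped target
    td_error = td_target - (w @ x_s)
    return w + alpha * td_error * x_s             # gradient taken only through v_hat(S_t, w)
```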
Approximate state-action value function
There can be two ways to do this: action-in, or action-out. For example (the linear case),
- Action in: $\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s, a)$
- Action out: $\hat{\mathbf{q}}(s, \mathbf{w}) = W\,\mathbf{x}(s)$, such that $\hat{q}(s, a, \mathbf{w}) = \big[W\,\mathbf{x}(s)\big]_a$.

I think this can be viewed just from a compute-complexity point of view:
- Action in: $\mathbf{x}(s, a)$ has dimension $d$. Thus $\mathbf{w}$ also has dimension $d$, and we take their dot product. The same $\mathbf{w}$ is used across all $(s, a)$ pairs, but each state/action pair has different features $\mathbf{x}(s, a)$.
- Action out: $\mathbf{x}(s)$ has dimension $d$ and $W$ has dimension $|\mathcal{A}| \times d$. We take one dot product per action: a multi-head parameterization. A different weight vector (row of $W$) is used for each action, but all actions from the same state share the same features $\mathbf{x}(s)$.
Action-in is better for continuous actions (look at action-out and you can see why: it needs one output per action). Action-out is more efficient for (small) discrete action spaces, since one pass gives the values of all actions (and I guess it is also more expressive, as each action gets its own weights).
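A sketch contrasting the two parameterizations in the linear case (the feature dimension and number of actions are arbitrary; `x_sa` and `x_s` stand for whatever feature encodings are chosen):

```python
import numpy as np

d, n_actions = 8, 4                      # arbitrary sizes for illustration

# Action-in: one shared weight vector; features depend on (s, a).
w = np.random.randn(d)
def q_action_in(x_sa, w):
    """x_sa = x(s, a), shape (d,). Returns the scalar q(s, a)."""
    return w @ x_sa                      # same w for every (s, a) pair

# Action-out: one weight vector per action (the rows of W); features depend on s only.
W = np.random.randn(n_actions, d)
def q_action_out(x_s, W):
    """x_s = x(s), shape (d,). Returns q(s, a) for all actions at once."""
    return W @ x_s                       # multi-head: row a of W scores action a
```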
Convergence and Divergence
MC: this is easy. It can be seen as simple regression: we have a ground-truth target (the return $G_t$), and there is a closed-form solution in the linear regression case. Not so simple for TD.
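For instance (a sketch of the linear case, writing $\mathbf{x}_t = \mathbf{x}(S_t)$ and using the sampled returns $G_t$ as regression targets), the closed-form least-squares solution is

$$\mathbf{w}^{*} = \Big(\sum_{t} \mathbf{x}_t \mathbf{x}_t^{\top}\Big)^{-1} \sum_{t} \mathbf{x}_t\,G_t.$$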
Let $\overline{\mathrm{VE}}(\mathbf{w})$ denote the value error:

$$\overline{\mathrm{VE}}(\mathbf{w}) = \sum_{s} \mu(s)\,\big[v_\pi(s) - \hat{v}(s, \mathbf{w})\big]^2,$$

where $\mu$ is the state distribution under the policy.
The Monte Carlo solution minimises the value error. Theorem: linear TD(0) converges to a fixed point $\mathbf{w}_{TD}$ whose value error satisfies

$$\overline{\mathrm{VE}}(\mathbf{w}_{TD}) \le \frac{1}{1-\gamma}\,\min_{\mathbf{w}} \overline{\mathrm{VE}}(\mathbf{w}).$$

So the value error of the TD solution is bounded (within a factor of the best achievable).
Still, the TD update is not a true gradient update: $\mathbf{w}$ itself appears on the other side, inside the target $R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w})$, and we do not differentiate through it (a semi-gradient update).
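To make this concrete, here is the comparison (a sketch; the "true" gradient shown is the full gradient of the one-step squared TD error $\tfrac{1}{2}\delta_t^2$):

$$\delta_t = R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})$$

$$\text{semi-gradient TD:}\quad \Delta\mathbf{w} = \alpha\,\delta_t\,\nabla_{\mathbf{w}}\hat{v}(S_t, \mathbf{w})$$

$$\text{full gradient of } \tfrac{1}{2}\delta_t^2:\quad \Delta\mathbf{w} = \alpha\,\delta_t\,\big(\nabla_{\mathbf{w}}\hat{v}(S_t, \mathbf{w}) - \gamma\,\nabla_{\mathbf{w}}\hat{v}(S_{t+1}, \mathbf{w})\big)$$

The extra $-\gamma\,\nabla_{\mathbf{w}}\hat{v}(S_{t+1}, \mathbf{w})$ term is exactly what the semi-gradient update drops.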
Deadly triad
Also a name from Sutton and Barto’s textbook. Algorithms that combine
- bootstrapping
- off-policy learning
- function approximation … may diverge.
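As a sketch of how the triad can bite, here is a small divergence demo loosely following Baird's counterexample from the textbook (the feature vectors, policies, step size, and initial weights below are the values I recall, so treat them as assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha = 0.99, 0.01

# 7 states, 8-dim linear features (Baird-style): states 0..5 have a 2 in their own
# slot and a 1 in the last slot; state 6 has a 1 in slot 6 and a 2 in the last slot.
X = np.zeros((7, 8))
for i in range(6):
    X[i, i], X[i, 7] = 2.0, 1.0
X[6, 6], X[6, 7] = 1.0, 2.0

w = np.array([1., 1., 1., 1., 1., 1., 10., 1.])     # commonly used initial weights

# Behaviour policy: "dashed" (jump to a random state 0..5) w.p. 6/7,
#                   "solid"  (jump to state 6)            w.p. 1/7.
# Target policy: always "solid". All rewards are 0.
s = rng.integers(7)
for step in range(10_000):
    if rng.random() < 6 / 7:
        s_next, rho = rng.integers(6), 0.0          # pi(dashed|s) = 0 -> ratio 0
    else:
        s_next, rho = 6, 7.0                        # pi(solid|s)/b(solid|s) = 1/(1/7)
    # Off-policy semi-gradient TD(0): bootstrapping + off-policy + linear FA.
    td_error = 0.0 + gamma * (w @ X[s_next]) - (w @ X[s])
    w += alpha * rho * td_error * X[s]
    s = s_next

print(np.linalg.norm(w))    # keeps growing with more steps: the weights diverge
```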