We can approximate the true value of a state using function approximation, and we can fit the approximation with classic optimization methods.

Approximate state value function

For example, if we use very basic SGD, we just minimize the mean-squared error to the true value,

$$J(\mathbf{w}) = \mathbb{E}_\pi\!\left[\big(v_\pi(S) - \hat{v}(S, \mathbf{w})\big)^2\right],$$

and we can sample the gradient by

$$\Delta \mathbf{w} = \alpha \big(v_\pi(S_t) - \hat{v}(S_t, \mathbf{w})\big)\, \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w}).$$

For encoding the state, we can use something like coarse coding (for a simple linear approximation).
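As a rough illustration (mine, not from the original notes), here is a minimal coarse-coding encoder for a 1-D state: each binary feature indicates whether the state falls inside one of several overlapping intervals; the number, placement, and width of the intervals are arbitrary choices for the example.

```python
import numpy as np

def coarse_code(state, centers, width):
    """Binary features: 1 if `state` lies inside the interval around each center.

    Overlapping intervals give the "coarse coding" effect: nearby states
    share many active features, so generalization is local and smooth.
    """
    return (np.abs(state - centers) <= width / 2).astype(float)

# Example: 10 overlapping intervals covering states in [0, 1].
centers = np.linspace(0.0, 1.0, 10)
width = 0.3  # each interval overlaps its neighbours

x = coarse_code(0.42, centers, width)
print(x)  # e.g. [0. 0. 0. 1. 1. 1. 0. 0. 0. 0.]
```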

The target $v_\pi(S_t)$ here can be replaced by the Monte Carlo return $G_t$ or by the TD target $R_{t+1} + \gamma\, \hat{v}(S_{t+1}, \mathbf{w})$.
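A minimal sketch of these updates for a linear $\hat{v}$, assuming some feature function (e.g. the coarse coding above) has already produced the vectors `x`; the function names and signatures are my own.

```python
import numpy as np

def v_hat(w, x):
    """Linear value estimate: v̂(s, w) = wᵀ x(s)."""
    return w @ x

def mc_update(w, x, G, alpha):
    """Gradient Monte Carlo update: the target is the observed return G_t."""
    return w + alpha * (G - v_hat(w, x)) * x   # ∇_w v̂(s, w) = x(s) for a linear v̂

def td0_update(w, x, r, x_next, gamma, alpha, done=False):
    """Semi-gradient TD(0) update: the target is r + γ v̂(s', w)."""
    target = r if done else r + gamma * v_hat(w, x_next)
    return w + alpha * (target - v_hat(w, x)) * x
```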

Approximate state-action value function

There are two ways to do this: action-in or action-out. For example (in the linear case),

  • Action in: $\hat{q}(s, a, \mathbf{w}) = \mathbf{x}(s, a)^\top \mathbf{w}$.
  • Action out: $\hat{q}(s, \cdot, W) = W\, \mathbf{x}(s)$, such that $\hat{q}(s, a, W) = [W\, \mathbf{x}(s)]_a$. I think this can be viewed just from a compute-complexity point of view.
  • Action in: $\mathbf{x}(s, a)$ has dimension $d$. Thus $\mathbf{w}$ also has dimension $d$, and we take a dot product (element-wise multiplication, then a sum). The same $\mathbf{w}$ is used across all state-action pairs, but each state-action pair has different features.
  • Action out: $\mathbf{x}(s)$ has dimension $d$ and $W$ has dimension $|\mathcal{A}| \times d$, so we take one dot product per row. This is a multi-head parameterization: we use a different row $\mathbf{w}_a$ for each action, but all actions from the same state share the same feature vector $\mathbf{x}(s)$.

Action-in is better for continuous actions (look at action-out and you can see why: you cannot have one output head per point of a continuum). Action-out is more efficient for (small) discrete action spaces, since a single pass over the state features gives the values of all actions at once.
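A small sketch of the two parameterizations in the linear case (my own illustration, with made-up dimensions):

```python
import numpy as np

d, n_actions = 8, 4
rng = np.random.default_rng(0)

# --- Action-in: q̂(s, a, w) = x(s, a)ᵀ w --------------------------------
w = rng.normal(size=d)                     # one weight vector, shared by all (s, a)

def q_action_in(x_sa, w):
    """x_sa: joint state-action features of dimension d."""
    return x_sa @ w

# --- Action-out: q̂(s, ·, W) = W x(s) -----------------------------------
W = rng.normal(size=(n_actions, d))        # one weight row per action ("multi-head")

def q_action_out(x_s, W):
    """x_s: state features of dimension d; returns all action values at once."""
    return W @ x_s

x_sa = rng.normal(size=d)                  # features of one particular (s, a) pair
print(q_action_in(x_sa, w))                # a single action value
x_s = rng.normal(size=d)                   # features of a state
print(q_action_out(x_s, W))                # values of all 4 actions in one pass
print(q_action_out(x_s, W).argmax())       # greedy action, e.g. for Q-learning
```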

Convergence and Divergence

MC: this is easy. It can be seen as simple regression: we have a (noisy) ground-truth target (the return $G_t$), and there is a closed-form solution in the linear-regression case. Not so simple for TD.
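Concretely, the closed-form solution in the linear case is just ordinary least squares on the observed returns. A sketch with made-up data, where `X` stacks the feature vectors of visited states and `G` holds the corresponding returns:

```python
import numpy as np

# Random placeholders standing in for actual rollout data:
# X[t] = x(S_t) (feature vector of a visited state), G[t] = the return G_t.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
G = rng.normal(size=100)

# Ordinary least squares: w minimises Σ_t (G_t - wᵀ x(S_t))²,
# i.e. w = (XᵀX)⁻¹ Xᵀ G (computed here with a numerically stable solver).
w, *_ = np.linalg.lstsq(X, G, rcond=None)
```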

Let $\overline{\mathrm{VE}}(\mathbf{w})$ denote the value error:

$$\overline{\mathrm{VE}}(\mathbf{w}) = \sum_{s} \mu(s)\, \big[ v_\pi(s) - \hat{v}(s, \mathbf{w}) \big]^2,$$

where $\mu$ is the on-policy state distribution.

The Monte Carlo solution minimises the value error. For TD, there is a theorem (in Sutton and Barto's textbook): the TD fixed point $\mathbf{w}_{\mathrm{TD}}$ satisfies

$$\overline{\mathrm{VE}}(\mathbf{w}_{\mathrm{TD}}) \le \frac{1}{1-\gamma}\, \min_{\mathbf{w}} \overline{\mathrm{VE}}(\mathbf{w}).$$

So the value error of the TD solution is bounded (within a factor of $\frac{1}{1-\gamma}$ of the best achievable).

Still, the TD update is not a true gradient update: the target $R_{t+1} + \gamma\, \hat{v}(S_{t+1}, \mathbf{w})$ includes $\mathbf{w}$ itself on the other side, but we do not differentiate through it (it is a semi-gradient update).
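A small PyTorch-style sketch (mine, not from the notes) of how this is usually handled in practice: the target is computed under `no_grad`, so no gradient flows through it, which is exactly the sense in which this is a semi-gradient rather than a true gradient.

```python
import torch

d = 8
v = torch.nn.Linear(d, 1, bias=False)          # linear v̂(s, w) = wᵀ x(s)
opt = torch.optim.SGD(v.parameters(), lr=0.1)

def td0_step(x, r, x_next, gamma, done=False):
    """One semi-gradient TD(0) step on a single transition."""
    with torch.no_grad():                      # the target is treated as a constant
        target = r + (0.0 if done else gamma * v(x_next).squeeze())
    delta = target - v(x).squeeze()            # gradient flows only through v(x)
    loss = 0.5 * delta ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()
    return float(delta)
```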

Deadly triad

Also a name from Sutton and Barto’s textbook. Algorithms that combine

  • bootstrapping
  • off-policy learning
  • function approximation

… may diverge.
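A tiny numeric sketch (my own, loosely based on the classic two-state example from Sutton and Barto's off-policy divergence discussion) of how these three ingredients together can make the weights blow up:

```python
# Two states with linear features x = 1 and x = 2, sharing a single weight w.
# We repeatedly apply an off-policy semi-gradient TD(0) update only on the
# transition from the first state to the second (reward 0).
w, alpha, gamma = 1.0, 0.1, 0.99
x, x_next = 1.0, 2.0

for step in range(100):
    delta = 0.0 + gamma * (w * x_next) - w * x   # TD error on this transition
    w += alpha * delta * x                       # semi-gradient update
print(w)  # grows without bound: each step multiplies w by (1 + alpha * (2*gamma - 1))
```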