We can approximate the true value of a state using function approximation, and we can fit the approximation with classic optimization methods.

Approximate state value function

For example, if we use very basic SGD, we just minimize the mean-squared error to the true value,

$$J(\mathbf{w}) = \mathbb{E}_\pi\!\left[\big(v_\pi(S) - \hat{v}(S, \mathbf{w})\big)^2\right],$$

and we can sample the gradient by

$$\Delta \mathbf{w} = \alpha \big(v_\pi(S_t) - \hat{v}(S_t, \mathbf{w})\big)\, \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w}).$$

For encoding the state, we can use something like coarse coding (for a simple linear approximation).
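As a rough illustration (mine, not from the original notes), here is a minimal coarse-coding encoder for a 1-D state: each binary feature indicates whether the state falls inside one of several overlapping intervals; the number, placement, and width of the intervals are arbitrary choices for the example.

```python
import numpy as np

def coarse_code(state, centers, width):
    """Binary features: 1 if `state` lies inside the interval around each center.

    Overlapping intervals give the "coarse coding" effect: nearby states
    share many active features, so generalization is local and smooth.
    """
    return (np.abs(state - centers) <= width / 2).astype(float)

# Example: 10 overlapping intervals covering states in [0, 1].
centers = np.linspace(0.0, 1.0, 10)
width = 0.3  # each interval overlaps its neighbours

x = coarse_code(0.42, centers, width)
print(x)  # e.g. [0. 0. 0. 1. 1. 1. 0. 0. 0. 0.]
```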

The target $v_\pi(S_t)$ here can be replaced by the Monte Carlo return $G_t$ or by the TD target $R_{t+1} + \gamma\, \hat{v}(S_{t+1}, \mathbf{w})$.
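A minimal sketch of these updates for a linear $\hat{v}$, assuming some feature function (e.g. the coarse coding above) has already produced the vectors `x`; the function names and signatures are my own.

```python
import numpy as np

def v_hat(w, x):
    """Linear value estimate: v̂(s, w) = wᵀ x(s)."""
    return w @ x

def mc_update(w, x, G, alpha):
    """Gradient Monte Carlo update: the target is the observed return G_t."""
    return w + alpha * (G - v_hat(w, x)) * x   # ∇_w v̂(s, w) = x(s) for a linear v̂

def td0_update(w, x, r, x_next, gamma, alpha, done=False):
    """Semi-gradient TD(0) update: the target is r + γ v̂(s', w)."""
    target = r if done else r + gamma * v_hat(w, x_next)
    return w + alpha * (target - v_hat(w, x)) * x
```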

Approximate state-action value function

There are two ways to do this: action-in or action-out. For example (in the linear case),

  • Action in: $\hat{q}(s, a, \mathbf{w}) = \mathbf{x}(s, a)^\top \mathbf{w}$.
  • Action out: $\hat{q}(s, \cdot, W) = W\, \mathbf{x}(s)$, such that $\hat{q}(s, a, W) = [W\, \mathbf{x}(s)]_a$. I think this can be viewed just from a compute-complexity point of view.
  • Action in: $\mathbf{x}(s, a)$ has dimension $d$. Thus $\mathbf{w}$ also has dimension $d$, and we take a dot product (element-wise multiplication, then a sum). The same $\mathbf{w}$ is used across all state-action pairs, but each state-action pair has different features.
  • Action out: $\mathbf{x}(s)$ has dimension $d$ and $W$ has dimension $|\mathcal{A}| \times d$, so we take one dot product per row. This is a multi-head parameterization: we use a different row $\mathbf{w}_a$ for each action, but all actions from the same state share the same feature vector $\mathbf{x}(s)$.

Action-in is better for continuous actions (look at action-out and you can see why: you cannot have one output head per point of a continuum). Action-out is more efficient for (small) discrete action spaces, since a single pass over the state features gives the values of all actions at once.
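A small sketch of the two parameterizations in the linear case (my own illustration, with made-up dimensions):

```python
import numpy as np

d, n_actions = 8, 4
rng = np.random.default_rng(0)

# --- Action-in: q̂(s, a, w) = x(s, a)ᵀ w --------------------------------
w = rng.normal(size=d)                     # one weight vector, shared by all (s, a)

def q_action_in(x_sa, w):
    """x_sa: joint state-action features of dimension d."""
    return x_sa @ w

# --- Action-out: q̂(s, ·, W) = W x(s) -----------------------------------
W = rng.normal(size=(n_actions, d))        # one weight row per action ("multi-head")

def q_action_out(x_s, W):
    """x_s: state features of dimension d; returns all action values at once."""
    return W @ x_s

x_sa = rng.normal(size=d)                  # features of one particular (s, a) pair
print(q_action_in(x_sa, w))                # a single action value
x_s = rng.normal(size=d)                   # features of a state
print(q_action_out(x_s, W))                # values of all 4 actions in one pass
print(q_action_out(x_s, W).argmax())       # greedy action, e.g. for Q-learning
```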

Convergence and Divergence

MC: this is easy. It can be seen as simple regression: we have a (noisy) ground-truth target (the return $G_t$), and there is a closed-form solution in the linear-regression case. Not so simple for TD.
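Concretely, the closed-form solution in the linear case is just ordinary least squares on the observed returns. A sketch with made-up data, where `X` stacks the feature vectors of visited states and `G` holds the corresponding returns:

```python
import numpy as np

# Random placeholders standing in for actual rollout data:
# X[t] = x(S_t) (feature vector of a visited state), G[t] = the return G_t.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
G = rng.normal(size=100)

# Ordinary least squares: w minimises Σ_t (G_t - wᵀ x(S_t))²,
# i.e. w = (XᵀX)⁻¹ Xᵀ G (computed here with a numerically stable solver).
w, *_ = np.linalg.lstsq(X, G, rcond=None)
```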

Let $\overline{\mathrm{VE}}(\mathbf{w})$ denote the value error:

$$\overline{\mathrm{VE}}(\mathbf{w}) = \sum_{s} \mu(s)\, \big[ v_\pi(s) - \hat{v}(s, \mathbf{w}) \big]^2,$$

where $\mu$ is the on-policy state distribution.

The Monte Carlo solution minimises the value error. For TD, there is a theorem (in Sutton and Barto's textbook): the TD fixed point $\mathbf{w}_{\mathrm{TD}}$ satisfies

$$\overline{\mathrm{VE}}(\mathbf{w}_{\mathrm{TD}}) \le \frac{1}{1-\gamma}\, \min_{\mathbf{w}} \overline{\mathrm{VE}}(\mathbf{w}).$$

So the value error of the TD solution is bounded (within a factor of $\frac{1}{1-\gamma}$ of the best achievable).

Still, the TD update is not a true gradient update: the target $R_{t+1} + \gamma\, \hat{v}(S_{t+1}, \mathbf{w})$ includes $\mathbf{w}$ itself on the other side, but we do not differentiate through it (it is a semi-gradient update).
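A small PyTorch-style sketch (mine, not from the notes) of how this is usually handled in practice: the target is computed under `no_grad`, so no gradient flows through it, which is exactly the sense in which this is a semi-gradient rather than a true gradient.

```python
import torch

d = 8
v = torch.nn.Linear(d, 1, bias=False)          # linear v̂(s, w) = wᵀ x(s)
opt = torch.optim.SGD(v.parameters(), lr=0.1)

def td0_step(x, r, x_next, gamma, done=False):
    """One semi-gradient TD(0) step on a single transition."""
    with torch.no_grad():                      # the target is treated as a constant
        target = r + (0.0 if done else gamma * v(x_next).squeeze())
    delta = target - v(x).squeeze()            # gradient flows only through v(x)
    loss = 0.5 * delta ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()
    return float(delta)
```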

Deadly triad

Also a name from Sutton and Barto’s textbook. Algorithms that combine

  • bootstrapping
  • off-policy learning
  • function approximation

… may diverge.
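A tiny numeric sketch (my own, loosely based on the classic two-state example from Sutton and Barto's off-policy divergence discussion) of how these three ingredients together can make the weights blow up:

```python
# Two states with linear features x = 1 and x = 2, sharing a single weight w.
# We repeatedly apply an off-policy semi-gradient TD(0) update only on the
# transition from the first state to the second (reward 0).
w, alpha, gamma = 1.0, 0.1, 0.99
x, x_next = 1.0, 2.0

for step in range(100):
    delta = 0.0 + gamma * (w * x_next) - w * x   # TD error on this transition
    w += alpha * delta * x                       # semi-gradient update
print(w)  # grows without bound: each step multiplies w by (1 + alpha * (2*gamma - 1))
```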