Model-free sampling-based learning. Prediction setting: learn the value function $v_\pi$ online from experience generated under policy $\pi$.

Monte Carlo learning:

  • Update value towards the sampled return $G_t$: $V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t))$
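
A minimal sketch of this update in code (tabular, every-visit Monte Carlo; the episode format and step size below are illustrative choices, not from the source):

```python
from collections import defaultdict

def mc_update(V, episode, alpha=0.1, gamma=1.0):
    """Every-visit Monte Carlo: move V(S_t) towards the sampled return G_t."""
    g = 0.0
    # Walk the episode backwards so each return G_t is accumulated in one pass.
    for state, reward in reversed(episode):   # reward is R_{t+1}, received on leaving state
        g = reward + gamma * g
        V[state] += alpha * (g - V[state])
    return V

# Usage: one (toy) episode of (state, reward-on-leaving-that-state) pairs.
V = defaultdict(float)
mc_update(V, [("s0", 0.0), ("s1", 0.0), ("s2", 1.0)])
```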

Temporal-difference learning:

  • Update value towards the estimated return: $V(S_t) \leftarrow V(S_t) + \alpha (R_{t+1} + \gamma V(S_{t+1}) - V(S_t))$
  • $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error
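
And a corresponding sketch of the tabular TD(0) update for a single observed transition (argument names are placeholders):

```python
def td0_update(V, s, r, s_next, terminal, alpha=0.1, gamma=1.0):
    """TD(0): move V(s) towards the estimated return r + gamma * V(s_next)."""
    target = r if terminal else r + gamma * V[s_next]
    delta = target - V[s]        # the TD error
    V[s] += alpha * delta
    return delta
```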

Comparison of backups

Between dynamic programming, Monte Carlo, and temporal-difference learning: DP backs up over all possible successors for one step (full-width, shallow), MC backs up a full sampled trajectory to the end of the episode (sampled, deep), and TD backs up a single sampled transition (sampled, shallow).

On bootstrapping

An update bootstraps if it involves an existing estimate, i.e. it has to start from a guess rather than waiting for the true return. Because of that, the TD target is a biased estimate of the return, but it has lower variance than the Monte Carlo return.

On the convergence

Monte Carlo converges to the solution with the best mean-squared fit to the observed returns.

TD converges to the solution of the maximum-likelihood Markov model given the data, i.e. the solution of the empirical MDP that best fits the observed transitions.

Intuitively this is because TD only backs up one step at a time, so it implicitly estimates the one-step transition and reward structure.
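
To make the difference concrete, here is a small batch experiment in the spirit of the classic two-state example from Sutton & Barto; the episode data, step size, and sweep count below are illustrative choices:

```python
# Batch MC vs batch TD(0) on a fixed set of eight episodes (gamma = 1).
episodes = (
    [[("A", 0.0, "B"), ("B", 0.0, None)]]   # A -> B (r = 0), then B -> terminal (r = 0)
    + [[("B", 1.0, None)]] * 6               # six episodes: B -> terminal (r = 1)
    + [[("B", 0.0, None)]]                   # one episode:  B -> terminal (r = 0)
)

# Batch Monte Carlo: each state's value is the mean of its observed returns.
returns = {"A": [], "B": []}
for ep in episodes:
    g = 0.0
    for s, r, _ in reversed(ep):             # accumulate the return backwards
        g += r
        returns[s].append(g)
mc = {s: sum(gs) / len(gs) for s, gs in returns.items()}

# Batch TD(0): sweep the same fixed batch until the values stop moving.
td = {"A": 0.0, "B": 0.0}
alpha = 0.01
for _ in range(20_000):
    for ep in episodes:
        for s, r, s_next in ep:
            target = r + (td[s_next] if s_next is not None else 0.0)
            td[s] += alpha * (target - td[s])

print("batch MC:", mc)   # V(A) = 0.0   -- best mean-squared fit to the observed returns
print("batch TD:", td)   # V(A) ~= 0.75 -- value under the max-likelihood Markov model
```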

TD exploits the Markov property, which helps in fully observable environments. MC does not exploit the Markov property, which can help in partially observable environments.

Multi-step returns

Consider the following $n$-step returns for $n = 1, 2, \infty$:

$$
\begin{aligned}
n = 1 \ (\text{TD}) &: \quad G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1}) \\
n = 2 &: \quad G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2}) \\
n = \infty \ (\text{MC}) &: \quad G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T
\end{aligned}
$$

In general, the $n$-step return is defined by

$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$$

Multi-step temporal-difference learning

The multi-step TD update moves the value towards the $n$-step return:

$$V(S_t) \leftarrow V(S_t) + \alpha (G_t^{(n)} - V(S_t))$$

With good tuning of $n$ and $\alpha$, it can converge faster and reach a better solution than both MC and TD(0).
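
A sketch of tabular $n$-step TD prediction under these definitions, assuming a recorded episode where `states[t]` is $S_t$ and `rewards[t]` is $R_{t+1}$ (the helper names, the truncation at the episode end, and the constant step size are choices made here, not from the source):

```python
from collections import defaultdict

def n_step_return(rewards, states, V, t, n, gamma=1.0):
    """G_t^(n): n discounted rewards, then bootstrap on V(S_{t+n}) if non-terminal."""
    T = len(rewards)                         # number of transitions in the episode
    g = 0.0
    for k in range(t, min(t + n, T)):
        g += gamma ** (k - t) * rewards[k]
    if t + n < T:                            # episode still running: bootstrap
        g += gamma ** n * V[states[t + n]]
    return g

def n_step_td_episode(V, states, rewards, n=4, alpha=0.1, gamma=1.0):
    """Move every visited state's value towards its n-step return."""
    for t in range(len(rewards)):
        g = n_step_return(rewards, states, V, t, n, gamma)
        V[states[t]] += alpha * (g - V[states[t]])
    return V

# Usage on a toy 3-transition episode.
V = defaultdict(float)
n_step_td_episode(V, states=["s0", "s1", "s2", "s3"], rewards=[0.0, 0.0, 1.0], n=2)
```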

Mixing multi-step returns

Mixing bootstrapping and MC:

Multi-step returns bootstrap on a single state, $V(S_{t+n})$:

$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$$

You can also bootstrap a little bit on multiple states. This gives a weighted average of $n$-step returns, the $\lambda$-return:

$$G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

(Note that the weights sum to one: $(1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} = 1$.)

Think about it this way: if we only rely on the full sampled return $G_t$ (i.e. $\lambda = 1$), that's MC. If we only rely on the one-step bootstrap $R_{t+1} + \gamma V(S_{t+1})$ (i.e. $\lambda = 0$), that's TD(0).

Intuition: $\tfrac{1}{1-\lambda}$ is the effective 'horizon', so $\lambda = 1 - \tfrac{1}{\text{horizon}}$; e.g. $\lambda = 0.9$ corresponds to a horizon of roughly 10 steps.
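
As a check on these definitions, a sketch that computes the $\lambda$-return of a recorded episode directly from its $n$-step returns (same episode layout as in the sketch above; with `lam = 0.0` it reduces to the TD(0) target and with `lam = 1.0` to the Monte Carlo return):

```python
def lambda_return(rewards, states, V, t, lam, gamma=1.0):
    """G_t^lambda = (1 - lambda) * sum_n lambda^(n-1) * G_t^(n).

    Once the episode has ended, every longer n-step return equals the full
    return G_t, so the remaining geometric weight lambda^(T - t - 1) goes on G_t.
    """
    T = len(rewards)                          # rewards[k] is R_{k+1}
    # Build G_t^(n) for n = 1 .. T - t by extending one reward at a time.
    partial, n_step = 0.0, []
    for n in range(1, T - t + 1):
        partial += gamma ** (n - 1) * rewards[t + n - 1]
        if t + n < T:
            n_step.append(partial + gamma ** n * V[states[t + n]])  # bootstrapped
        else:
            n_step.append(partial)                                  # full return G_t
    # Weight (1 - lambda) * lambda^(n-1) on each bootstrapped return, tail weight on G_t.
    g_lam = sum((1 - lam) * lam ** (n - 1) * g for n, g in enumerate(n_step[:-1], start=1))
    return g_lam + lam ** (T - t - 1) * n_step[-1]

# Usage: lam = 0.0 -> TD(0) target, lam = 1.0 -> Monte Carlo return G_t.
V = {"s1": 0.5, "s2": 0.25}
print(lambda_return(rewards=[0.0, 1.0], states=["s0", "s1", "s2"], V=V, t=0, lam=0.5))  # -> 0.75
```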