Model-free sampling-based learning. Prediction setting: learn the value function $v_\pi$ online from experience generated under policy $\pi$.

Monte Carlo learning:

  • Update value towards the sampled return $G_t$: $V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t))$
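
A minimal sketch of this update in code (tabular, every-visit Monte Carlo; the episode format and step size below are illustrative choices, not from the source):

```python
from collections import defaultdict

def mc_update(V, episode, alpha=0.1, gamma=1.0):
    """Every-visit Monte Carlo: move V(S_t) towards the sampled return G_t."""
    g = 0.0
    # Walk the episode backwards so each return G_t is accumulated in one pass.
    for state, reward in reversed(episode):   # reward is R_{t+1}, received on leaving state
        g = reward + gamma * g
        V[state] += alpha * (g - V[state])
    return V

# Usage: one (toy) episode of (state, reward-on-leaving-that-state) pairs.
V = defaultdict(float)
mc_update(V, [("s0", 0.0), ("s1", 0.0), ("s2", 1.0)])
```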

Temporal-difference learning:

  • Update value towards the estimated return: $V(S_t) \leftarrow V(S_t) + \alpha (R_{t+1} + \gamma V(S_{t+1}) - V(S_t))$
  • $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error
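
And a corresponding sketch of the tabular TD(0) update for a single observed transition (argument names are placeholders):

```python
def td0_update(V, s, r, s_next, terminal, alpha=0.1, gamma=1.0):
    """TD(0): move V(s) towards the estimated return r + gamma * V(s_next)."""
    target = r if terminal else r + gamma * V[s_next]
    delta = target - V[s]        # the TD error
    V[s] += alpha * delta
    return delta
```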

Comparison of backups

Between dynamic programming, Monte Carlo, and temporal-difference learning: DP backs up over all possible successors for one step (full-width, shallow), MC backs up a full sampled trajectory to the end of the episode (sampled, deep), and TD backs up a single sampled transition (sampled, shallow).

On bootstrapping

An update bootstraps if it involves an existing estimate, i.e. it has to start from a guess rather than waiting for the true return. Because of that, the TD target is a biased estimate of the return, but it has lower variance than the Monte Carlo return.

On the convergence

Monte Carlo converges to the solution with the best mean-squared fit to the observed returns.

TD converges to the solution of the maximum-likelihood Markov model given the data, i.e. the solution of the empirical MDP that best fits the observed transitions.

Intuitively this is because TD only backs up one step at a time, so it implicitly estimates the one-step transition and reward structure.
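
To make the difference concrete, here is a small batch experiment in the spirit of the classic two-state example from Sutton & Barto; the episode data, step size, and sweep count below are illustrative choices:

```python
# Batch MC vs batch TD(0) on a fixed set of eight episodes (gamma = 1).
episodes = (
    [[("A", 0.0, "B"), ("B", 0.0, None)]]   # A -> B (r = 0), then B -> terminal (r = 0)
    + [[("B", 1.0, None)]] * 6               # six episodes: B -> terminal (r = 1)
    + [[("B", 0.0, None)]]                   # one episode:  B -> terminal (r = 0)
)

# Batch Monte Carlo: each state's value is the mean of its observed returns.
returns = {"A": [], "B": []}
for ep in episodes:
    g = 0.0
    for s, r, _ in reversed(ep):             # accumulate the return backwards
        g += r
        returns[s].append(g)
mc = {s: sum(gs) / len(gs) for s, gs in returns.items()}

# Batch TD(0): sweep the same fixed batch until the values stop moving.
td = {"A": 0.0, "B": 0.0}
alpha = 0.01
for _ in range(20_000):
    for ep in episodes:
        for s, r, s_next in ep:
            target = r + (td[s_next] if s_next is not None else 0.0)
            td[s] += alpha * (target - td[s])

print("batch MC:", mc)   # V(A) = 0.0   -- best mean-squared fit to the observed returns
print("batch TD:", td)   # V(A) ~= 0.75 -- value under the max-likelihood Markov model
```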

TD exploits the Markov property, which helps in fully observable environments. MC does not exploit the Markov property, which can help in partially observable environments.

Multi-step returns

Consider the following $n$-step returns for $n = 1, 2, \infty$:

$$
\begin{aligned}
n = 1 \ (\text{TD}) &: \quad G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1}) \\
n = 2 &: \quad G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2}) \\
n = \infty \ (\text{MC}) &: \quad G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T
\end{aligned}
$$

In general, the $n$-step return is defined by

$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$$

Multi-step temporal-difference learning

The multi-step TD update moves the value towards the $n$-step return:

$$V(S_t) \leftarrow V(S_t) + \alpha (G_t^{(n)} - V(S_t))$$

With good tuning of $n$ and $\alpha$, it can converge faster and reach a better solution than both MC and TD(0).
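
A sketch of tabular $n$-step TD prediction under these definitions, assuming a recorded episode where `states[t]` is $S_t$ and `rewards[t]` is $R_{t+1}$ (the helper names, the truncation at the episode end, and the constant step size are choices made here, not from the source):

```python
from collections import defaultdict

def n_step_return(rewards, states, V, t, n, gamma=1.0):
    """G_t^(n): n discounted rewards, then bootstrap on V(S_{t+n}) if non-terminal."""
    T = len(rewards)                         # number of transitions in the episode
    g = 0.0
    for k in range(t, min(t + n, T)):
        g += gamma ** (k - t) * rewards[k]
    if t + n < T:                            # episode still running: bootstrap
        g += gamma ** n * V[states[t + n]]
    return g

def n_step_td_episode(V, states, rewards, n=4, alpha=0.1, gamma=1.0):
    """Move every visited state's value towards its n-step return."""
    for t in range(len(rewards)):
        g = n_step_return(rewards, states, V, t, n, gamma)
        V[states[t]] += alpha * (g - V[states[t]])
    return V

# Usage on a toy 3-transition episode.
V = defaultdict(float)
n_step_td_episode(V, states=["s0", "s1", "s2", "s3"], rewards=[0.0, 0.0, 1.0], n=2)
```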

Mixing multi-step returns

Mixing bootstrapping and MC:

Multi-step returns bootstrap on a single state, $V(S_{t+n})$:

$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$$

You can also bootstrap a little bit on multiple states. This gives a weighted average of $n$-step returns, the $\lambda$-return:

$$G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

(Note that the weights sum to one: $(1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} = 1$.)

Think about it this way: if we only rely on the full sampled return $G_t$ (i.e. $\lambda = 1$), that's MC. If we only rely on the one-step bootstrap $R_{t+1} + \gamma V(S_{t+1})$ (i.e. $\lambda = 0$), that's TD(0).

Intuition: $\tfrac{1}{1-\lambda}$ is the effective 'horizon', so $\lambda = 1 - \tfrac{1}{\text{horizon}}$; e.g. $\lambda = 0.9$ corresponds to a horizon of roughly 10 steps.
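
As a check on these definitions, a sketch that computes the $\lambda$-return of a recorded episode directly from its $n$-step returns (same episode layout as in the sketch above; with `lam = 0.0` it reduces to the TD(0) target and with `lam = 1.0` to the Monte Carlo return):

```python
def lambda_return(rewards, states, V, t, lam, gamma=1.0):
    """G_t^lambda = (1 - lambda) * sum_n lambda^(n-1) * G_t^(n).

    Once the episode has ended, every longer n-step return equals the full
    return G_t, so the remaining geometric weight lambda^(T - t - 1) goes on G_t.
    """
    T = len(rewards)                          # rewards[k] is R_{k+1}
    # Build G_t^(n) for n = 1 .. T - t by extending one reward at a time.
    partial, n_step = 0.0, []
    for n in range(1, T - t + 1):
        partial += gamma ** (n - 1) * rewards[t + n - 1]
        if t + n < T:
            n_step.append(partial + gamma ** n * V[states[t + n]])  # bootstrapped
        else:
            n_step.append(partial)                                  # full return G_t
    # Weight (1 - lambda) * lambda^(n-1) on each bootstrapped return, tail weight on G_t.
    g_lam = sum((1 - lam) * lam ** (n - 1) * g for n, g in enumerate(n_step[:-1], start=1))
    return g_lam + lam ** (T - t - 1) * n_step[-1]

# Usage: lam = 0.0 -> TD(0) target, lam = 1.0 -> Monte Carlo return G_t.
V = {"s1": 0.5, "s2": 0.25}
print(lambda_return(rewards=[0.0, 1.0], states=["s0", "s1", "s2"], V=V, t=0, lam=0.5))  # -> 0.75
```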