Goal: given some function $f(X)$ with random inputs $X$, and a distribution $\mu$, estimate the expectation of $f(X)$ under a different (target) distribution $\pi$.

Solution: weight the data by the ratio $\frac{\pi(X)}{\mu(X)}$:

$$\mathbb{E}_{X \sim \pi}\left[f(X)\right] = \sum_x \pi(x)\,f(x) = \sum_x \mu(x)\,\frac{\pi(x)}{\mu(x)}\,f(x) = \mathbb{E}_{X \sim \mu}\left[\frac{\pi(X)}{\mu(X)}\,f(X)\right]$$

This is useful in off-policy learning: when following behaviour policy $\mu$, we can use $\frac{\pi(X)}{\mu(X)}\,f(X)$ as an unbiased sample of $\mathbb{E}_{X \sim \pi}[f(X)]$.
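
As a quick numerical check (a toy sketch; the distributions and `f` values below are made up), sampling under $\mu$ and reweighting by the ratio recovers the expectation under $\pi$:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([0.5, 0.3, 0.2])   # behaviour distribution (what we sample from)
pi = np.array([0.1, 0.2, 0.7])   # target distribution (what we want E[f] under)
f = np.array([1.0, 4.0, 9.0])    # f(x) evaluated at each of the three outcomes

# Sample under mu, then weight each sample by pi(x)/mu(x).
xs = rng.choice(3, size=100_000, p=mu)
estimate = np.mean(pi[xs] / mu[xs] * f[xs])

print(estimate)   # importance-sampled estimate of E_pi[f(X)]
print(pi @ f)     # exact value for comparison: 0.1*1 + 0.2*4 + 0.7*9 = 7.2
```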

For Off-Policy Monte Carlo:

Goal: estimate $v_\pi$
Data: a trajectory $S_t, A_t, R_{t+1}, \ldots, S_T$ generated with $\mu$
Solution: use the return $G_t$, and correct it along the whole episode:

$$G_t^{\pi/\mu} = \frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)} \frac{\pi(A_{t+1} \mid S_{t+1})}{\mu(A_{t+1} \mid S_{t+1})} \cdots \frac{\pi(A_{T-1} \mid S_{T-1})}{\mu(A_{T-1} \mid S_{T-1})}\, G_t$$

then update the value towards the corrected return: $V(S_t) \leftarrow V(S_t) + \alpha \left( G_t^{\pi/\mu} - V(S_t) \right)$
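
A minimal sketch of this update, assuming a tabular setup (`V`, `pi`, `mu` and the episode format are illustrative, with `pi[s][a]` giving the target policy's action probability):

```python
def off_policy_mc_update(V, episode, pi, mu, alpha=0.1, gamma=1.0):
    """Off-policy every-visit Monte Carlo with ordinary importance sampling.

    episode: list of (state, action, reward) tuples; reward follows (state, action).
    pi, mu:  pi[s][a] / mu[s][a] give target/behaviour action probabilities.
    """
    G, rho = 0.0, 1.0
    # Walk the episode backwards so the return and the importance
    # weight for each time step accumulate in a single pass.
    for s, a, r in reversed(episode):
        G = r + gamma * G                 # return G_t
        rho *= pi[s][a] / mu[s][a]        # product of ratios from t to T-1
        V[s] += alpha * (rho * G - V[s])  # update towards corrected return
    return V
```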

For TD, only a single importance-sampling correction is needed, applied to the TD target:

$$V(S_t) \leftarrow V(S_t) + \alpha \left( \frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)} \left( R_{t+1} + \gamma V(S_{t+1}) \right) - V(S_t) \right)$$
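
This single-correction form has much lower variance than the Monte Carlo product of ratios. A corresponding sketch (same illustrative tabular conventions as above):

```python
def off_policy_td_update(V, s, a, r, s_next, pi, mu, alpha=0.1, gamma=1.0):
    """Off-policy TD(0): one importance-sampling ratio corrects the TD target."""
    rho = pi[s][a] / mu[s][a]
    td_target = r + gamma * V[s_next]
    V[s] += alpha * (rho * td_target - V[s])
    return V
```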

For SARSA, rather than importance sampling the next action, we can take the expectation of $Q$ under the target policy's action probabilities:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma \sum_{a'} \pi(a' \mid S_{t+1})\, Q(S_{t+1}, a') - Q(S_t, A_t) \right)$$

This is called Expected SARSA (a code sketch follows the list below):

  • No importance sampling is required
  • Next action may be chosen using the behaviour policy $\mu$
  • But we weight by the action probabilities under $\pi$
  • Update $Q(S_t, A_t)$ towards the expected value of the alternative actions
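
A sketch of the Expected SARSA update, assuming `Q` is a dict of dicts (`Q[s][a]`) and `pi[s][a]` gives the target policy's probabilities (names illustrative):

```python
def expected_sarsa_update(Q, s, a, r, s_next, pi, alpha=0.1, gamma=1.0):
    """Expected SARSA: no importance sampling; the target averages Q over pi."""
    expected_q = sum(pi[s_next][b] * Q[s_next][b] for b in Q[s_next])
    Q[s][a] += alpha * (r + gamma * expected_q - Q[s][a])
    return Q
```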

So you can see the expectation there instead of the $\max$ in Q-learning. In fact, Q-learning is a special case of Expected SARSA with a greedy target policy $\pi$: when $\pi$ puts all probability on $\arg\max_{a'} Q(S_{t+1}, a')$, the expectation reduces to $\max_{a'} Q(S_{t+1}, a')$, recovering the Q-learning update

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \right)$$
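
In code, the special case amounts to replacing the expectation with a max (same illustrative `Q[s][a]` layout as above):

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=1.0):
    """Q-learning: Expected SARSA with a greedy target policy over Q."""
    best = max(Q[s_next].values())   # max_a' Q(s_next, a')
    Q[s][a] += alpha * (r + gamma * best - Q[s][a])
    return Q
```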