Goal: given some function $f(X)$ with random inputs $X \sim P(X)$, and a distribution $P$, estimate the expectation of $f(X)$ under a different (target) distribution $Q$.
Solution: weight the data by the ratio $Q(X)/P(X)$:

$$\mathbb{E}_{X \sim Q}[f(X)] = \sum_X Q(X) f(X) = \sum_X P(X) \frac{Q(X)}{P(X)} f(X) = \mathbb{E}_{X \sim P}\!\left[\frac{Q(X)}{P(X)} f(X)\right]$$
This is useful in Off-Policy learning: when following behaviour policy $\mu$ while evaluating target policy $\pi$, we can use $\frac{\pi(A_t|S_t)}{\mu(A_t|S_t)} f(S_t, A_t)$ as an unbiased sample of the expectation under $\pi$.
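As a quick sanity check, here is a minimal NumPy/SciPy sketch of the identity above; the Gaussian example and all function names are illustrative choices, not from the notes:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def importance_sampling_estimate(f, sample_p, p_pdf, q_pdf, n=100_000):
    """Estimate E_{X~Q}[f(X)] from samples drawn under P,
    weighting each sample by the ratio Q(X)/P(X)."""
    x = sample_p(n)                   # data generated under P
    weights = q_pdf(x) / p_pdf(x)     # importance ratios Q(X)/P(X)
    return np.mean(weights * f(x))

# Example: estimate E[X^2] under Q = N(1, 1) from samples of P = N(0, 1).
est = importance_sampling_estimate(
    f=lambda x: x ** 2,
    sample_p=lambda n: rng.normal(0.0, 1.0, size=n),
    p_pdf=norm(0.0, 1.0).pdf,
    q_pdf=norm(1.0, 1.0).pdf,
)
print(est)  # ~2.0, since E[X^2] = mu^2 + sigma^2 = 1 + 1 under N(1, 1)
```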
For Off-Policy Monte Carlo:
Goal: estimate $v_\pi(s)$, the value function of the target policy $\pi$
Data: trajectory $S_t, A_t, R_{t+1}, \ldots, S_T$ generated with behaviour policy $\mu$
Solution: use the return $G_t$, and correct it with importance sampling ratios along the whole episode:

$$G_t^{\pi/\mu} = \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)} \frac{\pi(A_{t+1}|S_{t+1})}{\mu(A_{t+1}|S_{t+1})} \cdots \frac{\pi(A_T|S_T)}{\mu(A_T|S_T)} G_t$$

then update the value towards the corrected return:

$$V(S_t) \leftarrow V(S_t) + \alpha \left( G_t^{\pi/\mu} - V(S_t) \right)$$
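A rough sketch of this correction, assuming a tabular value table `V` and `pi`/`mu` given as probability functions; the interface and per-episode update are my own conventions:

```python
def off_policy_mc_update(V, trajectory, pi, mu, alpha=0.1, gamma=1.0):
    """One off-policy Monte Carlo update per visited state.

    trajectory: list of (state, action, reward) tuples, where reward
    is R_{t+1} received after taking A_t in S_t.
    pi(a, s) / mu(a, s): target / behaviour action probabilities.
    """
    T = len(trajectory)
    for t in range(T):
        # Importance ratios are multiplied over the *whole* remaining episode.
        rho, G = 1.0, 0.0
        for k in range(t, T):
            s_k, a_k, r_k = trajectory[k]
            rho *= pi(a_k, s_k) / mu(a_k, s_k)
            G += gamma ** (k - t) * r_k
        s_t = trajectory[t][0]
        V[s_t] += alpha * (rho * G - V[s_t])   # update towards G_t^{pi/mu}
    return V
```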
For TD, only a single importance sampling correction is needed, applied to the TD target:

$$V(S_t) \leftarrow V(S_t) + \alpha \left( \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)} \left( R_{t+1} + \gamma V(S_{t+1}) \right) - V(S_t) \right)$$
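The corresponding one-step update, again as an illustrative sketch with the same assumed interface:

```python
def off_policy_td_update(V, s, a, r, s_next, pi, mu, alpha=0.1, gamma=1.0):
    """Off-policy TD(0): a single importance ratio corrects the TD target."""
    rho = pi(a, s) / mu(a, s)                       # one-step correction only
    V[s] += alpha * (rho * (r + gamma * V[s_next]) - V[s])
    return V
```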
For SARSA, the analogous idea replaces the sampled next action with an expectation over the target policy $\pi$:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma \sum_a \pi(a|S_{t+1}) Q(S_{t+1}, a) - Q(S_t, A_t) \right)$$

This is called Expected SARSA:
- No importance sampling is required
- The next action may be chosen using the behaviour policy, $A_{t+1} \sim \mu(\cdot \mid S_{t+1})$
- But we consider the action probabilities under the target policy $\pi$
- We update $Q(S_t, A_t)$ towards the expected value of the alternative actions
So the target contains the expectation $\sum_a \pi(a|S_{t+1}) Q(S_{t+1}, a)$ instead of the $\max_a Q(S_{t+1}, a)$ of Q-learning. In fact, Q-learning is a special case of Expected SARSA with a greedy target policy $\pi$: the expectation then collapses to the max.
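Both updates side by side, as a sketch over a tabular `Q` array indexed by `[state, action]` (the names and layout are assumptions):

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, pi, alpha=0.1, gamma=1.0):
    """Expected SARSA: no importance sampling; the target averages
    Q(s', .) under the target policy pi, regardless of which action
    the behaviour policy actually takes next."""
    n_actions = Q.shape[1]
    expected_q = sum(pi(a2, s_next) * Q[s_next, a2] for a2 in range(n_actions))
    Q[s, a] += alpha * (r + gamma * expected_q - Q[s, a])
    return Q

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=1.0):
    """Q-learning = Expected SARSA with a greedy target policy:
    the expectation over pi collapses to a max over actions."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    return Q
```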