On-policy learning
- Learn about behaviour policy π from experience sampled from π
Off-policy learning
- Learn about target policy π from experience sampled from a different behaviour policy μ
- Learn ‘counterfactually’ about other things you could do: “what if…?”
- E.g., “What if I turned left?” → new observations, rewards?
- E.g., “What if I played more defensively?” → different win probability?
Evaluate target policy π(a|s) to compute v_π(s) or q_π(s, a)
While using behaviour policy μ(a|s) to generate actions
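
As a minimal sketch of evaluating one policy from another policy's data, consider a one-state problem (a bandit). The setup is an illustrative assumption, not something from the source: the action names, the reward means in `TRUE_MEANS`, and the helpers `evaluate_target_policy` and `noisy_reward` are all hypothetical. Actions are drawn from μ, action values are estimated by sample averages, and the value of π is then computed as the sum over actions of π(a) q(a). This only works if μ gives non-zero probability to every action that π might take.

```python
import random

# Hypothetical true mean rewards, chosen only for this illustration.
TRUE_MEANS = {"left": 1.0, "right": 0.0}


def noisy_reward(action, rng):
    """Return the (assumed) mean reward of the action plus Gaussian noise."""
    return TRUE_MEANS[action] + rng.gauss(0.0, 0.1)


def evaluate_target_policy(target_pi, behaviour_mu, reward_fn,
                           num_steps=10_000, seed=0):
    """Estimate q(a) from data generated by behaviour_mu, then v_pi."""
    rng = random.Random(seed)
    actions = list(behaviour_mu)
    weights = [behaviour_mu[a] for a in actions]
    counts = {a: 0 for a in actions}
    q = {a: 0.0 for a in actions}

    for _ in range(num_steps):
        # Actions are generated by the behaviour policy mu, never by pi.
        a = rng.choices(actions, weights=weights)[0]
        r = reward_fn(a, rng)
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]  # incremental sample average

    # Value of the target policy: v_pi = sum_a pi(a) * q(a).
    v_pi = sum(target_pi[a] * q[a] for a in actions)
    return q, v_pi


mu = {"left": 0.5, "right": 0.5}   # behaviour policy: uniformly random
pi = {"left": 1.0, "right": 0.0}   # target policy: always turn left
q_est, v_pi_est = evaluate_target_policy(pi, mu, noisy_reward)
```

Here `v_pi_est` answers the counterfactual question "what if I always turned left?" even though the data came from a policy that turns left only half the time.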
Why is this important?
- Learn from observing humans or other agents (e.g., from logged data)
- Re-use experience from old policies (e.g., from your own past experience)
- Learn about multiple policies while following one policy
- Learn about greedy policy while following exploratory policy
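
The last point can be sketched with Q-learning, one standard off-policy algorithm; the source does not name a specific method, so treat this as an illustration under assumptions. The four-state chain environment, the `step` function, and all hyperparameters below are hypothetical. The agent acts with an ε-greedy behaviour policy but bootstraps from the maximum next-state action value, so the values it learns are those of the greedy target policy.

```python
import random

N_STATES = 4           # states 0..3; state 3 is terminal (assumed toy chain)
LEFT, RIGHT = 0, 1


def step(state, action):
    """Deterministic chain: reward 1 on reaching the terminal state."""
    next_state = max(state - 1, 0) if action == LEFT else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def q_learning(episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Behaviour policy (exploratory): epsilon-greedy w.r.t. q.
            if rng.random() < epsilon:
                action = rng.choice([LEFT, RIGHT])
            else:
                action = max((LEFT, RIGHT), key=lambda a: q[state][a])
            next_state, reward, done = step(state, action)
            # Target policy (greedy): bootstrap from the best next action,
            # regardless of what the behaviour policy will actually do next.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = q_learning()
# Greedy action in each non-terminal state, read off the learned values.
greedy_policy = [max((LEFT, RIGHT), key=lambda a: q_table[s][a])
                 for s in range(N_STATES - 1)]
```

The behaviour policy keeps taking random actions 30% of the time throughout, yet `greedy_policy` is read from values that describe acting greedily.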