On-policy learning

  • Learn about behaviour policy π from experience sampled from π

Off-policy learning

  • Learn about target policy π from experience sampled from a different behaviour policy μ
  • Learn ‘counterfactually’ about other things you could do: “what if…?”
    • E.g., “What if I turned left?” → new observations, rewards?
    • E.g., “What if I played more defensively?” → different win probability?

Evaluate target policy π(a|s) to compute v_π(s) or q_π(s, a),
while using behaviour policy μ(a|s) to generate actions
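
One common way to evaluate π from data generated by μ is importance sampling. The sketch below is a minimal illustration, not a method the slides prescribe: it assumes a one-state problem with deterministic rewards, and the names off_policy_value, pi, mu and reward are hypothetical. Each sampled reward is reweighted by the ratio π(a)/μ(a), which requires μ(a) > 0 wherever π(a) > 0.

```python
import random

def off_policy_value(pi, mu, reward, n_samples=100_000, seed=0):
    """Estimate v_pi for a one-state problem from actions sampled under mu."""
    rng = random.Random(seed)
    actions = list(mu)
    weights = [mu[a] for a in actions]
    total = 0.0
    for _ in range(n_samples):
        # The behaviour policy mu generates the action.
        a = rng.choices(actions, weights=weights)[0]
        # The importance ratio corrects for sampling from mu instead of pi.
        rho = pi[a] / mu[a]
        total += rho * reward[a]
    return total / n_samples

# Hypothetical example: the target prefers "left", the behaviour is uniform.
pi = {"left": 0.9, "right": 0.1}
mu = {"left": 0.5, "right": 0.5}
reward = {"left": 1.0, "right": 0.0}

estimate = off_policy_value(pi, mu, reward)
true_value = sum(pi[a] * reward[a] for a in pi)  # exact v_pi, for comparison
```

The estimate approaches true_value as the sample count grows, even though no action was ever chosen by π.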

Why is this important?

  • Learn from observing humans or other agents (e.g., from logged data)
  • Re-use experience from old policies (e.g., from your own past experience)
  • Learn about multiple policies while following one policy
  • Learn about greedy policy while following exploratory policy
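
The last point can be illustrated with tabular Q-learning, which bootstraps from the greedy action while a different policy selects the actions. This is a sketch under stated assumptions: the five-state chain, its reward of 1 on reaching the final state, the uniformly random behaviour policy, and all names and constants below are hypothetical choices made for illustration.

```python
import random

N_STATES = 5            # states 0..4; state 4 is terminal (toy chain, assumed)
LEFT, RIGHT = 0, 1
GAMMA, ALPHA = 0.9, 0.5

def step(state, action):
    """Deterministic chain: reward 1 only on entering the terminal state."""
    next_state = max(0, state - 1) if action == LEFT else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def q_learning(n_episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(n_episodes):
        state, done = 0, False
        while not done:
            # Behaviour policy: uniformly random, i.e. purely exploratory.
            action = rng.choice((LEFT, RIGHT))
            next_state, reward, done = step(state, action)
            # Target policy: greedy, so bootstrap from the best next action.
            bootstrap = 0.0 if done else max(q[next_state])
            q[state][action] += ALPHA * (reward + GAMMA * bootstrap - q[state][action])
            state = next_state
    return q

q = q_learning()
# Greedy policy read off the learned values for the non-terminal states.
greedy_policy = [max((LEFT, RIGHT), key=lambda a: q[s][a])
                 for s in range(N_STATES - 1)]
```

Although the agent never acts greedily during learning, the values in q describe the greedy policy, which here moves right in every non-terminal state.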