See also: Deep Q

Recall the dynamic programming algorithms: policy iteration & value iteration.

We have the analogous model-free TD algorithms: policy iteration on $Q$ corresponds to SARSA, and value iteration on $Q$ corresponds to Q-learning.

Of course, value iteration on the state-value function $V$ cannot be sampled, because its backup takes a max over actions of an expectation under the model, which a single sampled transition cannot estimate, so there is no TD analogue for it.

Q-learning is an off-policy algorithm: it estimates the value of the greedy policy while following a different, exploratory behaviour policy. Its update is

$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big(R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\big)$

Acting greedily all the time would not explore sufficiently, which is why the behaviour policy is typically ε-greedy.

Its soundness depends on a new theorem:

Q-learning control converges to the optimal action-value function, $Q(s, a) \to q_*(s, a)$, as long as we take each action in each state infinitely often.

This is different from GLIE: the policy doesn’t need to converge to the greedy policy. It works for any behaviour policy that eventually selects all actions sufficiently often. It requires appropriately decaying step sizes $\alpha_t$ with $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$, e.g., $\alpha_t = \tfrac{1}{t^\omega}$ with $\omega \in (0.5, 1)$.
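A minimal tabular Q-learning sketch putting these pieces together (ε-greedy behaviour, greedy target, decaying step sizes $\alpha_t = 1/t^\omega$). The environment interface and all hyperparameters here are illustrative assumptions, not from these notes:

```python
import numpy as np

# A minimal sketch of tabular Q-learning with an epsilon-greedy behaviour
# policy and per-(state, action) step sizes alpha_t = 1 / t**omega.
# The environment interface (env.reset() -> s, env.step(a) -> (s2, r, done))
# and all hyperparameters are illustrative assumptions.
def q_learning(env, n_states, n_actions, episodes=500,
               gamma=0.99, epsilon=0.1, omega=0.8):
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))  # counts t for the decaying step size

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy: keeps taking every action in every state
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))

            s2, r, done = env.step(a)

            visits[s, a] += 1
            alpha = 1.0 / visits[s, a] ** omega  # decaying (Robbins-Monro) step size

            # off-policy target: bootstrap on the greedy action in s2
            target = r + (0.0 if done else gamma * np.max(Q[s2]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q
```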

A comparison of SARSA and Q-learning

During training, SARSA gets higher reward: it is on-policy, using the same ε-greedy policy both for acting and as the target of its updates, so it learns that walking along the cliff edge is dangerous (the occasional exploratory step falls off) and settles on a safer path. Q-learning, on the other hand, sticks to the optimal edge path: according to its value estimates, which evaluate the greedy policy, that path simply has the larger $Q$. It therefore keeps taking the edge path while exploring ε-greedily and falls off the cliff more often during training. Note that Q-learning’s final greedy policy is nevertheless the optimal one; see the sketch of the two update rules below.
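To make the on-policy vs. off-policy difference concrete, here is a sketch of the two update rules side by side; the array layout and argument names are assumptions for illustration:

```python
import numpy as np

# Q is an (n_states, n_actions) array; a2 is the action the epsilon-greedy
# behaviour policy actually takes in s2. Names are illustrative assumptions.
def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma, done):
    # on-policy: bootstrap on the action the agent will really take, so the
    # occasional exploratory step off the cliff lowers Q along the edge path
    target = r + (0.0 if done else gamma * Q[s2, a2])
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s2, alpha, gamma, done):
    # off-policy: bootstrap on the greedy action, so the edge path keeps its
    # high value even though exploration sometimes falls off during training
    target = r + (0.0 if done else gamma * np.max(Q[s2]))
    Q[s, a] += alpha * (target - Q[s, a])
```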

Overestimation

Recall the Q-learning target:

$R_{t+1} + \gamma \max_a Q(S_{t+1}, a) = R_{t+1} + \gamma\, Q\big(S_{t+1}, \operatorname*{argmax}_a Q(S_{t+1}, a)\big)$

The max uses the same values both to select an action and to evaluate it… but those values are approximate, which makes us:

  • more likely to select overestimated values
  • less likely to select underestimated values

This max-induced bias persists across updates. Imagine you are in a state with 100 actions, each with a stochastic outcome. If one action by chance hits a jackpot, its estimate is inflated and the greedy target keeps picking it, so you keep revisiting that action until enough samples pull the estimate back down. That leads to very slow convergence: the agent is blinded by overestimated values.

Another way to think about it: say we have two random variables $X_1$ and $X_2$; then

$\mathbb{E}[\max(X_1, X_2)] \ge \max(\mathbb{E}[X_1], \mathbb{E}[X_2]).$

Our $Q$ estimates are noisy, so taking the max over them gives (a sample of) the left-hand side, while what we actually want is the right-hand side; the max is therefore biased upward.
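A tiny simulation of this inequality, with made-up numbers purely for illustration: two actions have identical true value 0, but we only see noisy estimates, and the max over the noisy estimates is biased upward:

```python
import numpy as np

# Simulate E[max(X1, X2)] >= max(E[X1], E[X2]) for two actions whose true
# values are both 0, observed through Gaussian noise (illustrative numbers).
rng = np.random.default_rng(0)
true_values = np.array([0.0, 0.0])                  # max of the true values is 0
estimates = true_values + rng.normal(0.0, 1.0, size=(100_000, 2))

print(np.max(true_values))                          # 0.0   -> what we want to estimate
print(np.mean(np.max(estimates, axis=1)))           # ~0.56 -> biased upward by the max
```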

Double Q-learning

Store two action-value functions, $Q_1$ and $Q_2$, with the updates

$Q_1(S_t, A_t) \leftarrow Q_1(S_t, A_t) + \alpha_t \big(R_{t+1} + \gamma\, Q_2(S_{t+1}, \operatorname*{argmax}_a Q_1(S_{t+1}, a)) - Q_1(S_t, A_t)\big) \quad (1)$

$Q_2(S_t, A_t) \leftarrow Q_2(S_t, A_t) + \alpha_t \big(R_{t+1} + \gamma\, Q_1(S_{t+1}, \operatorname*{argmax}_a Q_2(S_{t+1}, a)) - Q_2(S_t, A_t)\big) \quad (2)$

Each step $t$, pick $Q_1$ or $Q_2$ (e.g., randomly) and update it, using (1) for $Q_1$ or (2) for $Q_2$.

This mitigates the issue because the noise in the two estimates is decorrelated: the action that maximizes one estimate is evaluated with the other, so an action that looks best only because of noise in $Q_1$ is unlikely to also be overestimated by $Q_2$. A minimal sketch of one update step follows.
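A minimal sketch of one double Q-learning update step, under the same assumed tabular setup as above:

```python
import numpy as np

# Q1 and Q2 are two (n_states, n_actions) tables; which one to update is
# chosen at random each step. The tabular setup is an assumption.
def double_q_update(Q1, Q2, s, a, r, s2, alpha, gamma, done):
    if np.random.rand() < 0.5:
        # update Q1: select the action with Q1, evaluate it with Q2  (eq. 1)
        a_star = int(np.argmax(Q1[s2]))
        target = r + (0.0 if done else gamma * Q2[s2, a_star])
        Q1[s, a] += alpha * (target - Q1[s, a])
    else:
        # update Q2: select the action with Q2, evaluate it with Q1  (eq. 2)
        a_star = int(np.argmax(Q2[s2]))
        target = r + (0.0 if done else gamma * Q1[s2, a_star])
        Q2[s, a] += alpha * (target - Q2[s, a])
    # the behaviour policy can act epsilon-greedily on, e.g., Q1 + Q2
```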

We can also extend this to SARSA.