See also: Deep Q

Recall the dynamic programming algorithms: policy iteration & value iteration.

We have the analogous model-free TD algorithms: policy iteration on $Q$ corresponds to SARSA, and value iteration on $Q$ corresponds to Q-learning.

Of course, value iteration on the state-value function $V$ cannot be sampled, because its backup takes a max over actions of an expectation under the model, which a single sampled transition cannot estimate, so there is no TD analogue for it.

Q-learning is an off-policy algorithm: it estimates the value of the greedy policy while following a different, exploratory behaviour policy. Its update is

$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big(R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\big)$

Acting greedily all the time would not explore sufficiently, which is why the behaviour policy is typically ε-greedy.

Its soundness depends on a new theorem:

Q-learning control converges to the optimal action-value function, $Q(s, a) \to q_*(s, a)$, as long as we take each action in each state infinitely often.

This is different from GLIE: the policy doesn’t need to converge to the greedy policy. It works for any behaviour policy that eventually selects all actions sufficiently often. It requires appropriately decaying step sizes $\alpha_t$ with $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$, e.g., $\alpha_t = \tfrac{1}{t^\omega}$ with $\omega \in (0.5, 1)$.
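A minimal tabular Q-learning sketch putting these pieces together (ε-greedy behaviour, greedy target, decaying step sizes $\alpha_t = 1/t^\omega$). The environment interface and all hyperparameters here are illustrative assumptions, not from these notes:

```python
import numpy as np

# A minimal sketch of tabular Q-learning with an epsilon-greedy behaviour
# policy and per-(state, action) step sizes alpha_t = 1 / t**omega.
# The environment interface (env.reset() -> s, env.step(a) -> (s2, r, done))
# and all hyperparameters are illustrative assumptions.
def q_learning(env, n_states, n_actions, episodes=500,
               gamma=0.99, epsilon=0.1, omega=0.8):
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))  # counts t for the decaying step size

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy: keeps taking every action in every state
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))

            s2, r, done = env.step(a)

            visits[s, a] += 1
            alpha = 1.0 / visits[s, a] ** omega  # decaying (Robbins-Monro) step size

            # off-policy target: bootstrap on the greedy action in s2
            target = r + (0.0 if done else gamma * np.max(Q[s2]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q
```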

A comparison of SARSA and Q-learning

During training, SARSA gets higher reward: it is on-policy, using the same ε-greedy policy both for acting and as the target of its updates, so it learns that walking along the cliff edge is dangerous (the occasional exploratory step falls off) and settles on a safer path. Q-learning, on the other hand, sticks to the optimal edge path: according to its value estimates, which evaluate the greedy policy, that path simply has the larger $Q$. It therefore keeps taking the edge path while exploring ε-greedily and falls off the cliff more often during training. Note that Q-learning’s final greedy policy is nevertheless the optimal one; see the sketch of the two update rules below.
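To make the on-policy vs. off-policy difference concrete, here is a sketch of the two update rules side by side; the array layout and argument names are assumptions for illustration:

```python
import numpy as np

# Q is an (n_states, n_actions) array; a2 is the action the epsilon-greedy
# behaviour policy actually takes in s2. Names are illustrative assumptions.
def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma, done):
    # on-policy: bootstrap on the action the agent will really take, so the
    # occasional exploratory step off the cliff lowers Q along the edge path
    target = r + (0.0 if done else gamma * Q[s2, a2])
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s2, alpha, gamma, done):
    # off-policy: bootstrap on the greedy action, so the edge path keeps its
    # high value even though exploration sometimes falls off during training
    target = r + (0.0 if done else gamma * np.max(Q[s2]))
    Q[s, a] += alpha * (target - Q[s, a])
```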

Overestimation

Recall the Q-learning target:

$R_{t+1} + \gamma \max_a Q(S_{t+1}, a) = R_{t+1} + \gamma\, Q\big(S_{t+1}, \operatorname*{argmax}_a Q(S_{t+1}, a)\big)$

The max uses the same values both to select an action and to evaluate it… but those values are approximate, which makes us:

  • more likely to select overestimated values
  • less likely to select underestimated values

This max-induced bias persists across updates. Imagine you are in a state with 100 actions, each with a stochastic outcome. If one action by chance hits a jackpot, its estimate is inflated and the greedy target keeps picking it, so you keep revisiting that action until enough samples pull the estimate back down. That leads to very slow convergence: the agent is blinded by overestimated values.

Another way to think about it: say we have two random variables $X_1$ and $X_2$; then

$\mathbb{E}[\max(X_1, X_2)] \ge \max(\mathbb{E}[X_1], \mathbb{E}[X_2]).$

Our $Q$ estimates are noisy, so taking the max over them gives (a sample of) the left-hand side, while what we actually want is the right-hand side; the max is therefore biased upward.
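A tiny simulation of this inequality, with made-up numbers purely for illustration: two actions have identical true value 0, but we only see noisy estimates, and the max over the noisy estimates is biased upward:

```python
import numpy as np

# Simulate E[max(X1, X2)] >= max(E[X1], E[X2]) for two actions whose true
# values are both 0, observed through Gaussian noise (illustrative numbers).
rng = np.random.default_rng(0)
true_values = np.array([0.0, 0.0])                  # max of the true values is 0
estimates = true_values + rng.normal(0.0, 1.0, size=(100_000, 2))

print(np.max(true_values))                          # 0.0   -> what we want to estimate
print(np.mean(np.max(estimates, axis=1)))           # ~0.56 -> biased upward by the max
```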

Double Q-learning

Store two action-value functions, $Q_1$ and $Q_2$, with the updates

$Q_1(S_t, A_t) \leftarrow Q_1(S_t, A_t) + \alpha_t \big(R_{t+1} + \gamma\, Q_2(S_{t+1}, \operatorname*{argmax}_a Q_1(S_{t+1}, a)) - Q_1(S_t, A_t)\big) \quad (1)$

$Q_2(S_t, A_t) \leftarrow Q_2(S_t, A_t) + \alpha_t \big(R_{t+1} + \gamma\, Q_1(S_{t+1}, \operatorname*{argmax}_a Q_2(S_{t+1}, a)) - Q_2(S_t, A_t)\big) \quad (2)$

Each step $t$, pick $Q_1$ or $Q_2$ (e.g., randomly) and update it, using (1) for $Q_1$ or (2) for $Q_2$.

This mitigates the issue because the noise in the two estimates is decorrelated: the action that maximizes one estimate is evaluated with the other, so an action that looks best only because of noise in $Q_1$ is unlikely to also be overestimated by $Q_2$. A minimal sketch of one update step follows.
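A minimal sketch of one double Q-learning update step, under the same assumed tabular setup as above:

```python
import numpy as np

# Q1 and Q2 are two (n_states, n_actions) tables; which one to update is
# chosen at random each step. The tabular setup is an assumption.
def double_q_update(Q1, Q2, s, a, r, s2, alpha, gamma, done):
    if np.random.rand() < 0.5:
        # update Q1: select the action with Q1, evaluate it with Q2  (eq. 1)
        a_star = int(np.argmax(Q1[s2]))
        target = r + (0.0 if done else gamma * Q2[s2, a_star])
        Q1[s, a] += alpha * (target - Q1[s, a])
    else:
        # update Q2: select the action with Q2, evaluate it with Q1  (eq. 2)
        a_star = int(np.argmax(Q2[s2]))
        target = r + (0.0 if done else gamma * Q1[s2, a_star])
        Q2[s, a] += alpha * (target - Q2[s, a])
    # the behaviour policy can act epsilon-greedily on, e.g., Q1 + Q2
```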

We can also extend this to SARSA.