Coming from Fitted Value Iteration, we don't have the transition dynamics. Switching to a Q-function target is good (we can see the same idea in off-policy actor-critic), because the target no longer depends on your policy: given $(s_i, a_i)$, the reward and next state are just $r(s_i, a_i)$ and $s_i' \sim p(s' \mid s_i, a_i)$, not functions of $\pi$.
Recall previously we had $y_i \leftarrow \max_{a_i}\left(r(s_i, a_i) + \gamma \, \mathbb{E}\!\left[V_\phi(s_i')\right]\right)$.
Now it's $y_i \leftarrow r(s_i, a_i) + \gamma \max_{a_i'} Q_\phi(s_i', a_i')$.
Note how the policy $\pi$ disappears from the target, so this works even for off-policy samples.
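A minimal sketch of this target computation, assuming a hypothetical `q_net(states)` that returns a `(batch, num_actions)` array of Q-values for discrete actions; note that the behavior policy that collected the tuples never appears:

```python
import numpy as np

def fitted_q_targets(q_net, batch, gamma=0.99):
    """y_i = r_i + gamma * max_a' Q_phi(s'_i, a'), computed from logged tuples only."""
    rewards = np.array([t["r"] for t in batch])
    next_states = np.stack([t["s_next"] for t in batch])

    next_q = q_net(next_states)                  # forward pass: (batch, num_actions)
    return rewards + gamma * next_q.max(axis=1)  # max over our own estimates, no model
```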
Why fitted value iteration needs the environment but fitted Q doesn't: In fitted value iteration, we compute $y_i = \max_{a}\left(r(s_i, a) + \gamma \, \mathbb{E}_{s' \sim p(s' \mid s_i, a)}\!\left[V_\phi(s')\right]\right)$. Our dataset only has one tuple $(s_i, a_i, s_i', r_i)$: we took one action and saw one outcome. To evaluate the other actions inside that max, we'd need to know what reward and next state each action leads to, which is exactly the transition model $p(s' \mid s, a)$. These are two sides of the same coin: we're missing data for actions we didn't take, because we lack transition knowledge.
In fitted Q iteration, there's no max over actions at $s_i$. We just use our one tuple as-is: $y_i = r(s_i, a_i) + \gamma \max_{a'} Q_\phi(s_i', a')$. The max happens at $s_i'$ and only requires forward passes through our own Q network: no environment interaction, no missing data.
The $\max_{a'}$ is "free"
This max only queries our own function approximator $Q_\phi$ at different action inputs. Compare with the max in fitted value iteration, where the quantity inside the max ($r(s_i, a) + \gamma \, \mathbb{E}[V_\phi(s')]$) depends on the environment's dynamics. That's the fundamental asymmetry.
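To make the asymmetry concrete, here is a sketch assuming discrete actions: the fitted value iteration target would need a hypothetical `model(s, a)` returning the reward and next state for every candidate action, while the fitted Q target only calls our own `q_net`.

```python
# Fitted VALUE iteration target: each candidate action inside the max has to be
# pushed through the dynamics, which is exactly the data we don't have.
def v_target(model, v_net, s, actions, gamma=0.99):
    candidates = []
    for a in actions:
        r, s_next = model(s, a)                 # requires r(s,a) and p(s'|s,a)
        candidates.append(r + gamma * v_net(s_next))
    return max(candidates)

# Fitted Q iteration target: the max is just forward passes through Q_phi at s'.
def q_target(q_net, r, s_next, gamma=0.99):
    return r + gamma * q_net(s_next).max()
```

The first function cannot even be written without `model`; the second only needs what is already in the dataset plus our own network.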
This is off-policy, and the policy $\pi$ only shows up implicitly in that $\max_{a'} Q_\phi(s_i', a')$: we know the max is the greedy policy's value, obtained by asking $Q_\phi$ for its estimates at $s_i'$. (Our policy is greedy with respect to $Q_\phi$, so policy and Q are basically the same.)

We can make it online:

1. Take some action $a_i$ and observe the tuple $(s_i, a_i, s_i', r_i)$.
2. Compute the target $y_i = r(s_i, a_i) + \gamma \max_{a'} Q_\phi(s_i', a')$.
3. Take one gradient step: $\phi \leftarrow \phi - \alpha \, \frac{dQ_\phi(s_i, a_i)}{d\phi}\left(Q_\phi(s_i, a_i) - y_i\right)$.
For exploring in step 1, instead of acting purely greedily we can use epsilon-greedy; there are other exploration strategies as well.
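A sketch of this online loop with epsilon-greedy exploration, assuming a Gym-style `env` with discrete actions and the older 4-tuple `step` API; the network, sizes, and hyperparameters below are illustrative, not prescriptive:

```python
import random
import torch
import torch.nn as nn

gamma, alpha, epsilon = 0.99, 1e-3, 0.1
obs_dim, n_actions = 4, 2  # illustrative sizes only

q_phi = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
opt = torch.optim.SGD(q_phi.parameters(), lr=alpha)

s = env.reset()  # assumed Gym-style env (old API: reset() -> obs)
for step in range(10_000):
    # 1. take an action (epsilon-greedy instead of purely greedy), observe one tuple
    if random.random() < epsilon:
        a = random.randrange(n_actions)
    else:
        a = q_phi(torch.as_tensor(s, dtype=torch.float32)).argmax().item()
    s_next, r, done, _ = env.step(a)

    # 2. y = r + gamma * max_a' Q_phi(s', a')   (target held constant;
    #    bootstrapping is zeroed at terminal states, a standard practical detail)
    with torch.no_grad():
        y = r + gamma * (0.0 if done else
                         q_phi(torch.as_tensor(s_next, dtype=torch.float32)).max().item())

    # 3. phi <- phi - alpha * dQ/dphi * (Q_phi(s, a) - y)
    q_sa = q_phi(torch.as_tensor(s, dtype=torch.float32))[a]
    loss = 0.5 * (q_sa - y) ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()

    s = env.reset() if done else s_next
```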
Also compare with traditional (tabular) Q-learning: the changes here are basically whether we use batches and function approximation.
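For contrast, a tabular Q-learning update in the same notation (toy sizes, purely illustrative): no batches, no network, and the gradient step collapses into a single table entry moving toward the target.

```python
import numpy as np

n_states, n_actions = 16, 4            # toy sizes
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))    # the whole Q-function is just a table

def tabular_q_update(s, a, r, s_next):
    # Same target as before: y = r + gamma * max_a' Q(s', a')
    y = r + gamma * Q[s_next].max()
    # No function approximation: nudge the single visited entry toward the target
    Q[s, a] += alpha * (y - Q[s, a])
```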
Now we have correlated samples, which brings us on to Deep Q-learning.