Residual Bellman updates

This is related to the TD update in Function approximation. Note that we never derived the TD formula for $\Delta w_{t}$. We just said “replace the return in the MC update with the TD estimate”. That’s because the TD update is not a true gradient step on the real loss $E[\delta_{t}^{2}]$: it treats the bootstrapped target as a constant (a semi-gradient).

Now recall

$$\delta_{t} = R_{t + 1} + \gamma v_{w}(S_{t + 1}) - v_{w}(S_{t})$$

Differentiating $\delta_{t}^{2}$ through both value terms gives the “more sound” full-gradient update:

$$\Delta w_{t} = \alpha \delta_{t} \nabla_{w} \left( v_{w}(S_{t}) - \gamma v_{w}(S_{t + 1}) \right)$$
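
As a sketch of where this update comes from, assume the per-sample loss is $\tfrac{1}{2}\delta_{t}^{2}$ (the factor $\tfrac{1}{2}$ is an assumption made here only so that the 2 cancels). Differentiating through both value terms:

$$\nabla_{w} \tfrac{1}{2}\delta_{t}^{2} = \delta_{t} \nabla_{w} \delta_{t} = \delta_{t} \left( \gamma \nabla_{w} v_{w}(S_{t + 1}) - \nabla_{w} v_{w}(S_{t}) \right)$$

A gradient-descent step $\Delta w_{t} = -\alpha \nabla_{w} \tfrac{1}{2}\delta_{t}^{2}$ flips the sign and yields the update above. Semi-gradient TD keeps only the $\nabla_{w} v_{w}(S_{t})$ term and drops $\gamma \nabla_{w} v_{w}(S_{t + 1})$.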

This tends to work worse in practice. It smooths over both states: the update moves the value of the state we came from and the value of the state we arrived at.
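
To make the difference concrete, here is a minimal sketch comparing one semi-gradient TD step with one residual-gradient step. It assumes a linear value function $v_{w}(s) = w \cdot x(s)$ with one-hot features. The function names, feature vectors, and step size are all hypothetical, chosen only for illustration.

```python
from typing import List


def v(w: List[float], x: List[float]) -> float:
    """Linear value estimate v_w(s) = w . x(s)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def td_error(w, x_s, r, x_next, gamma) -> float:
    """delta = R + gamma * v_w(S') - v_w(S)."""
    return r + gamma * v(w, x_next) - v(w, x_s)


def semi_gradient_td_step(w, x_s, r, x_next, alpha, gamma) -> List[float]:
    """TD(0): only v_w(S_t) is differentiated; the target is held fixed."""
    delta = td_error(w, x_s, r, x_next, gamma)
    return [wi + alpha * delta * xi for wi, xi in zip(w, x_s)]


def residual_gradient_step(w, x_s, r, x_next, alpha, gamma) -> List[float]:
    """Full gradient of delta^2 / 2: differentiates both value terms."""
    delta = td_error(w, x_s, r, x_next, gamma)
    return [
        wi + alpha * delta * (xi - gamma * xni)
        for wi, xi, xni in zip(w, x_s, x_next)
    ]


# Two states with one-hot features, so w[0] and w[1] are their values.
x_from, x_to = [1.0, 0.0], [0.0, 1.0]
w0 = [0.0, 0.0]

w_td = semi_gradient_td_step(w0, x_from, 1.0, x_to, alpha=0.5, gamma=0.9)
w_rg = residual_gradient_step(w0, x_from, 1.0, x_to, alpha=0.5, gamma=0.9)

print("semi-gradient TD:  ", w_td)  # only the "from" state changes
print("residual gradient: ", w_rg)  # the "to" state changes as well
```

In this toy transition the TD step only raises the value of the "from" state, while the residual-gradient step also lowers the value of the "to" state, which is the smoothing over both states mentioned above.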

We can also minimize the Bellman error directly, i.e. the square of the expected TD error rather than the expected squared TD error.

loss:

$$E[\delta_{t}]^{2}$$

update:

$$\Delta w_{t} = \alpha \delta_{t} \nabla_{w} \left( v_{w}(S_{t}) - \gamma v_{w}(S_{t + 1}^{'}) \right)$$

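…but this requires a second independent sample $S_{t + 1}^{'}$ of the next state, drawn from the same $S_{t}$, which could (randomly) differ from $S_{t + 1}$. Online we only observe one transition out of $S_{t}$, so we can’t use this update online.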
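
The reason for the second sample is that the gradient of $E[\delta_{t}]^{2}$ is a product of two expectations, and reusing one sample for both factors gives the expectation of a product instead. The sketch below checks this by exact enumeration on a hypothetical toy problem (state names, values, and transition probabilities are assumptions made for illustration): from `A` we move to `B` or `C` with equal probability, with reward 0 and tabular values chosen so that the Bellman error at `A` is exactly zero.

```python
from itertools import product
from typing import Dict, List, Tuple

GAMMA = 1.0

# Hypothetical toy problem: A -> B or A -> C, each with probability 0.5.
TRANSITIONS: List[Tuple[str, float]] = [("B", 0.5), ("C", 0.5)]

# Tabular values; E[delta | A] = 0.5 * 1 + 0.5 * (-1) - 0 = 0.
w: Dict[str, float] = {"A": 0.0, "B": 1.0, "C": -1.0}


def td_error(s_next: str) -> float:
    """delta for the transition A -> s_next with reward 0."""
    return 0.0 + GAMMA * w[s_next] - w["A"]


def direction(s_for_delta: str, s_for_grad: str) -> Dict[str, float]:
    """delta * grad_w (v(A) - gamma * v(S')) for tabular weights."""
    delta = td_error(s_for_delta)
    d = {s: 0.0 for s in w}
    d["A"] += delta
    d[s_for_grad] -= GAMMA * delta
    return d


def expected_single_sample() -> Dict[str, float]:
    """Same next state used for delta and for the gradient."""
    out = {s: 0.0 for s in w}
    for s_next, p in TRANSITIONS:
        for s, g in direction(s_next, s_next).items():
            out[s] += p * g
    return out


def expected_double_sample() -> Dict[str, float]:
    """Independent next states for delta and for the gradient."""
    out = {s: 0.0 for s in w}
    for (s1, p1), (s2, p2) in product(TRANSITIONS, TRANSITIONS):
        for s, g in direction(s1, s2).items():
            out[s] += p1 * p2 * g
    return out


single = expected_single_sample()
double = expected_double_sample()
print("single sample:", single)
print("double sample:", double)
```

With two independent samples the expected update is zero, matching the zero Bellman error. With a single reused sample the expected update is non-zero and pulls the values of `B` and `C` toward each other even though nothing needs fixing.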
