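This is related to the TD equation in Function approximation. Note that we never derive the TD formula for $\Delta w$; we just say “replace this part in MC with the TD estimate”. That’s because the resulting update does not really minimize the real loss, the squared TD error (the target $r + \gamma \hat{v}(s', w)$ is treated as a constant even though it depends on $w$): $\frac{1}{2}\left(r + \gamma \hat{v}(s', w) - \hat{v}(s, w)\right)^2$.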
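Now recall the TD error $\delta = r + \gamma \hat{v}(s', w) - \hat{v}(s, w)$.
We can compute the “more sound” update by differentiating $\frac{1}{2}\delta^2$ through both $\hat{v}(s, w)$ and the target: $\Delta w = \alpha\, \delta \left(\nabla_w \hat{v}(s, w) - \gamma \nabla_w \hat{v}(s', w)\right)$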
This tends to work worse in practice. It smooths both states (the “from” state and the “to” state).
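To make the difference between the two updates concrete, here is a minimal sketch assuming a linear value function $\hat{v}(s, w) = w^\top x(s)$, so that the gradient with respect to $w$ is just the feature vector. The function names and the toy feature vectors are hypothetical and only serve as illustration.

```python
import numpy as np

def semi_gradient_td_update(w, x_s, x_next, r, gamma, alpha):
    """TD update that treats the target r + gamma * v(s') as a constant."""
    delta = r + gamma * (w @ x_next) - w @ x_s  # TD error
    return w + alpha * delta * x_s              # only v(s) is differentiated

def residual_gradient_update(w, x_s, x_next, r, gamma, alpha):
    """Full gradient step on 0.5 * delta**2 (the "more sound" update)."""
    delta = r + gamma * (w @ x_next) - w @ x_s  # same TD error
    return w + alpha * delta * (x_s - gamma * x_next)  # also moves v(s')

# Toy transition with one-hot features (assumed values, not from the notes).
w0 = np.zeros(2)
x_s, x_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
w_semi = semi_gradient_td_update(w0, x_s, x_next, r=1.0, gamma=0.9, alpha=0.1)
w_full = residual_gradient_update(w0, x_s, x_next, r=1.0, gamma=0.9, alpha=0.1)
print(w_semi)  # only the weight of the "from" state changes
print(w_full)  # the weight of the "to" state is pushed down as well
```

In this toy case the full-gradient version raises the value of the “from” state and at the same time lowers the value of the “to” state, which is the smoothing of both states mentioned above.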
We can also minimize the Bellman error directly (L1 loss).
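loss: $\frac{1}{2}\left(\mathbb{E}[r + \gamma \hat{v}(s', w) \mid s] - \hat{v}(s, w)\right)^2$
update: $\Delta w = \alpha \left(\mathbb{E}[r + \gamma \hat{v}(s', w) \mid s] - \hat{v}(s, w)\right)\left(\nabla_w \hat{v}(s, w) - \gamma\, \mathbb{E}[\nabla_w \hat{v}(s', w) \mid s]\right)$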
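…but an unbiased sample of this update requires a second independent sample $s''$ of the next state from $s$, which could (randomly) differ from $s'$. (So we can’t use this online, where each visit to $s$ yields only one transition.)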
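The need for a second independent sample can be checked on a toy case. The sketch below assumes a linear value function and a single state with two equally likely next states. It computes, by exact enumeration, the expected update direction when the two factors use independent draws of the next state versus when the same draw is reused in both. All names and numbers are hypothetical.

```python
import numpy as np

def bellman_error_directions(w, x_s, next_feats, rewards, probs, gamma):
    """Expected update directions for one state, computed exactly.

    Returns (independent, reused):
      independent = E[delta] * E[grad]  (two independent next-state samples)
      reused      = E[delta * grad]     (same next-state sample in both factors)
    """
    deltas = rewards + gamma * (next_feats @ w) - x_s @ w  # TD error per outcome
    grads = x_s - gamma * next_feats                       # one row per outcome
    independent = (probs @ deltas) * (probs @ grads)
    reused = probs @ (deltas[:, None] * grads)
    return independent, reused

# One state s with two equally likely successors and rewards +1 / -1.
w = np.zeros(3)
x_s = np.array([1.0, 0.0, 0.0])
next_feats = np.array([[0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])
rewards = np.array([1.0, -1.0])
probs = np.array([0.5, 0.5])

independent, reused = bellman_error_directions(w, x_s, next_feats, rewards, probs, gamma=0.9)
print(independent)  # zero vector: the expected TD error is already zero at this w
print(reused)       # nonzero: reusing the sample still moves the successor weights
```

Here the expected TD error is zero, so the update built from independent samples leaves $w$ alone, while reusing one sample for both factors gives a nonzero expected step. That gap is the bias that the second sample removes.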