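This is related to the TD equation in Function approximation. Note that we never derive the TD formula for $\Delta w_t$ from a loss. We just say “replace this part in MC with the TD estimate”. That’s because TD is not really minimizing the real loss, the expected squared TD error $\mathbb{E}[\delta_t^2]$: it treats the bootstrapped target as a constant and ignores its dependence on $w$.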
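To make the “replace the MC target with a TD estimate” step concrete, here is a minimal sketch of one semi-gradient TD(0) step. It assumes a linear value function $v_w(s) = w^\top x(s)$; the names `dot`, `td0_update`, `x_s`, `x_next` and the default `alpha` and `gamma` are illustrative choices, not something taken from these notes.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def td0_update(w, x_s, r, x_next, alpha=0.1, gamma=0.9):
    """One semi-gradient TD(0) step for a linear value function (assumed form)."""
    # TD error: bootstrapped target minus current estimate.
    delta = r + gamma * dot(w, x_next) - dot(w, x_s)
    # The target is treated as a constant, so only the gradient of
    # v_w(S_t), which is x_s for a linear model, appears in the update.
    return [wi + alpha * delta * xi for wi, xi in zip(w, x_s)]


# Two states with one-hot features: a transition from state 0 to state 1.
w_td = td0_update([0.0, 0.0], x_s=[1.0, 0.0], r=1.0, x_next=[0.0, 1.0])
print(w_td)  # only the weight of the "from" state changes
```

With one-hot features, only the entry for $S_t$ moves; the value of $S_{t+1}$ is left untouched, which is exactly the dependence that the semi-gradient ignores.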

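Now recall the TD error:

$$\delta_t = R_{t+1} + \gamma v_w(S_{t+1}) - v_w(S_t)$$

Differentiating $\mathbb{E}[\delta_t^2]$ through both $v_w(S_t)$ and the bootstrapped target $v_w(S_{t+1})$ (and absorbing the constant factor into $\alpha$), we can compute the “more sound” update:

$$\Delta w_t = \alpha\,\delta_t\,\nabla_w\big(v_w(S_t) - \gamma v_w(S_{t+1})\big)$$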

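This tends to work worse in practice. It smooths the values of both states in the transition (the “from” state $S_t$ and the “to” state $S_{t+1}$), whereas TD only moves the prediction for $S_t$.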
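The smoothing effect can be seen in a small sketch of the residual-gradient step $\Delta w_t = \alpha\,\delta_t\,\nabla_w\big(v_w(S_t) - \gamma v_w(S_{t+1})\big)$. As an assumption for illustration, the value function is linear with one-hot features, and the names `dot`, `residual_gradient_update`, `x_s` and `x_next` are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def residual_gradient_update(w, x_s, r, x_next, alpha=0.1, gamma=0.9):
    """One residual-gradient step on the squared TD error (linear v_w assumed)."""
    delta = r + gamma * dot(w, x_next) - dot(w, x_s)
    # For a linear model, grad_w (v_w(S_t) - gamma * v_w(S_{t+1}))
    # is x_s - gamma * x_next, so both states' features enter the update.
    return [
        wi + alpha * delta * (xs - gamma * xn)
        for wi, xs, xn in zip(w, x_s, x_next)
    ]


# Same transition as a tabular example: from state 0 to state 1, reward 1.
w_rg = residual_gradient_update([0.0, 0.0], x_s=[1.0, 0.0], r=1.0, x_next=[0.0, 1.0])
print(w_rg)  # roughly [0.1, -0.09]: the successor's value is pushed down too
```

The value of the “from” state rises toward the target while the value of the “to” state is lowered to meet it, so the error is split between the two states instead of being corrected only at $S_t$.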

We can also minimize the Bellman error directly, i.e. the squared expected TD error rather than the expected squared TD error.

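loss: $\mathbb{E}[\delta_t \mid S_t]^2$

update: $\Delta w_t = \alpha\,\delta_t\,\nabla_w\big(v_w(S_t) - \gamma v_w(S'_{t+1})\big)$

…but this requires a second, independent sample $S'_{t+1}$ of the next state from $S_t$, which could (randomly) differ from $S_{t+1}$. (So we can’t use this online, where each transition is observed only once.)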
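A minimal sketch of this double-sample update follows, under the assumption of a linear value function and a simulator that can draw two independent next states from the same $S_t$. The TD error uses the first sample and the gradient term uses the second, so their product is an unbiased estimate of the product of the two expectations. The names `dot`, `bellman_error_update`, `x_next` and `x_next_indep` are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def bellman_error_update(w, x_s, r, x_next, x_next_indep, alpha=0.1, gamma=0.9):
    """One double-sample step on the Bellman error (linear v_w assumed).

    x_next and x_next_indep are features of two independent draws of the
    next state from the same S_t.
    """
    # TD error computed from the first sampled next state.
    delta = r + gamma * dot(w, x_next) - dot(w, x_s)
    # Gradient term computed from the second, independent next state.
    return [
        wi + alpha * delta * (xs - gamma * xn2)
        for wi, xs, xn2 in zip(w, x_s, x_next_indep)
    ]


# Three one-hot states: from state 0, the two draws land in states 1 and 2.
w_be = bellman_error_update(
    [0.0, 0.0, 0.0],
    x_s=[1.0, 0.0, 0.0],
    r=1.0,
    x_next=[0.0, 1.0, 0.0],
    x_next_indep=[0.0, 0.0, 1.0],
)
print(w_be)  # roughly [0.1, 0.0, -0.09]
```

When the two draws coincide, this reduces to the residual-gradient step; the difference only shows up when the transition is stochastic, which is also when a single online sample is not enough.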