Source: DeepMind x UCL RL Lecture Series - Function Approximation [7/13]

Finding the best-fitting value function given a set of past experience, instead of doing it sample by sample.

Say we parameterize the value function linearly, $v_{\mathbf{w}}(s) = \mathbf{w}^\top \mathbf{x}(s)$. Then an obvious choice is least squares: find the weights that minimize the expected squared error to the true values,

\mathbf{w}^* = \arg\min_{\mathbf{w}} \mathbb{E}\left[ \left( v_\pi(S) - \mathbf{w}^\top \mathbf{x}(S) \right)^2 \right]

We can replace that expectation with a sample-based approach, e.g., using observed returns $G_t$ as targets. For the linear case the solution is available in closed form:

\mathbf{w}_t = \left( \sum_{i=1}^{t} \mathbf{x}(S_i)\, \mathbf{x}(S_i)^\top \right)^{-1} \sum_{i=1}^{t} \mathbf{x}(S_i)\, G_i
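As a quick sketch of that closed-form solution (the names `X`, `g`, and the small ridge term are my own illustrative choices, not from the lecture):

```python
import numpy as np

def least_squares_weights(X, g, reg=1e-6):
    """Closed-form least-squares fit of a linear value function.

    X   : (T, d) array whose row t is the feature vector x(S_t)
    g   : (T,) array of targets, e.g. observed returns G_t
    reg : small ridge term so the matrix stays invertible early on
    """
    d = X.shape[1]
    A = X.T @ X + reg * np.eye(d)   # sum_t x_t x_t^T (plus regularizer)
    b = X.T @ g                     # sum_t x_t G_t
    return np.linalg.solve(A, b)    # w = A^{-1} b

# Hypothetical usage: recover known weights from noisy returns
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
g = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=200)
print(least_squares_weights(X, g))  # approximately [1, -2, 0.5, 0]
```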

We can update this on the fly for every new sample that comes in. Recomputing the inverse from scratch each time would be expensive ($O(d^3)$ per step). But if you look at the inverse, you can see that each new sample only adds a rank-one term, so we can use the Sherman-Morrison-Woodbury formula again to update it incrementally in $O(d^2)$:

\left( A + \mathbf{u}\mathbf{v}^\top \right)^{-1} = A^{-1} - \frac{A^{-1}\mathbf{u}\,\mathbf{v}^\top A^{-1}}{1 + \mathbf{v}^\top A^{-1}\mathbf{u}}
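A minimal sketch of that rank-one update (function and variable names are mine), checked against a direct inverse:

```python
import numpy as np

def sherman_morrison(A_inv, u, v):
    """Return inv(A + u v^T) given inv(A), in O(d^2) instead of O(d^3)."""
    Au = A_inv @ u                 # A^{-1} u
    vA = v @ A_inv                 # v^T A^{-1}
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)

# Check against recomputing the inverse from scratch
rng = np.random.default_rng(1)
d = 5
A = np.eye(d) + 0.1 * rng.normal(size=(d, d))
u, v = rng.normal(size=d), rng.normal(size=d)
print(np.allclose(sherman_morrison(np.linalg.inv(A), u, v),
                  np.linalg.inv(A + np.outer(u, v))))  # True
```

In the least-squares setting above, each new feature vector adds a rank-one term such as $\mathbf{x}_t \mathbf{x}_t^\top$ to the matrix being inverted, so the running inverse can be kept up to date sample by sample.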

That leads to Experience Replay, which basically means maintaining a pool of past experience (a replay buffer) and repeatedly sampling from it:

Given experience consisting of trajectories:

\mathcal{D} = \{ S_1, A_1, R_2, S_2, A_2, R_3, \ldots \}

Repeat:

  1. Sample transition(s), e.g., $(S_n, R_{n+1}, S_{n+1})$ for a random $n$

  2. Apply stochastic gradient descent update:

\Delta\mathbf{w} = \alpha(R_{n+1} + \gamma v_{\mathbf{w}}(S_{n+1}) - v_{\mathbf{w}}(S_n))\nabla_{\mathbf{w}}v_{\mathbf{w}}(S_n)
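Putting the loop together, here is a minimal runnable sketch of experience replay with that update, assuming a linear value function; the feature map and the toy environment behind the buffer are hypothetical:

```python
import random
import numpy as np

D = 8  # feature dimension (toy choice)

def x(s):
    """Hypothetical feature map x(s): one-hot encoding of the state."""
    out = np.zeros(D)
    out[s % D] = 1.0
    return out

def v(w, s):
    """Linear value function v_w(s) = w^T x(s)."""
    return w @ x(s)

# Replay buffer: a pool of (S_n, R_{n+1}, S_{n+1}) transitions from a toy cycle
buffer = [(s, float(s % 3 == 0), (s + 1) % D) for s in range(D)]

w = np.zeros(D)
alpha, gamma = 0.1, 0.9

for _ in range(10_000):
    # 1. Sample a transition for a random n
    s, r, s_next = random.choice(buffer)
    # 2. Apply the stochastic gradient descent update from above:
    #    dw = alpha * (R_{n+1} + gamma * v_w(S_{n+1}) - v_w(S_n)) * grad_w v_w(S_n)
    td_error = r + gamma * v(w, s_next) - v(w, s)
    w += alpha * td_error * x(s)  # for linear v_w, grad_w v_w(S_n) = x(S_n)

print(w)  # learned state values for the toy cycle
```

Because samples are drawn at random from the pool, the same transitions get reused many times, which is the same data-reuse motivation as the least-squares approach above.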