Source: DeepMind x UCL RL Lecture Series - Function Approximation [7/13]
Finding the best-fitting value function given a set of past experience, instead of doing it sample by sample.
Say we parameterize the value function with a linear function, $v_{\mathbf{w}}(s) = \mathbf{w}^{\top}\mathbf{x}(s)$; then an obvious choice is least squares:

$$\mathbf{w}^{*} = \arg\min_{\mathbf{w}} \mathbb{E}\left[\left(v_{\pi}(S) - \mathbf{w}^{\top}\mathbf{x}(S)\right)^{2}\right] = \mathbb{E}\left[\mathbf{x}(S)\,\mathbf{x}(S)^{\top}\right]^{-1}\mathbb{E}\left[\mathbf{x}(S)\,v_{\pi}(S)\right]$$
We can replace those expectations with sample-based estimates, using sampled returns $G_t$ as targets:

$$\mathbf{w} = \left(\sum_{t} \mathbf{x}_t \mathbf{x}_t^{\top}\right)^{-1} \sum_{t} \mathbf{x}_t G_t\,, \qquad \text{where } \mathbf{x}_t \equiv \mathbf{x}(S_t)$$
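As a minimal sketch (assuming a precollected feature matrix `X` whose rows are the $\mathbf{x}_t^{\top}$ and a return vector `g`; the function name and the ridge term are illustrative, not from the lecture), the batch solution is a single linear solve:

```python
import numpy as np

def least_squares_weights(X, g, reg=1e-6):
    """Batch least-squares fit: w = (X^T X)^{-1} X^T g.

    X   : (T, n) matrix whose rows are feature vectors x(S_t)
    g   : (T,) vector of sampled returns G_t
    reg : small ridge term to keep X^T X invertible
    """
    A = X.T @ X + reg * np.eye(X.shape[1])
    b = X.T @ g
    # Solve A w = b directly rather than forming the inverse explicitly.
    return np.linalg.solve(A, b)
```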
We can update this on the fly for every new sample that comes in. That can be expensive: naively recomputing the inverse after each sample costs $O(n^3)$ for $n$ features. But since all we ever need is the inverse, we can use the Sherman-Morrison-Woodbury formula again to update it incrementally in $O(n^2)$ per sample:

$$\left(A + \mathbf{u}\mathbf{v}^{\top}\right)^{-1} = A^{-1} - \frac{A^{-1}\mathbf{u}\,\mathbf{v}^{\top}A^{-1}}{1 + \mathbf{v}^{\top}A^{-1}\mathbf{u}}$$
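A minimal sketch of this incremental update, assuming sampled returns as targets (the class and attribute names are illustrative assumptions):

```python
import numpy as np

class IncrementalLeastSquares:
    """Maintains w = A^{-1} b with A = sum_t x_t x_t^T and b = sum_t x_t G_t,
    updating A^{-1} via Sherman-Morrison in O(n^2) per sample."""

    def __init__(self, num_features, reg=1e-3):
        # Start from A = reg * I so that A^{-1} is well defined.
        self.inv_A = np.eye(num_features) / reg
        self.b = np.zeros(num_features)

    def update(self, x, g):
        # Sherman-Morrison rank-one update of A^{-1} for A <- A + x x^T.
        Ax = self.inv_A @ x
        self.inv_A -= np.outer(Ax, Ax) / (1.0 + x @ Ax)
        self.b += g * x

    @property
    def w(self):
        # Current least-squares solution, no O(n^3) inversion needed.
        return self.inv_A @ self.b
```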
That leads to Experience Replay, which basically means maintaining a pool of trajectories and repeatedly sampling from it:
Given experience consisting of trajectories:

$$\mathcal{D} = \left\{(S_t, A_t, R_{t+1}, S_{t+1})\right\}_{t=0}^{T}$$

Repeat:

- Sample transition(s), e.g., $(S_n, A_n, R_{n+1}, S_{n+1})$ for a uniformly random $n \in \{0, \ldots, T\}$
- Apply stochastic gradient descent update:

$$\Delta\mathbf{w} = \alpha\left(R_{n+1} + \gamma v_{\mathbf{w}}(S_{n+1}) - v_{\mathbf{w}}(S_n)\right)\nabla_{\mathbf{w}}v_{\mathbf{w}}(S_n)$$
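A minimal sketch of this replay loop for a linear value function, where $\nabla_{\mathbf{w}}v_{\mathbf{w}}(S_n) = \mathbf{x}(S_n)$ (the buffer layout, function names, and hyperparameters are illustrative assumptions, not from the lecture):

```python
import numpy as np

def train_with_replay(buffer, features, num_features,
                      num_updates=10_000, alpha=0.01, gamma=0.99, seed=0):
    """Experience replay for a linear value function v_w(s) = w^T x(s).

    buffer   : list of (s, r, s_next, done) transitions
    features : function mapping a state to a feature vector x(s)
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(num_features)
    for _ in range(num_updates):
        # Sample a stored transition uniformly at random.
        s, r, s_next, done = buffer[rng.integers(len(buffer))]
        x, x_next = features(s), features(s_next)
        # TD target bootstraps from the current weights.
        target = r + (0.0 if done else gamma * (w @ x_next))
        td_error = target - w @ x
        # SGD step: Delta w = alpha * td_error * x, since grad v_w(s) = x(s).
        w += alpha * td_error * x
    return w
```

Because transitions are drawn at random from the pool rather than consumed in order, each stored sample can contribute to many updates, which is the point of replay over purely online learning.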