Source: DeepMind x UCL RL Lecture Series - Function Approximation [7/13]

Finding the best-fitting value function given a set of past experience, instead of doing it sample by sample.

Say we parameterize the value function linearly, $v_{\mathbf{w}}(s) = \mathbf{w}^\top \mathbf{x}(s)$. Then an obvious choice is least squares: find the weights that minimize the expected squared error to the true values,

\mathbf{w}^* = \arg\min_{\mathbf{w}} \mathbb{E}\left[ \left( v_\pi(S) - \mathbf{w}^\top \mathbf{x}(S) \right)^2 \right]

We can replace that expectation with a sample-based approach, e.g., using observed returns $G_t$ as targets. For the linear case the solution is available in closed form:

\mathbf{w}_t = \left( \sum_{i=1}^{t} \mathbf{x}(S_i)\, \mathbf{x}(S_i)^\top \right)^{-1} \sum_{i=1}^{t} \mathbf{x}(S_i)\, G_i
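As a quick sketch of that closed-form solution (the names `X`, `g`, and the small ridge term are my own illustrative choices, not from the lecture):

```python
import numpy as np

def least_squares_weights(X, g, reg=1e-6):
    """Closed-form least-squares fit of a linear value function.

    X   : (T, d) array whose row t is the feature vector x(S_t)
    g   : (T,) array of targets, e.g. observed returns G_t
    reg : small ridge term so the matrix stays invertible early on
    """
    d = X.shape[1]
    A = X.T @ X + reg * np.eye(d)   # sum_t x_t x_t^T (plus regularizer)
    b = X.T @ g                     # sum_t x_t G_t
    return np.linalg.solve(A, b)    # w = A^{-1} b

# Hypothetical usage: recover known weights from noisy returns
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
g = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=200)
print(least_squares_weights(X, g))  # approximately [1, -2, 0.5, 0]
```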

We can update this on the fly for every new sample that comes in. Recomputing the inverse from scratch each time would be expensive ($O(d^3)$ per step). But if you look at the inverse, you can see that each new sample only adds a rank-one term, so we can use the Sherman-Morrison-Woodbury formula again to update it incrementally in $O(d^2)$:

\left( A + \mathbf{u}\mathbf{v}^\top \right)^{-1} = A^{-1} - \frac{A^{-1}\mathbf{u}\,\mathbf{v}^\top A^{-1}}{1 + \mathbf{v}^\top A^{-1}\mathbf{u}}
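A minimal sketch of that rank-one update (function and variable names are mine), checked against a direct inverse:

```python
import numpy as np

def sherman_morrison(A_inv, u, v):
    """Return inv(A + u v^T) given inv(A), in O(d^2) instead of O(d^3)."""
    Au = A_inv @ u                 # A^{-1} u
    vA = v @ A_inv                 # v^T A^{-1}
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)

# Check against recomputing the inverse from scratch
rng = np.random.default_rng(1)
d = 5
A = np.eye(d) + 0.1 * rng.normal(size=(d, d))
u, v = rng.normal(size=d), rng.normal(size=d)
print(np.allclose(sherman_morrison(np.linalg.inv(A), u, v),
                  np.linalg.inv(A + np.outer(u, v))))  # True
```

In the least-squares setting above, each new feature vector adds a rank-one term such as $\mathbf{x}_t \mathbf{x}_t^\top$ to the matrix being inverted, so the running inverse can be kept up to date sample by sample.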

That leads to Experience Replay, which basically means maintaining a pool of past experience (a replay buffer) and repeatedly sampling from it:

Given experience consisting of trajectories:

\mathcal{D} = \{ S_1, A_1, R_2, S_2, A_2, R_3, \ldots \}

Repeat:

  1. Sample transition(s), e.g., $(S_n, R_{n+1}, S_{n+1})$ for a random $n$

  2. Apply stochastic gradient descent update:

\Delta\mathbf{w} = \alpha(R_{n+1} + \gamma v_{\mathbf{w}}(S_{n+1}) - v_{\mathbf{w}}(S_n))\nabla_{\mathbf{w}}v_{\mathbf{w}}(S_n)
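Putting the loop together, here is a minimal runnable sketch of experience replay with that update, assuming a linear value function; the feature map and the toy environment behind the buffer are hypothetical:

```python
import random
import numpy as np

D = 8  # feature dimension (toy choice)

def x(s):
    """Hypothetical feature map x(s): one-hot encoding of the state."""
    out = np.zeros(D)
    out[s % D] = 1.0
    return out

def v(w, s):
    """Linear value function v_w(s) = w^T x(s)."""
    return w @ x(s)

# Replay buffer: a pool of (S_n, R_{n+1}, S_{n+1}) transitions from a toy cycle
buffer = [(s, float(s % 3 == 0), (s + 1) % D) for s in range(D)]

w = np.zeros(D)
alpha, gamma = 0.1, 0.9

for _ in range(10_000):
    # 1. Sample a transition for a random n
    s, r, s_next = random.choice(buffer)
    # 2. Apply the stochastic gradient descent update from above:
    #    dw = alpha * (R_{n+1} + gamma * v_w(S_{n+1}) - v_w(S_n)) * grad_w v_w(S_n)
    td_error = r + gamma * v(w, s_next) - v(w, s)
    w += alpha * td_error * x(s)  # for linear v_w, grad_w v_w(S_n) = x(S_n)

print(w)  # learned state values for the toy cycle
```

Because samples are drawn at random from the pool, the same transitions get reused many times, which is the same data-reuse motivation as the least-squares approach above.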