An algorithm proposed by R. Sutton.
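- collect some data, consisting of transitions $(s, a, s', r)$
- learn a model $\hat{p}(s' \mid s, a)$ (and optionally, a reward model $\hat{r}(s, a)$)
- repeat $K$ times:
  - sample $s \sim \mathcal{B}$ from the buffer
  - choose an action $a$ (from $\mathcal{B}$, from the policy $\pi$, or at random)
  - simulate $s' \sim \hat{p}(s' \mid s, a)$ (and $r = \hat{r}(s, a)$)
  - train on $(s, a, s', r)$ with model-free RL
  - (optional) take $N$ more model-based steps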
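The recipe above is close to MBPO (model-based policy optimization), which likewise trains a model-free learner on short model rollouts that start from states in the replay buffer.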
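The steps listed above can be sketched in a few lines of Python. This is only a minimal illustration under several assumptions that are not in the notes: a hypothetical 5-state chain environment (`env_step`), a tabular model that simply memorises observed transitions, tabular Q-learning as the model-free learner, and random action selection inside the model. The names `dyna_style`, `q_update`, `K`, and `N` are chosen here for illustration.

```python
import random

N_STATES, ACTIONS = 5, (0, 1)  # hypothetical chain: action 0 = left, 1 = right

def env_step(s, a):
    """Toy deterministic environment (an assumption): reward 1 at the right end."""
    s2 = max(s - 1, 0) if a == 0 else min(s + 1, N_STATES - 1)
    done = s2 == N_STATES - 1
    return s2, (1.0 if done else 0.0), done

def q_update(Q, s, a, r, s2, done, alpha, gamma):
    """One model-free (Q-learning) update on a single transition."""
    target = r if done else r + gamma * max(Q[(s2, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def dyna_style(real_steps=300, K=3000, N=2, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

    # Collect some data: real transitions gathered with a random policy.
    buffer, s = [], 0
    for _ in range(real_steps):
        a = rng.choice(ACTIONS)
        s2, r, done = env_step(s, a)
        buffer.append((s, a, s2, r, done))
        s = 0 if done else s2  # restart the episode at the left end

    # Learn the model and reward: memorising is exact for a deterministic env.
    model = {(s, a): (s2, r, done) for s, a, s2, r, done in buffer}

    # Repeat K times.
    for _ in range(K):
        s = rng.choice(buffer)[0]        # sample a state from the buffer
        for _ in range(1 + N):           # first step plus N more model-based steps
            a = rng.choice(ACTIONS)      # choose an action (at random here)
            if (s, a) not in model:      # the model has never seen this pair
                break
            s2, r, done = model[(s, a)]  # simulate next state and reward
            q_update(Q, s, a, r, s2, done, alpha, gamma)  # model-free update
            if done:
                break
            s = s2
    return Q

if __name__ == "__main__":
    Q = dyna_style()
    # Greedy action per non-terminal state under the learned Q-values.
    print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)])
```

Note that every Q-learning update in the loop uses simulated transitions only; the real environment is touched solely during data collection. Replacing the random action choice with the current greedy policy, or the memorised table with a learned function approximator, would give other variants of the same recipe.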
