The following algorithm was proposed by R. Sutton.

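  1. collect some data, consisting of transitions $(s, a, s', r)$
  2. learn model $\hat{p}(s' \mid s, a)$ (and optionally, $\hat{r}(s, a)$)
  3. repeat $K$ times:
    1. sample $s \sim \mathcal{B}$ from buffer
    2. choose action $a$ (from $\mathcal{B}$, from $\pi$, or random)
    3. simulate $s' \sim \hat{p}(s' \mid s, a)$ (and $r = \hat{r}(s, a)$)
    4. train on $(s, a, s', r)$ with model-free RL
    5. (optional) take $N$ more model-based steps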
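
The steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a reference implementation: the environment is a made-up 6-state chain, the learned model is a lookup table built from the buffer, the model-free learner is tabular Q-learning, and actions are chosen epsilon-greedily. All names (`env_step`, `q_update`, `dyna`, the loop counts `K` and `N`) are hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical toy problem: a chain of 6 states, actions 0 = left, 1 = right.
N_STATES, ACTIONS, GOAL = 6, (0, 1), 5

def env_step(s, a):
    """True environment (toy assumption): deterministic chain, reward 1 at GOAL."""
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    return s2, (1.0 if s2 == GOAL else 0.0)

def greedy(Q, s, rng):
    """Greedy action under Q, breaking ties at random."""
    best = max(Q[(s, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(s, a)] == best])

def q_update(Q, s, a, r, s2, alpha=0.5, gamma=0.9):
    """One model-free (Q-learning) update on a single transition."""
    target = r if s2 == GOAL else r + gamma * max(Q[(s2, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def dyna(K=2000, N=2, n_real=1000, eps=0.2, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)

    # Step 1: collect real transitions (s, a, s', r) with a random policy.
    buffer, s = [], 0
    for _ in range(n_real):
        a = rng.choice(ACTIONS)
        s2, r = env_step(s, a)
        buffer.append((s, a, s2, r))
        s = 0 if s2 == GOAL else s2  # restart the episode at the goal

    # Step 2: learn the model. A lookup table stands in for p_hat and r_hat,
    # which is only adequate because this toy environment is deterministic.
    model = {(s, a): (s2, r) for s, a, s2, r in buffer}

    # Step 3: repeat K times.
    for _ in range(K):
        s = rng.choice(buffer)[0]            # 3.1 sample a state from the buffer
        for _ in range(1 + N):               # 3.5 optionally N more model steps
            # 3.2 choose an action: random with probability eps, else from Q
            a = rng.choice(ACTIONS) if rng.random() < eps else greedy(Q, s, rng)
            if (s, a) not in model:          # the model has no data for this pair
                break
            s2, r = model[(s, a)]            # 3.3 simulate s' and r with the model
            q_update(Q, s, a, r, s2)         # 3.4 model-free update on (s, a, s', r)
            if s2 == GOAL:
                break
            s = s2
    return Q
```

Calling `Q = dyna()` returns the learned action values. No real environment steps are taken inside the planning loop: every update in step 3 uses a transition produced by the learned model, starting from a state that was actually visited.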

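The algorithm described here is close to MBPO (model-based policy optimization), which likewise trains a model-free learner on short model rollouts branched from buffer states.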