Dyna

Feb 22, 2026

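Dyna is a model-based reinforcement learning algorithm proposed by R. Sutton. It learns a model of the environment from real transitions and then trains a model-free learner on transitions simulated from that model.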

collect some data, consisting of transitions $(s, a, s', r)$
learn model $\hat{p}(s' \mid s, a)$ (and optionally, $\hat{r}(s, a)$)
repeat $K$ times:
1. sample $s \sim B$ from buffer
2. choose action $a$ (from $B$, from $\pi$, or random)
3. simulate $s' \sim \hat{p}(s' \mid s, a)$ (and $r = \hat{r}(s, a)$)
4. train on $(s, a, s', r)$ with model-free RL
5. (optional) take $N$ more model-based steps
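
The steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a reference implementation: the 5-state chain environment, the count-based tabular model, and the choice of Q-learning as the model-free learner are all made up for the example, and every function name (`env_step`, `collect_data`, `learn_model`, `simulate`, `dyna`) is hypothetical. The optional step 5 is left out.

```python
import random
from collections import defaultdict

# Toy setup (an assumption for illustration): chain of states 0..4, state 4 is terminal.
N_STATES = 5
ACTIONS = (0, 1)  # 0 = left, 1 = right
GAMMA = 0.9
ALPHA = 0.5


def env_step(s, a):
    """True environment: deterministic chain, reward 1 on reaching the last state."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, r


def collect_data(rng, n_steps):
    """Collect transitions (s, a, s', r) with a uniformly random policy."""
    buffer, s = [], 0
    for _ in range(n_steps):
        a = rng.choice(ACTIONS)
        s_next, r = env_step(s, a)
        buffer.append((s, a, s_next, r))
        s = 0 if s_next == N_STATES - 1 else s_next  # reset after the terminal state
    return buffer


def learn_model(buffer):
    """Tabular model: next-state counts stand in for p_hat, mean reward for r_hat."""
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    visits = defaultdict(int)
    for s, a, s_next, r in buffer:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r
        visits[(s, a)] += 1
    r_hat = {sa: reward_sum[sa] / visits[sa] for sa in visits}
    return counts, r_hat


def simulate(counts, r_hat, s, a, rng):
    """Sample s' ~ p_hat(. | s, a) and return r = r_hat(s, a)."""
    next_states = list(counts[(s, a)].keys())
    weights = list(counts[(s, a)].values())
    s_next = rng.choices(next_states, weights=weights)[0]
    return s_next, r_hat[(s, a)]


def dyna(k_updates=2000, n_real_steps=500, seed=0):
    rng = random.Random(seed)
    buffer = collect_data(rng, n_real_steps)   # collect some data
    counts, r_hat = learn_model(buffer)        # learn the model
    seen = sorted(r_hat)                       # (s, a) pairs the model has data for
    q = defaultdict(float)
    for _ in range(k_updates):                 # repeat K times
        s = rng.choice(buffer)[0]              # 1. sample s ~ B
        a = rng.choice([b for (s2, b) in seen if s2 == s])  # 2. random action seen at s
        s_next, r = simulate(counts, r_hat, s, a, rng)      # 3. simulate with the model
        terminal = s_next == N_STATES - 1
        bootstrap = 0.0 if terminal else GAMMA * max(q[(s_next, b)] for b in ACTIONS)
        q[(s, a)] += ALPHA * (r + bootstrap - q[(s, a)])    # 4. model-free (Q-learning) update
    return q


if __name__ == "__main__":
    q = dyna()
    for s in range(N_STATES - 1):
        print(s, "right" if q[(s, 1)] > q[(s, 0)] else "left")
```

Note that the inner loop never touches `env_step`: after the initial data collection, all training transitions come from the learned model, which is the point of the scheme.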

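The version described here is close to MBPO (model-based policy optimization), which also trains a model-free learner on short model rollouts that start from states sampled from the buffer.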
