Monte Carlo tree search

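MCTS incrementally builds a search tree over a set of discrete actions. Inside the tree, actions are chosen by a tree policy; because the tree cannot grow to cover every state, we switch to another policy (for example, a rollout policy) once the search goes beyond it.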

The tree policy here (UCT) is essentially UCB. Recall that the UCB rule is

$$a_{t} = \arg\max_{a \in A} \left[ Q_{t}(a) + c \sqrt{\frac{\log t}{N_{t}(a)}} \right]$$
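As a quick illustration, here is a minimal sketch of this rule for the single-state bandit case, using the standard square-root exploration bonus $c \sqrt{\log t / N_t(a)}$. The function name `ucb_action`, the list-based inputs, and the choice to try every untried action first (to avoid dividing by zero) are assumptions made for the example.

```python
import math

def ucb_action(q, n, t, c=1.0):
    """Return the index of the action maximising Q_t(a) + c * sqrt(log t / N_t(a)).

    q: list of current value estimates Q_t(a)
    n: list of pull counts N_t(a)
    t: global step count (assumed >= 1)
    """
    # Assumption: any action that has never been tried is selected first,
    # which is equivalent to giving it an infinite exploration bonus.
    for a, count in enumerate(n):
        if count == 0:
            return a
    scores = [q[a] + c * math.sqrt(math.log(t) / n[a]) for a in range(len(q))]
    return max(range(len(q)), key=scores.__getitem__)
```

With `c=0` the rule is purely greedy; a larger `c` favours actions with small $N_t(a)$.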

The only difference is that UCT uses the parent's visit count $N(s_{t-1})$ where UCB uses the global step count $t$. UCT can therefore be understood as applying UCB at each tree node, with the parent's count in place of the global count. In the bandit case there is just one state, whereas here there are many.
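A minimal sketch of what "UCB at each tree node" could look like is given below. The `Node` class, its field names, and `uct_select` are hypothetical names introduced for illustration; the exploration bonus uses the standard square-root form, and unvisited children are assumed to be selected first.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    visits: int = 0          # N(s): number of times this node was visited
    value_sum: float = 0.0   # sum of returns backed up through this node
    children: dict = field(default_factory=dict)  # action -> child Node

    @property
    def q(self):
        # Mean backed-up return; 0.0 for an unvisited node (an assumption).
        return self.value_sum / self.visits if self.visits else 0.0

def uct_select(parent, c=1.0):
    """Pick the action whose child maximises Q + c * sqrt(log N(parent) / N(child))."""
    def score(item):
        _, child = item
        if child.visits == 0:
            return math.inf  # assumption: unvisited children are tried first
        # The parent's visit count plays the role of the global t in UCB.
        return child.q + c * math.sqrt(math.log(parent.visits) / child.visits)
    action, _ = max(parent.children.items(), key=score)
    return action
```

Calling `uct_select` at the root and then again at the chosen child, and so on, walks down the tree; each call only looks at the counts stored at that node and its children, which is exactly the per-node bandit view.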

Yanda's Random Notes
