We are essentially building a search tree over discrete actions. If the tree grows too large, we switch to other policies.
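You can see that the policy here is basically UCB. Recall that the formula there is (in standard UCB1 notation, where $\hat{Q}_t(a)$ is the empirical mean reward of arm $a$, $N_t(a)$ is the number of times it has been pulled, $t$ is the total number of pulls so far, and $c$ is the exploration constant)

$$a_t = \arg\max_a \left[ \hat{Q}_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$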
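The difference is the count inside the logarithm: the global number of pulls in UCB vs. the visit count of the parent node in UCT. UCT can be understood simply as applying UCB at each tree node, using the parent's count instead of the global count. In the bandit case there is just one state, but now we have multiple.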
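To make the "UCB at each node" idea concrete, here is a minimal sketch of the selection step at a single tree node. The function names, the `children` dictionary layout, and the default exploration constant `c = sqrt(2)` are assumptions made for illustration, not taken from any particular implementation. The only change from the bandit setting is that the logarithm uses the parent node's visit count.

```python
import math


def ucb_score(mean_value, child_visits, parent_visits, c=math.sqrt(2)):
    """UCB score of one child; the 'total count' is the parent's visit count."""
    if child_visits == 0:
        # Assumption: unvisited children get infinite score so they are tried first.
        return math.inf
    return mean_value + c * math.sqrt(math.log(parent_visits) / child_visits)


def uct_select(children, parent_visits, c=math.sqrt(2)):
    """Return the action whose child has the highest UCB score.

    children: dict mapping action -> (total_value, visit_count)
    parent_visits: number of times the parent node has been visited
    """
    def score(action):
        total_value, visits = children[action]
        mean = total_value / visits if visits else 0.0
        return ucb_score(mean, visits, parent_visits, c)

    return max(children, key=score)


# Hypothetical node with two visited children and one unvisited child.
node_children = {"left": (3.0, 5), "right": (1.0, 1), "up": (0.0, 0)}
chosen = uct_select(node_children, parent_visits=6)
```

In a full tree search, each node would keep its own `children` statistics and call `uct_select` with its own visit count, which is what gives one bandit problem per state.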