An interactive MCTS lab — authored by Zvika Berger
In computer science, Monte Carlo tree search (MCTS) is a heuristic search algorithm for some kinds of decision processes, most notably those employed in software that plays board games. In that context MCTS is used to solve the game tree.
MCTS was combined with neural networks in 2016 and has been used in multiple board games like Chess, Shogi, Checkers, Backgammon, Contract Bridge, Go, Scrabble, and Clobber as well as in turn-based-strategy video games (such as Total War: Rome II's implementation in the high-level campaign AI) and applications outside of games.
A game tree grows exponentially with depth. If a position has on average b legal moves and the game lasts d plies, the total number of leaves is roughly bd. Even modest values make exhaustive search hopeless.
Minimax visits every leaf of the tree to compute the exact value. That's fine for tic-tac-toe — but for chess (10120) it would take longer than the age of the universe.
Alpha-Beta pruning helps, but only as a constant-factor improvement: with perfect move ordering it cuts the effective branching factor from b down to roughly √b, so the work becomes ~bd/2. For chess that's still ~1060, for go ~10180 — still impossible. Move ordering also depends on a strong heuristic evaluation function, which is what makes alpha-beta so hard to apply to games like go where no simple evaluation exists.
MCTS dodges the problem entirely. It never enumerates the full tree — it samples the most promising branches via random rollouts and grows the search tree asymmetrically. With the same compute budget it can effectively search much deeper than minimax / alpha-beta, which is why it powered AlphaGo and modern game engines.
N(v) and add the reward to Q(v).Each iteration of MCTS is exactly four phases. After many iterations the algorithm converges to a strong recommendation.
Following the classic Wikipedia tree — each node shows wins / visits. The four panels below trace one full iteration from root 11/21 through the selection chain to a new leaf, a random rollout that wins, and the resulting backpropagated counts.
# mcts.py — the complete MCTS algorithm import math, random class MCTSNode: def __init__(self, state, parent=None, action_taken=None): self.state, self.parent = state, parent self.action_taken = action_taken self.children = [] self.visit_count = 0 self.total_reward = 0.0 self.untried_actions = state.legal_actions() @property def is_terminal(self): return self.state.is_terminal() @property def is_fully_expanded(self): return len(self.untried_actions) == 0 def ucb1(self, C=math.sqrt(2)): if self.visit_count == 0: return float("inf") mean = self.total_reward / self.visit_count return mean + C * math.sqrt(math.log(self.parent.visit_count) / self.visit_count) def best_child(self, C): return max(self.children, key=lambda c: c.ucb1(C)) def expand(self): action = self.untried_actions.pop() child = MCTSNode(self.state.apply(action), parent=self, action_taken=action) self.children.append(child) return child class MCTS: def __init__(self, C=math.sqrt(2)): self.C = C def search(self, root_state, num_simulations=1000): root = MCTSNode(state=root_state) for _ in range(num_simulations): node = self._select(root) # Phase 1 — SELECTION if not node.is_terminal: # Phase 2 — EXPANSION node = node.expand() reward = self._simulate(node.state) # Phase 3 — SIMULATION self._backpropagate(node, reward) # Phase 4 — BACKPROP return max(root.children, key=lambda c: c.visit_count).action_taken def _select(self, node): while not node.is_terminal and node.is_fully_expanded: node = node.best_child(self.C) return node def _simulate(self, state): while not state.is_terminal(): state = state.apply(random.choice(state.legal_actions())) return state.reward() def _backpropagate(self, node, reward): while node is not None: node.visit_count += 1 node.total_reward += reward reward = 1.0 - reward # flip for opponent's perspective node = node.parent
UCB1 already directed effort to the truly best moves, so visit-count is essentially MCTS's "confidence vote."
Q / NC · √(ln Np / N)def ucb1(self, exploration_constant=math.sqrt(2)): if self.visit_count == 0: return float("inf") # always explore unvisited children first exploitation = self.total_reward / self.visit_count exploration = exploration_constant * math.sqrt( math.log(self.parent.visit_count) / self.visit_count ) return exploitation + exploration
A subtle but critical detail: N(v) == 0 returns
+∞, so every child is tried at least once before scores are compared.
Inside backprop, each node stores the reward from the perspective of the player who moved to it — so UCB1 maximisation at every level is correct without manually flipping signs at each step.
Each arm is a slot machine with a hidden true reward (the tower's height). Three agents pull arms in parallel and try to find the tallest one. UCB1 explores until it's confident, then exploits. ε-greedy mostly exploits but takes random shots. Random doesn't learn. Watch the regret chart below — UCB's gap to the optimal arm shrinks fastest.
Click Next phase to advance MCTS one phase at a time. The active phase glows, the search path lights up, and the tree builds in front of you. Edit the board above and hit ↺ Reset tree to start over from a new position.
AlphaGo actually trained two policy networks: an SL policy (network 1, learned by imitating ~30M human moves) and a stronger RL policy (network 2, improved by playing itself). You'd expect the stronger RL net to be the one wired into the MCTS — but in the historic match against Lee Sedol it was never used inside the search tree at all.
DeepMind found the SL policy guided Selection & Expansion better. Having only ever played itself, the RL net's vision was narrow — fixated on moves that were optimal against itself. The human-trained SL net offered a broader, more natural variety of moves, so the tree explored far more directions.
Then why train the RL net at all? It was a data factory: it played ~30 million games against itself, and the outcomes of those games were the sole training data for the value network.
A probability over all legal moves. Role: the prior used during Selection.
Trained from: the MCTS visit distribution in self-play.
One scalar in [−1,+1]. Role: the leaf score in Simulation — replaces the rollout.
Trained from: the actual self-play outcome z.
Each node shows value / visits. One search iteration — but Selection uses the two-head net's policy prior on the first move from the root, and Simulation is its value head, not a random play-out.
P(s,a) — it biases the very first move from the root (no UCB1).V(s)=+0.46.Q, alternating by player.s, the MCTS visit distribution π at that state, and the final game outcome z (±1 / 0).KL(policy, π) + (V − z)2 + L2 regularisation. Policy learns from MCTS; value learns from actual outcomes.