03 · Planning

Monte Carlo Tree Search

An interactive MCTS lab — authored by Zvika Berger

What is MCTS?

In computer science, Monte Carlo tree search (MCTS) is a heuristic search algorithm for some kinds of decision processes, most notably those employed in software that plays board games. In that context MCTS is used to solve the game tree.

MCTS was combined with neural networks in 2016 and has been used in multiple board games like Chess, Shogi, Checkers, Backgammon, Contract Bridge, Go, Scrabble, and Clobber as well as in turn-based-strategy video games (such as Total War: Rome II's implementation in the high-level campaign AI) and applications outside of games.

— Adapted from Wikipedia

The problem: game trees explode

A game tree grows exponentially with depth. If a position has on average b legal moves and the game lasts d plies, the total number of leaves is roughly b^d. Even modest values make exhaustive search hopeless.

Game	Avg. branching	Avg. depth	Total positions ≈
Tic-Tac-Toe	~ 5	9	~ 2.6 × 10⁵ (solvable)
Chess	~ 35	~ 80	~ 10¹²⁰ (Shannon number)
Go (19×19)	~ 250	~ 150	~ 10³⁶⁰ (more than atoms in the universe)

Minimax visits every leaf of the tree to compute the exact value. That's fine for tic-tac-toe — but for chess (10¹²⁰) it would take longer than the age of the universe.

Alpha-Beta pruning helps, but only as a constant-factor improvement: with perfect move ordering it cuts the effective branching factor from b down to roughly √b, so the work becomes ~b^d/2. For chess that's still ~10⁶⁰, for go ~10¹⁸⁰ — still impossible. Move ordering also depends on a strong heuristic evaluation function, which is what makes alpha-beta so hard to apply to games like go where no simple evaluation exists.

MCTS dodges the problem entirely. It never enumerates the full tree — it samples the most promising branches via random rollouts and grows the search tree asymmetrically. With the same compute budget it can effectively search much deeper than minimax / alpha-beta, which is why it powered AlphaGo and modern game engines.

When each method works · when it fails

Method	Works when…	Fails when…
Minimax	Tree is small (~10⁶ nodes)	Branching factor is huge (Go has ~250 moves/turn)
Alpha-Beta	Good move ordering + a strong evaluation function exist	Branching factor is still very large, or no simple eval (e.g. Go)
MCTS	Simulation is fast and cheap; no evaluation function needed	Simulation is slow or reward is delayed past many moves

Where MCTS is used in the real world

AlphaGo / AlphaZero

Go has ~250 moves per turn. Exhaustive search is impossible — MCTS paired with a neural network solved it.

Chess engines

Leela Chess Zero combines MCTS with deep neural value/policy nets.

Video-game AI

Real-time strategy games with huge action spaces and uncertainty.

Retrosynthesis

MCTS plans multi-step chemical syntheses — searching trees of reactions to find a route to a target molecule (e.g. AiZynthFinder).

Robot planning

Path planning in uncertain environments with sparse reward signals.

Combinatorial opt.

Scheduling, bin packing, travelling-salesman variants.

The four phases of one MCTS iteration

PHASE 01

Selection

Starting at the root, walk DOWN the tree, always choosing the child with the highest UCB1 score. Stop at a node that still has untried actions or is terminal.

PHASE 02

Expansion

Pick one untried action from the leaf, create the corresponding new child node, and attach it to the tree.

PHASE 03

Simulation

From the new child, play out random moves until the game ends. Record the outcome (win / lose / draw).

PHASE 04

Backpropagation

Walk back UP to the root. At every node on the path, increment N(v) and add the reward to Q(v).

The MCTS Loop

Each iteration of MCTS is exactly four phases. After many iterations the algorithm converges to a strong recommendation.

Worked example — one full MCTS iteration

Following the classic Wikipedia tree — each node shows wins / visits. The four panels below trace one full iteration from root 11/21 through the selection chain to a new leaf, a random rollout that wins, and the resulting backpropagated counts.

PHASE 01Selection

Walk down via UCB1, always picking the most-promising child. Stop at a leaf with untried actions.

PHASE 02Expansion

Add one new child below the selected leaf (a fresh node initialised to 0/0).

PHASE 03Simulation

Random rollout from the new child until terminal. Outcome here: a win.

PHASE 04Backpropagation

Walk back up. Visits go +1 everywhere, but the reward alternates by player: +1/+1 on the winning side's nodes, −1/+1 on the opponent's.

In Python — the entire algorithm

mcts.py
Python 3
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
# mcts.py — the complete MCTS algorithm
import math, random

class MCTSNode:
    def __init__(self, state, parent=None, action_taken=None):
        self.state, self.parent = state, parent
        self.action_taken = action_taken
        self.children = []
        self.visit_count = 0
        self.total_reward = 0.0
        self.untried_actions = state.legal_actions()

    @property
    def is_terminal(self):       return self.state.is_terminal()
    @property
    def is_fully_expanded(self): return len(self.untried_actions) == 0

    def ucb1(self, C=math.sqrt(2)):
        if self.visit_count == 0: return float("inf")
        mean = self.total_reward / self.visit_count
        return mean + C * math.sqrt(math.log(self.parent.visit_count) / self.visit_count)

    def best_child(self, C):
        return max(self.children, key=lambda c: c.ucb1(C))

    def expand(self):
        action = self.untried_actions.pop()
        child  = MCTSNode(self.state.apply(action), parent=self, action_taken=action)
        self.children.append(child)
        return child


class MCTS:
    def __init__(self, C=math.sqrt(2)):
        self.C = C

    def search(self, root_state, num_simulations=1000):
        root = MCTSNode(state=root_state)
        for _ in range(num_simulations):
            node = self._select(root)            # Phase 1 — SELECTION
            if not node.is_terminal:             # Phase 2 — EXPANSION
                node = node.expand()
            reward = self._simulate(node.state)  # Phase 3 — SIMULATION
            self._backpropagate(node, reward)    # Phase 4 — BACKPROP
        return max(root.children, key=lambda c: c.visit_count).action_taken

    def _select(self, node):
        while not node.is_terminal and node.is_fully_expanded:
            node = node.best_child(self.C)
        return node

    def _simulate(self, state):
        while not state.is_terminal():
            state = state.apply(random.choice(state.legal_actions()))
        return state.reward()

    def _backpropagate(self, node, reward):
        while node is not None:
            node.visit_count += 1
            node.total_reward += reward
            reward = 1.0 - reward             # flip for opponent's perspective
            node = node.parent

Why most-visited, not highest Q-value?

After all simulations, you might think to choose the child with the highest average reward. Don't.

Unreliable

A node with 3 visits, all wins has Q = 1.0 — but it was barely explored. The 1.0 could be luck.

Reliable

A node with 300 visits, 70 % win rate has Q = 0.7 — heavily sampled, so the estimate is trustworthy.

UCB1 already directed effort to the truly best moves, so visit-count is essentially MCTS's "confidence vote."

The UCB1 formula

Upper Confidence Bound 1

UCB1(v) = μ_v + C · √( ln N / N(v) )

Exploitation · Q / N

The average reward seen through this child. High values mean "this move has been good so far — keep using it."

Exploration · C · √(ln N_p / N)

Grows when this node is under-visited. Forces MCTS to occasionally try children that don't look the best yet — maybe they're actually better.

What does C control?

C = 0

Pure exploitation. Pours every simulation into the currently-best child. Can miss the truly best move because it never tested it.

C = √2 ≈ 1.414

Balanced (theoretical optimum). Spreads visits but still skews toward good moves. The standard default.

C = 5

Heavy exploration. Visits are nearly uniform — MCTS can't distinguish good from bad in time.

UCB1 in code

node.py
Python 3
1
2
3
4
5
6
7
8
9
def ucb1(self, exploration_constant=math.sqrt(2)):
    if self.visit_count == 0:
        return float("inf")  # always explore unvisited children first

    exploitation = self.total_reward / self.visit_count
    exploration  = exploration_constant * math.sqrt(
        math.log(self.parent.visit_count) / self.visit_count
    )
    return exploitation + exploration

A subtle but critical detail: N(v) == 0 returns +∞, so every child is tried at least once before scores are compared.

Reward convention

Outcome	Reward (Player 1's view)
Player 1 wins	1.0
Draw	0.5
Player 2 wins	0.0

Inside backprop, each node stores the reward from the perspective of the player who moved to it — so UCB1 maximisation at every level is correct without manually flipping signs at each step.

UCB vs. ε-greedy vs. Random — live K-armed bandit

Each arm is a slot machine with a hidden true reward (the tower's height). Three agents pull arms in parallel and try to find the tallest one. UCB1 explores until it's confident, then exploits. ε-greedy mostly exploits but takes random shots. Random doesn't learn. Watch the regret chart below — UCB's gap to the optimal arm shrinks fastest.

K-armed bandit · Bernoulli rewards

Step · 0

Last step (no pulls yet)

UCB1

—

Number of arms 3

UCB · C 1.41

ε-greedy · ε 0.10

ε-decay / step 0.0000

Speed (pulls/s) 20

UCB1

0.000

cum. regret · avg 0.000 · 0 pulls

Cumulative regret · lower is better

Pick a starting position

Game

Turn

Click cells to edit — pick a starting position

Step-by-step learning · watch every phase in 3D

Click Next phase to advance MCTS one phase at a time. The active phase glows, the search path lights up, and the tree builds in front of you. Edit the board above and hit ↺ Reset tree to start over from a new position.

Phase 1

Selection

walk down via UCB1

Phase 2

Expansion

add a new child

Phase 3

Simulation

random rollout

Phase 4

Backpropagation

push reward up

C (UCB1) 1.414

Speed 1.0×

Iter 0

No iterations yet. Click Next phase to start.

Play against MCTS

Game

You are O. MCTS is X and plays first.

Difficulty controls

Simulations 800

C (UCB1) 1.414

Last move stats — MCTS just played

Move chosen

—

Win rate (X)

—

Top-3 candidates considered

No moves yet

Session

MCTS Wins

Your Wins

Draws

MCTS hasn't moved yet — make your move (or pick "MCTS plays first") to see the search tree.

What's happening under the hood?

Every time MCTS thinks, it builds an entirely fresh tree from the current board, runs the chosen number of simulations, and picks the most-visited child. Try setting Simulations = 10 and C = 0 to see how badly MCTS plays when starved of both budget and exploration.

THE PROBLEM Go is far too large for MCTS on its own

MCTS works by growing a search tree and scoring moves with random play-out episodes. That is fine when the state space is small — but Go's is astronomically large:

TIC-TAC-TOE

~10³

states — plain MCTS solves it perfectly

CHESS

~10⁴⁴

positions — hard, but search can cope

GO · 19×19

~10¹⁷⁰

legal positions · branching ≈ 250 · game tree ≈ 10³⁶⁰

Why pure MCTS breaks here:

The tree can only ever touch a vanishing fraction of ~10³⁶⁰ nodes, so it has to be made enormous to be useful at all.
Each leaf is scored by a random rollout — a whole game of uniformly random moves. In Go that is almost meaningless noise.
To average that noise out you'd need on the order of 10⁵–10⁶ random episodes per move — and it would still play weakly.

→ The fix is not a bigger tree or more random games. It is to stop searching blindly — inject a learned policy and value. That is what the rest of this tab is about.

A single network with two heads — and the core observation

ALPHAGO · 2016

AlphaGo used two separate networks. The ① policy network is trained on a huge number of real human games — about 30 million expert moves from the KGS server — so it predicts the move a strong human would play. It drives Selection as the search prior. The ② value network predicts who will win, and drives Simulation, replacing the random rollout. Expansion and Backpropagation are left exactly as in vanilla MCTS.

💡 FUN FACT — THE STRONGER NET THAT NEVER PLAYED

AlphaGo actually trained two policy networks: an SL policy (network 1, learned by imitating ~30M human moves) and a stronger RL policy (network 2, improved by playing itself). You'd expect the stronger RL net to be the one wired into the MCTS — but in the historic match against Lee Sedol it was never used inside the search tree at all.

DeepMind found the SL policy guided Selection & Expansion better. Having only ever played itself, the RL net's vision was narrow — fixated on moves that were optimal against itself. The human-trained SL net offered a broader, more natural variety of moves, so the tree explored far more directions.

Then why train the RL net at all? It was a data factory: it played ~30 million games against itself, and the outcomes of those games were the sole training data for the value network.

ALPHAZERO · 2017

This is one single network — a shared trunk that splits into two output heads. There are no human games at all: it is trained entirely from its own MCTS self-play. The MCTS visit counts train the policy head, and the game outcomes train the value head. One model, learned by playing itself.

POLICY HEAD · P(a∣s)

Where to look

A probability over all legal moves. Role: the prior used during Selection.

Trained from: the MCTS visit distribution in self-play.

VALUE HEAD · V(s)

How good is this position

One scalar in [−1,+1]. Role: the leaf score in Simulation — replaces the rollout.

Trained from: the actual self-play outcome z.

Deep MCTS architecture

Each node shows value / visits. One search iteration — but Selection uses the two-head net's policy prior on the first move from the root, and Simulation is its value head, not a random play-out.

PHASE 01Selection

Descend using the policy head's prior P(s,a) — it biases the very first move from the root (no UCB1).

PHASE 02Expansion

Add one new child (a fresh 0/0); its move priors also come from the policy head.

PHASE 03Simulation

No random play-out. The value head of the two-head net scores the new leaf: V(s)=+0.46.

PHASE 04Backpropagation

Walk back up: visits +1; the value-head estimate is folded into each Q, alternating by player.

These four phases repeat hundreds of times per move. Every iteration only updates the MCTS — the tree's visit counts and value averages. The two-head network is frozen the whole time; its weights never change here.

Self-play loop

Steps 5–8 wrap that search in self-play. This is the only place the network learns — these steps update the DNN (a gradient step on the weights θ). The MCTS tree, by contrast, is rebuilt from scratch for every move.

Network plays itself using MCTS guided by its own current policy + value heads. No human data needed.
Each move yields a record: the state s, the MCTS visit distribution π at that state, and the final game outcome z (±1 / 0).
Train the network: minimise KL(policy, π) + (V − z)² + L2 regularisation. Policy learns from MCTS; value learns from actual outcomes.
Better network → stronger MCTS → harder self-play → better data. Repeat for days; the curriculum scales with compute.

From AlphaGo to MuZero

2016

AlphaGo

Supervised pre-train on human pro games + RL self-play.

Rollouts kept — averaged with value net.

Beat Lee Sedol 4–1 at Go.

2017

AlphaGo Zero

No human data — pure self-play from random init.

Rollouts removed; value head only.

Surpassed AlphaGo in 72 hours.

2017

AlphaZero

Same algorithm — chess, shogi, Go.

Beat Stockfish (chess) and Elmo (shogi) in 24h.

Demonstrated the framework's generality.

2019

MuZero

No game rules required — learns a dynamics model.

Plans in a learned latent space, not the true state.

Atari, Go, chess and shogi with one architecture.

Beyond games — same recipe everywhere

AlphaFold (2020)

Uses: predicts a protein's 3-D structure from its amino-acid sequence — reshaped structural biology and drug discovery. Architecture: the Evoformer — attention over the multiple-sequence alignment plus a pairwise residue representation — feeding a structure module that outputs 3-D coordinates end-to-end.

AlphaTensor (2022)

Uses: discovers faster matrix-multiplication algorithms (beat Strassen on 4×4 mod-2) — automated algorithm discovery. Architecture: recasts the task as a single-player game and adds a transformer policy/value network to AlphaZero's MCTS, searching over tensor decompositions.

FunSearch (2023)

Uses: finds new mathematical constructions and heuristics (cap-set lower bound, better online bin-packing). Architecture: pairs a pretrained LLM code-proposer with an automated evaluator in an evolutionary "islands" loop — search over programs, not weights.

AlphaChip (2024)

Uses: chip floorplanning — places macro blocks for real Google TPU layouts. Architecture: RL with an edge-based graph neural network encoding the netlist to produce a placement policy; rewards from wirelength/congestion proxies, with transfer across chips.