Home
03 · Planning

Monte Carlo Tree Search

An interactive MCTS lab — authored by Zvika Berger

What is MCTS?

In computer science, Monte Carlo tree search (MCTS) is a heuristic search algorithm for some kinds of decision processes, most notably those employed in software that plays board games. In that context MCTS is used to solve the game tree.

MCTS was combined with neural networks in 2016 and has been used in multiple board games like Chess, Shogi, Checkers, Backgammon, Contract Bridge, Go, Scrabble, and Clobber as well as in turn-based-strategy video games (such as Total War: Rome II's implementation in the high-level campaign AI) and applications outside of games.

— Adapted from Wikipedia
The problem: game trees explode

A game tree grows exponentially with depth. If a position has on average b legal moves and the game lasts d plies, the total number of leaves is roughly bd. Even modest values make exhaustive search hopeless.

Game Avg. branching Avg. depth Total positions ≈
Tic-Tac-Toe ~ 5 9 ~ 2.6 × 105 (solvable)
Chess ~ 35 ~ 80 ~ 10120 (Shannon number)
Go (19×19) ~ 250 ~ 150 ~ 10360 (more than atoms in the universe)

Minimax visits every leaf of the tree to compute the exact value. That's fine for tic-tac-toe — but for chess (10120) it would take longer than the age of the universe.

Alpha-Beta pruning helps, but only as a constant-factor improvement: with perfect move ordering it cuts the effective branching factor from b down to roughly √b, so the work becomes ~bd/2. For chess that's still ~1060, for go ~10180 — still impossible. Move ordering also depends on a strong heuristic evaluation function, which is what makes alpha-beta so hard to apply to games like go where no simple evaluation exists.

MCTS dodges the problem entirely. It never enumerates the full tree — it samples the most promising branches via random rollouts and grows the search tree asymmetrically. With the same compute budget it can effectively search much deeper than minimax / alpha-beta, which is why it powered AlphaGo and modern game engines.

When each method works · when it fails
Method Works when… Fails when…
Minimax Tree is small (~106 nodes) Branching factor is huge (Go has ~250 moves/turn)
Alpha-Beta Good move ordering + a strong evaluation function exist Branching factor is still very large, or no simple eval (e.g. Go)
MCTS Simulation is fast and cheap; no evaluation function needed Simulation is slow or reward is delayed past many moves
Where MCTS is used in the real world
AlphaGo / AlphaZero
Go has ~250 moves per turn. Exhaustive search is impossible — MCTS paired with a neural network solved it.
Chess engines
Leela Chess Zero combines MCTS with deep neural value/policy nets.
Video-game AI
Real-time strategy games with huge action spaces and uncertainty.
Retrosynthesis
MCTS plans multi-step chemical syntheses — searching trees of reactions to find a route to a target molecule (e.g. AiZynthFinder).
Robot planning
Path planning in uncertain environments with sparse reward signals.
Combinatorial opt.
Scheduling, bin packing, travelling-salesman variants.
The four phases of one MCTS iteration
PHASE 01
Selection
Starting at the root, walk DOWN the tree, always choosing the child with the highest UCB1 score. Stop at a node that still has untried actions or is terminal.
PHASE 02
Expansion
Pick one untried action from the leaf, create the corresponding new child node, and attach it to the tree.
PHASE 03
Simulation
From the new child, play out random moves until the game ends. Record the outcome (win / lose / draw).
PHASE 04
Backpropagation
Walk back UP to the root. At every node on the path, increment N(v) and add the reward to Q(v).
The MCTS Loop
ROOT 1 · SELECTION walk down via UCB1 2 · EXPANSION add one child 3 · SIMULATION random rollout 4 · BACKPROP update N, Q up tree Repeat thousands of times → pick most-visited child

Each iteration of MCTS is exactly four phases. After many iterations the algorithm converges to a strong recommendation.

Worked example — one full MCTS iteration

Following the classic Wikipedia tree — each node shows wins / visits. The four panels below trace one full iteration from root 11/21 through the selection chain to a new leaf, a random rollout that wins, and the resulting backpropagated counts.

PHASE 01Selection
Walk down via UCB1, always picking the most-promising child. Stop at a leaf with untried actions.
PHASE 02Expansion
Add one new child below the selected leaf (a fresh node initialised to 0/0).
PHASE 03Simulation
Random rollout from the new child until terminal. Outcome here: a win.
PHASE 04Backpropagation
Walk back up. Visits go +1 everywhere, but the reward alternates by player: +1/+1 on the winning side's nodes, −1/+1 on the opponent's.
In Python — the entire algorithm
mcts.py
Python 3
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61
# mcts.py — the complete MCTS algorithm
import math, random

class MCTSNode:
    def __init__(self, state, parent=None, action_taken=None):
        self.state, self.parent = state, parent
        self.action_taken = action_taken
        self.children = []
        self.visit_count = 0
        self.total_reward = 0.0
        self.untried_actions = state.legal_actions()

    @property
    def is_terminal(self):       return self.state.is_terminal()
    @property
    def is_fully_expanded(self): return len(self.untried_actions) == 0

    def ucb1(self, C=math.sqrt(2)):
        if self.visit_count == 0: return float("inf")
        mean = self.total_reward / self.visit_count
        return mean + C * math.sqrt(math.log(self.parent.visit_count) / self.visit_count)

    def best_child(self, C):
        return max(self.children, key=lambda c: c.ucb1(C))

    def expand(self):
        action = self.untried_actions.pop()
        child  = MCTSNode(self.state.apply(action), parent=self, action_taken=action)
        self.children.append(child)
        return child


class MCTS:
    def __init__(self, C=math.sqrt(2)):
        self.C = C

    def search(self, root_state, num_simulations=1000):
        root = MCTSNode(state=root_state)
        for _ in range(num_simulations):
            node = self._select(root)            # Phase 1 — SELECTION
            if not node.is_terminal:             # Phase 2 — EXPANSION
                node = node.expand()
            reward = self._simulate(node.state)  # Phase 3 — SIMULATION
            self._backpropagate(node, reward)    # Phase 4 — BACKPROP
        return max(root.children, key=lambda c: c.visit_count).action_taken

    def _select(self, node):
        while not node.is_terminal and node.is_fully_expanded:
            node = node.best_child(self.C)
        return node

    def _simulate(self, state):
        while not state.is_terminal():
            state = state.apply(random.choice(state.legal_actions()))
        return state.reward()

    def _backpropagate(self, node, reward):
        while node is not None:
            node.visit_count += 1
            node.total_reward += reward
            reward = 1.0 - reward             # flip for opponent's perspective
            node = node.parent
Why most-visited, not highest Q-value?
After all simulations, you might think to choose the child with the highest average reward. Don't.
Unreliable
A node with 3 visits, all wins has Q = 1.0 — but it was barely explored. The 1.0 could be luck.
Reliable
A node with 300 visits, 70 % win rate has Q = 0.7 — heavily sampled, so the estimate is trustworthy.

UCB1 already directed effort to the truly best moves, so visit-count is essentially MCTS's "confidence vote."

The UCB1 formula
Upper Confidence Bound 1
UCB1(v) = μv + C · √( ln N / N(v) )
Exploitation · Q / N
The average reward seen through this child. High values mean "this move has been good so far — keep using it."
Exploration · C · √(ln Np / N)
Grows when this node is under-visited. Forces MCTS to occasionally try children that don't look the best yet — maybe they're actually better.
What does C control?
C = 0
Pure exploitation. Pours every simulation into the currently-best child. Can miss the truly best move because it never tested it.
C = √2 ≈ 1.414
Balanced (theoretical optimum). Spreads visits but still skews toward good moves. The standard default.
C = 5
Heavy exploration. Visits are nearly uniform — MCTS can't distinguish good from bad in time.
UCB1 in code
node.py
Python 3
1 2 3 4 5 6 7 8 9
def ucb1(self, exploration_constant=math.sqrt(2)):
    if self.visit_count == 0:
        return float("inf")  # always explore unvisited children first

    exploitation = self.total_reward / self.visit_count
    exploration  = exploration_constant * math.sqrt(
        math.log(self.parent.visit_count) / self.visit_count
    )
    return exploitation + exploration

A subtle but critical detail: N(v) == 0 returns +∞, so every child is tried at least once before scores are compared.

Reward convention
Outcome Reward (Player 1's view)
Player 1 wins1.0
Draw0.5
Player 2 wins0.0

Inside backprop, each node stores the reward from the perspective of the player who moved to it — so UCB1 maximisation at every level is correct without manually flipping signs at each step.

UCB vs. ε-greedy vs. Random — live K-armed bandit

Each arm is a slot machine with a hidden true reward (the tower's height). Three agents pull arms in parallel and try to find the tallest one. UCB1 explores until it's confident, then exploits. ε-greedy mostly exploits but takes random shots. Random doesn't learn. Watch the regret chart below — UCB's gap to the optimal arm shrinks fastest.

K-armed bandit · Bernoulli rewards
Step · 0
Last step (no pulls yet)
UCB1
UCB1
0.000
cum. regret · avg 0.000 · 0 pulls
Cumulative regret · lower is better
Pick a starting position
Game
Turn
Click cells to edit — pick a starting position
Step-by-step learning · watch every phase in 3D

Click Next phase to advance MCTS one phase at a time. The active phase glows, the search path lights up, and the tree builds in front of you. Edit the board above and hit ↺ Reset tree to start over from a new position.

Phase 1
Selection
walk down via UCB1
Phase 2
Expansion
add a new child
Phase 3
Simulation
random rollout
Phase 4
Backpropagation
push reward up
C (UCB1) 1.414
Speed 1.0×
Iter 0
No iterations yet. Click Next phase to start.
Play against MCTS
Game
You are O. MCTS is X and plays first.
Difficulty controls
Simulations 800
C (UCB1) 1.414
Last move stats — MCTS just played
Move chosen
Win rate (X)
Top-3 candidates considered
No moves yet
Session
MCTS Wins
0
Your Wins
0
Draws
0
MCTS hasn't moved yet — make your move (or pick "MCTS plays first") to see the search tree.
What's happening under the hood?
Every time MCTS thinks, it builds an entirely fresh tree from the current board, runs the chosen number of simulations, and picks the most-visited child. Try setting Simulations = 10 and C = 0 to see how badly MCTS plays when starved of both budget and exploration.
THE PROBLEM Go is far too large for MCTS on its own

MCTS works by growing a search tree and scoring moves with random play-out episodes. That is fine when the state space is small — but Go's is astronomically large:

TIC-TAC-TOE
~103
states — plain MCTS solves it perfectly
CHESS
~1044
positions — hard, but search can cope
GO · 19×19
~10170
legal positions · branching ≈ 250 · game tree ≈ 10360

Why pure MCTS breaks here:

  • The tree can only ever touch a vanishing fraction of ~10360 nodes, so it has to be made enormous to be useful at all.
  • Each leaf is scored by a random rollout — a whole game of uniformly random moves. In Go that is almost meaningless noise.
  • To average that noise out you'd need on the order of 105–106 random episodes per move — and it would still play weakly.

→ The fix is not a bigger tree or more random games. It is to stop searching blindly — inject a learned policy and value. That is what the rest of this tab is about.

A single network with two heads — and the core observation
ALPHAGO · 2016 — TWO SEPARATE NETWORKS, EACH DRIVES ONE PHASE State s ① Policy network trained on human games State s ② Value network predicts the winner 1 · SELECTIONprior ← ① policy net 2 · EXPANSIONunchanged 3 · SIMULATIONvalue ← ② value net 4 · BACKPROPAGATIONunchanged prior P(s,a) value · no rollout
ALPHAGO · 2016
AlphaGo used two separate networks. The ① policy network is trained on a huge number of real human games — about 30 million expert moves from the KGS server — so it predicts the move a strong human would play. It drives Selection as the search prior. The ② value network predicts who will win, and drives Simulation, replacing the random rollout. Expansion and Backpropagation are left exactly as in vanilla MCTS.
💡 FUN FACT — THE STRONGER NET THAT NEVER PLAYED

AlphaGo actually trained two policy networks: an SL policy (network 1, learned by imitating ~30M human moves) and a stronger RL policy (network 2, improved by playing itself). You'd expect the stronger RL net to be the one wired into the MCTS — but in the historic match against Lee Sedol it was never used inside the search tree at all.

DeepMind found the SL policy guided Selection & Expansion better. Having only ever played itself, the RL net's vision was narrow — fixated on moves that were optimal against itself. The human-trained SL net offered a broader, more natural variety of moves, so the tree explored far more directions.

Then why train the RL net at all? It was a data factory: it played ~30 million games against itself, and the outcomes of those games were the sole training data for the value network.

ALPHAZERO · 2017 — ONE NETWORK, TWO HEADS, EACH DRIVES ONE PHASEState sONE networkshared trunk · ResNet ~20Policy head · P(a∣s)+0.46Value head · V(s)1 · SELECTIONprior ← policy head2 · EXPANSIONunchanged3 · SIMULATIONvalue ← value head4 · BACKPROPAGATIONunchangedprior P(s,a)value · no rollout
ALPHAZERO · 2017
This is one single network — a shared trunk that splits into two output heads. There are no human games at all: it is trained entirely from its own MCTS self-play. The MCTS visit counts train the policy head, and the game outcomes train the value head. One model, learned by playing itself.
POLICY HEAD · P(a∣s)
Where to look

A probability over all legal moves. Role: the prior used during Selection.

Trained from: the MCTS visit distribution in self-play.

VALUE HEAD · V(s)
How good is this position

One scalar in [−1,+1]. Role: the leaf score in Simulation — replaces the rollout.

Trained from: the actual self-play outcome z.

Deep MCTS architecture

Each node shows value / visits. One search iteration — but Selection uses the two-head net's policy prior on the first move from the root, and Simulation is its value head, not a random play-out.

PHASE 01Selection
11/2111/217/107/100/30/35/85/82/42/45/65/62/32/3TWO-HEAD NETpolicyvalueprior P(s,a)
Descend using the policy head's prior P(s,a) — it biases the very first move from the root (no UCB1).
PHASE 02Expansion
11/2111/217/107/100/30/35/85/82/42/45/65/62/32/30/00/0
Add one new child (a fresh 0/0); its move priors also come from the policy head.
PHASE 03Simulation
11/2111/217/107/100/30/35/85/82/42/45/65/62/32/30/00/0TWO-HEAD NETpolicyvalueV = +0.46NO ROLLOUT · NETWORK SCORES THE LEAF
No random play-out. The value head of the two-head net scores the new leaf: V(s)=+0.46.
PHASE 04Backpropagation
11/2211/228/118/110/30/35/85/82/42/45/75/73/43/40/10/1
Walk back up: visits +1; the value-head estimate is folded into each Q, alternating by player.
These four phases repeat hundreds of times per move. Every iteration only updates the MCTS — the tree's visit counts and value averages. The two-head network is frozen the whole time; its weights never change here.
Self-play loop
Steps 5–8 wrap that search in self-play. This is the only place the network learns — these steps update the DNN (a gradient step on the weights θ). The MCTS tree, by contrast, is rebuilt from scratch for every move.
NETWORK policy + value MCTS ~800 sims/move SELF-PLAY net plays itself · games TRAINING (state, π, z) → loss guides search plays games collects records updates weights
  1. Network plays itself using MCTS guided by its own current policy + value heads. No human data needed.
  2. Each move yields a record: the state s, the MCTS visit distribution π at that state, and the final game outcome z (±1 / 0).
  3. Train the network: minimise KL(policy, π) + (V − z)2 + L2 regularisation. Policy learns from MCTS; value learns from actual outcomes.
  4. Better network → stronger MCTS → harder self-play → better data. Repeat for days; the curriculum scales with compute.
From AlphaGo to MuZero
2016
AlphaGo
Supervised pre-train on human pro games + RL self-play.
Rollouts kept — averaged with value net.
Beat Lee Sedol 4–1 at Go.
2017
AlphaGo Zero
No human data — pure self-play from random init.
Rollouts removed; value head only.
Surpassed AlphaGo in 72 hours.
2017
AlphaZero
Same algorithm — chess, shogi, Go.
Beat Stockfish (chess) and Elmo (shogi) in 24h.
Demonstrated the framework's generality.
2019
MuZero
No game rules required — learns a dynamics model.
Plans in a learned latent space, not the true state.
Atari, Go, chess and shogi with one architecture.
Beyond games — same recipe everywhere
AlphaFold (2020)
Uses: predicts a protein's 3-D structure from its amino-acid sequence — reshaped structural biology and drug discovery. Architecture: the Evoformer — attention over the multiple-sequence alignment plus a pairwise residue representation — feeding a structure module that outputs 3-D coordinates end-to-end.
AlphaTensor (2022)
Uses: discovers faster matrix-multiplication algorithms (beat Strassen on 4×4 mod-2) — automated algorithm discovery. Architecture: recasts the task as a single-player game and adds a transformer policy/value network to AlphaZero's MCTS, searching over tensor decompositions.
FunSearch (2023)
Uses: finds new mathematical constructions and heuristics (cap-set lower bound, better online bin-packing). Architecture: pairs a pretrained LLM code-proposer with an automated evaluator in an evolutionary "islands" loop — search over programs, not weights.
AlphaChip (2024)
Uses: chip floorplanning — places macro blocks for real Google TPU layouts. Architecture: RL with an edge-based graph neural network encoding the netlist to produce a placement policy; rewards from wirelength/congestion proxies, with transfer across chips.