Markov Decision Processes

An interactive MDP lab — authored by Zvika Berger

What is an MDP?

A Markov Decision Process (MDP) is a formal framework for modelling non-deterministic sequential decision problems. It is widely used across many fields, including robotics, automatic control, economics, and manufacturing.

◆The environment is fully observable — the agent sees the complete state at every moment

◆The environment is stochastic — the outcome of an action is not certain

◆Time-invariant — the dynamics do not change over time (stationary)

◆The problem is ongoing — it can run indefinitely until an external termination signal is received

◆The agent receives a reward on every action it takes — the goal is to maximise the total discounted reward over time

◆The agent maximises expected return — rather than simply reaching a goal state, it strives to accumulate as much reward as possible along the entire trajectory

The 5 Components

States

All situations the agent can be in.

Example: every (row, col) in a grid.

Actions

Decisions the agent can make in a state.

Example: {UP, DOWN, LEFT, RIGHT}.

Transitions

Transition Model

P(s′|s,a) — probability of landing in s′ after action a.

Captures stochasticity: wind, slipping, noise.

Rewards

Reward Function

R(s,a,s′) — immediate signal after a transition.

Encodes the goal: what the agent should maximise.

Gamma

Discount Factor

0 ≤ γ < 1 — how much future rewards are valued.

γ=0: only immediate. γ≈1: far-sighted.

What we know

P(s′|s,a) — transition probabilities
R(s,a,s′) — reward function

⟹

What we want

π*(s) — the optimal policy
the action at each state that maximises expected discounted return

An MDP formalises sequential decision-making: an Agent observes a state, takes an action, and the Environment returns a new state and a reward.

The Markov Property

s₀→ s₁→ s₂→ s₃→ s₄

The future depends only on the present state — not the past.
P(s₅ | s₄, s₃, s₂, s₁, s₀) = P(s₅ | s₄)

Once you know s₄, the entire history s₀…s₃ is irrelevant. This simplification makes MDPs tractable.

Key Takeaways

An MDP is defined by five components: S, A, T, R, and γ
The environment is fully observable, stochastic, and time-invariant
The agent selects actions; the environment transitions and emits a reward
The Markov property means only the current state matters — not history
The goal is to find the optimal policy π* that maximises expected discounted return

The 4×4 Grid World Environment

⟳ Loading 3D scene…

Steps

—

Reward

—

Outcome

—

Click "Run Random Policy" to start…

Grid Layout

Start (0,0)

Goal (3,3) +10

Hole (2,3) −10

Wall (1,1)

Each step costs −1. Terminal states: Goal and Hole.

Stochastic Transitions — from START, action RIGHT

80%

(0,1) intended

10%

(0,0) slip UP

10%

(1,0) slip DOWN

Transition Model

P(s′|s,a): 80% intended direction 10% slip left of a 10% slip right of a

Key Takeaways

The Grid World is a finite MDP with 16 states and 4 actions per state
Stochastic transitions (80/10/10) make the agent's path uncertain
Terminal states (Goal, Hole) end the episode immediately
A random policy will rarely reach the Goal — it needs to learn or plan
The environment is fully observable: agent always knows its (row, col)

Key Equations

Policy Evaluation

Given a fixed policy π, compute the value of every state:

V_π(s) = R(s) + γ Σ_s′ P(s′ | s, π(s)) · V_π(s′)

Optimal Policy

π* is the policy that produces the highest utility — choose the action that maximises expected future utility:

π*(s) = max_a Σ_s′ P(s, a, s′) V(s′)

V(s′) = Value of state s′

Bellman Equation

Expresses the value of a state recursively. Two equivalent forms:

V*(s) = R(s) + γ · max_a Σ_s′ P(s′ | s, a) · V*(s′)

When reward depends on the transition R(s, a, s′):

V*(s) = max_a Σ_s′ P(s, a, s′) [ R(s, a, s′) + γ V*(s′) ]

Value Iteration

Value Iteration (VI) finds the optimal value function V* by repeatedly applying the Bellman optimality update to every state until convergence. Once V* is known, the optimal policy π* is extracted in a single greedy pass — no model-free interaction needed.

Bellman Optimality Update (repeated until max|ΔV| < θ)

V(s) ← max_a Σ_s′ P(s′|s,a) [ R(s,a,s′) + γ · V(s′) ]

Policy Extraction (once, after convergence)

π*(s) = argmax_a Σ_s′ P(s′|s,a) [ R(s,a,s′) + γ · V*(s′) ]

Algorithm Flow

↻ Bellman sweeps until max|ΔV| < θ

V₀

→

V₁

→

···

→

⟹

π*

update value function · extract policy once at end

π₀

→

↻ Eval → Improve until policy stable

V^π

⇄

π′

⟹

π*

alternate full evaluation & greedy improvement

Policy Iteration

Policy Iteration (PI) alternates between two steps: Policy Evaluation — computing the exact value of the current policy — and Policy Improvement — making the policy greedy with respect to those values. The process terminates when the policy no longer changes, guaranteeing the optimal policy π*.

Step 1 — Policy Evaluation (solve for V^π, fixed policy)

V^π(s) = Σ_s′ P(s′|s,π(s)) [ R(s,π(s),s′) + γ · V^π(s′) ]

Step 2 — Policy Improvement (greedy update)

π′(s) = argmax_a Σ_s′ P(s′|s,a) [ R(s,a,s′) + γ · V^π(s′) ]

Repeat until π′ = π everywhere → optimal policy found

Algorithm Flow

↻ Bellman sweeps until max|ΔV| < θ

V₀

→

V₁

→

···

→

⟹

π*

update value function · extract policy once at end

π₀

→

↻ Eval → Improve until policy stable

V^π

⇄

π′

⟹

π*

alternate full evaluation & greedy improvement

Finding π*

Algorithm Flow

↻ Bellman sweeps until max|ΔV| < θ

V₀

→

V₁

→

···

→

⟹

π*

update value function · extract policy once at end

π₀

→

↻ Eval → Improve until policy stable

V^π

⇄

π′

⟹

π*

alternate full evaluation & greedy improvement

Value Iteration

Bellman Optimality Update (repeated until max|ΔV| < θ)

V(s) ← max_a Σ_s′ P(s′|s,a) [ R(s,a,s′) + γ · V(s′) ]

Policy Extraction (once, after convergence)

π*(s) = argmax_a Σ_s′ P(s′|s,a) [ R(s,a,s′) + γ · V*(s′) ]

Algorithm Flow

↻ Bellman sweeps until max|ΔV| < θ

V₀

→

V₁

→

···

→

⟹

π*

update value function · extract policy once at end

π₀

→

↻ Eval → Improve until policy stable

V^π

⇄

π′

⟹

π*

alternate full evaluation & greedy improvement

Interactive Visualisation

γ discount 0.90 Higher γ = more far-sighted · may need more iterations

Convergence metric Standard stopping criterion · guaranteed convergence to V*

Iteration

—

Max |ΔV|

—

V(start)

—

of N iters

—

3D Value Map tile height = V(s) · drag to rotate

⟳ Loading 3D scene…

Low V

High V

Greedy Policy π(s)

arrow = best action · corner = V(s)

Convergence — Max |ΔV| per iteration

Key Takeaways

Value Iteration applies the Bellman optimality update repeatedly until values converge
Convergence is guaranteed — the Bellman operator is a contraction mapping
Tile height shows value: Goal area rises, Hole area sinks, path cells gradually lift
Higher γ makes the agent more far-sighted and may require more iterations
Once V* is found, the optimal policy π* is extracted in a single pass

Policy Iteration

Step 1 — Policy Evaluation (solve for V^π, fixed policy)

V^π(s) = Σ_s′ P(s′|s,π(s)) [ R(s,π(s),s′) + γ · V^π(s′) ]

Step 2 — Policy Improvement (greedy update)

π′(s) = argmax_a Σ_s′ P(s′|s,a) [ R(s,a,s′) + γ · V^π(s′) ]

Repeat until π′ = π everywhere → optimal policy found

The Two-Phase Cycle

Policy Evaluation

Compute V^π by solving
the Bellman expectation
equation for fixed π

→

Policy Improvement

Make π greedy w.r.t. V^π
π_new(s) = argmax_a Q(s,a)

→

Convergence

If π_new = π everywhere
→ Optimal policy found!

Policy Evaluation (Bellman Expectation)

V^π(s) = Σ_s′ P(s′|s,π(s)) [ R(s,π(s),s′) + γ · V^π(s′) ]

Algorithm Flow

↻ Bellman sweeps until max|ΔV| < θ

V₀

→

V₁

→

···

→

⟹

π*

update value function · extract policy once at end

π₀

→

↻ Eval → Improve until policy stable

V^π

⇄

π′

⟹

π*

alternate full evaluation & greedy improvement

Interactive Visualisation

γ discount 0.90 Higher γ = more far-sighted · may need more iterations

Iteration

—

Cells Changed

—

V(start)

—

Status

—

3D Policy Map cone = π(s) · drag to rotate

⟳ Loading 3D scene…

▲ UP ▼ DOWN ◀ LEFT ▶ RIGHT

Policy π^k(s)

arrow = action · corner = V^π(s)

Iteration Log

Press "Next →" to step through iterations…

VI vs PI Comparison

Property	Value Iteration	Policy Iteration
Outer iterations	—	—
V*(start)	—	—
Inner loop	One Bellman sweep per iter	Full policy eval to convergence
Needs model?	Yes (T and R)	Yes (T and R)
Convergence	Guaranteed (contraction)	Guaranteed (finite policies)
Typical speed	More outer iters	Fewer outer iters

Key Takeaways

Policy Iteration alternates between full evaluation and greedy improvement
It always converges — the policy space is finite so improvements must terminate
Both VI and PI find the exact same optimal policy π*
PI typically needs fewer outer iterations because evaluation is exact each time
Both algorithms require knowing the model T and R — they are model-based planners

Grid Designer Sandbox

Build your own MDP · tune probabilities & discount · solve with VI or PI · evaluate the policy

Grid Editor

Size ×

Paint

Step cost per non-terminal step

Algorithm Settings

Algorithm

Metric

γ discount 0.90

Transition Probabilities

Intended 80%

Slip Left 10%

Slip Right 10%

Sum: 100% ✓

Presets

S = Start · = Normal (step cost) █ = Wall

G = Goal (positive terminal) H = Hole (negative terminal)

$ = Custom terminal reward Right-click = remove