Home

Markov Decision Processes

An interactive MDP lab — authored by Zvika Berger

What is an MDP?

A Markov Decision Process (MDP) is a formal framework for modelling non-deterministic sequential decision problems. It is widely used across many fields, including robotics, automatic control, economics, and manufacturing.

The environment is fully observable — the agent sees the complete state at every moment
The environment is stochastic — the outcome of an action is not certain
Time-invariant — the dynamics do not change over time (stationary)
The problem is ongoing — it can run indefinitely until an external termination signal is received
The agent receives a reward on every action it takes — the goal is to maximise the total discounted reward over time
The agent maximises expected return — rather than simply reaching a goal state, it strives to accumulate as much reward as possible along the entire trajectory
The 5 Components
S
States
States
All situations the agent can be in.

Example: every (row, col) in a grid.
A
Actions
Actions
Decisions the agent can make in a state.

Example: {UP, DOWN, LEFT, RIGHT}.
T
Transitions
Transition Model
P(s′|s,a) — probability of landing in s′ after action a.

Captures stochasticity: wind, slipping, noise.
R
Rewards
Reward Function
R(s,a,s′) — immediate signal after a transition.

Encodes the goal: what the agent should maximise.
γ
Gamma
Discount Factor
0 ≤ γ < 1 — how much future rewards are valued.

γ=0: only immediate. γ≈1: far-sighted.
What we know
P(s′|s,a) — transition probabilities
R(s,a,s′) — reward function
What we want
π*(s) — the optimal policy
the action at each state that maximises expected discounted return
AGENT π(s) → a ENVIRONMENT T(s,a,s') R(s,a) action a state s′, reward r

An MDP formalises sequential decision-making: an Agent observes a state, takes an action, and the Environment returns a new state and a reward.

The Markov Property
s₀ s₁ s₂ s₃ s₄

The future depends only on the present state — not the past.
P(s₅ | s₄, s₃, s₂, s₁, s₀) = P(s₅ | s₄)

Once you know s₄, the entire history s₀…s₃ is irrelevant. This simplification makes MDPs tractable.

Key Takeaways
  • An MDP is defined by five components: S, A, T, R, and γ
  • The environment is fully observable, stochastic, and time-invariant
  • The agent selects actions; the environment transitions and emits a reward
  • The Markov property means only the current state matters — not history
  • The goal is to find the optimal policy π* that maximises expected discounted return
The 4×4 Grid World Environment
⟳ Loading 3D scene…
Steps
Reward
Outcome
Click "Run Random Policy" to start…
Grid Layout
Start (0,0)
Goal (3,3) +10
Hole (2,3) −10
Wall (1,1)

Each step costs −1. Terminal states: Goal and Hole.

Stochastic Transitions — from START, action RIGHT
80%
(0,1) intended
10%
(0,0) slip UP
10%
(1,0) slip DOWN
Transition Model
P(s′|s,a): 80% intended direction 10% slip left of a 10% slip right of a
Key Takeaways
  • The Grid World is a finite MDP with 16 states and 4 actions per state
  • Stochastic transitions (80/10/10) make the agent's path uncertain
  • Terminal states (Goal, Hole) end the episode immediately
  • A random policy will rarely reach the Goal — it needs to learn or plan
  • The environment is fully observable: agent always knows its (row, col)
Key Equations
Policy Evaluation
Given a fixed policy π, compute the value of every state:
Vπ(s) = R(s) + γ Σs′ P(s′ | s, π(s)) · Vπ(s′)
Optimal Policy
π* is the policy that produces the highest utility — choose the action that maximises expected future utility:
π*(s) = maxa Σs′ P(s, a, s′) V(s′)
V(s′) = Value of state s′
Bellman Equation
Expresses the value of a state recursively. Two equivalent forms:
V*(s) = R(s) + γ · maxa Σs′ P(s′ | s, a) · V*(s′)
When reward depends on the transition R(s, a, s′):
V*(s) = maxa Σs′ P(s, a, s′) [ R(s, a, s′) + γ V*(s′) ]
Value Iteration

Value Iteration (VI) finds the optimal value function V* by repeatedly applying the Bellman optimality update to every state until convergence. Once V* is known, the optimal policy π* is extracted in a single greedy pass — no model-free interaction needed.

Bellman Optimality Update (repeated until max|ΔV| < θ)
V(s) ← maxa Σs′ P(s′|s,a) [ R(s,a,s′) + γ · V(s′) ]
Policy Extraction (once, after convergence)
π*(s) = argmaxa Σs′ P(s′|s,a) [ R(s,a,s′) + γ · V*(s′) ]
Algorithm Flow
VI
↻  Bellman sweeps until max|ΔV| < θ
V₀
V₁
···
V*
π*
update value function · extract policy once at end
PI
π₀
↻  Eval → Improve until policy stable
Vπ
π′
π*
alternate full evaluation & greedy improvement
Policy Iteration

Policy Iteration (PI) alternates between two steps: Policy Evaluation — computing the exact value of the current policy — and Policy Improvement — making the policy greedy with respect to those values. The process terminates when the policy no longer changes, guaranteeing the optimal policy π*.

Step 1 — Policy Evaluation (solve for Vπ, fixed policy)
Vπ(s) = Σs′ P(s′|s,π(s)) [ R(s,π(s),s′) + γ · Vπ(s′) ]
Step 2 — Policy Improvement (greedy update)
π′(s) = argmaxa Σs′ P(s′|s,a) [ R(s,a,s′) + γ · Vπ(s′) ]
Repeat until π′ = π everywhere → optimal policy found
Algorithm Flow
VI
↻  Bellman sweeps until max|ΔV| < θ
V₀
V₁
···
V*
π*
update value function · extract policy once at end
PI
π₀
↻  Eval → Improve until policy stable
Vπ
π′
π*
alternate full evaluation & greedy improvement
Finding π*
Algorithm Flow
VI
↻  Bellman sweeps until max|ΔV| < θ
V₀
V₁
···
V*
π*
update value function · extract policy once at end
PI
π₀
↻  Eval → Improve until policy stable
Vπ
π′
π*
alternate full evaluation & greedy improvement
Value Iteration

Value Iteration (VI) finds the optimal value function V* by repeatedly applying the Bellman optimality update to every state until convergence. Once V* is known, the optimal policy π* is extracted in a single greedy pass — no model-free interaction needed.

Bellman Optimality Update (repeated until max|ΔV| < θ)
V(s) ← maxa Σs′ P(s′|s,a) [ R(s,a,s′) + γ · V(s′) ]
Policy Extraction (once, after convergence)
π*(s) = argmaxa Σs′ P(s′|s,a) [ R(s,a,s′) + γ · V*(s′) ]
Algorithm Flow
VI
↻  Bellman sweeps until max|ΔV| < θ
V₀
V₁
···
V*
π*
update value function · extract policy once at end
PI
π₀
↻  Eval → Improve until policy stable
Vπ
π′
π*
alternate full evaluation & greedy improvement
Interactive Visualisation
γ discount 0.90 Higher γ = more far-sighted · may need more iterations
Convergence metric Standard stopping criterion · guaranteed convergence to V*
Iteration
Max |ΔV|
V(start)
of N iters
3D Value Map tile height = V(s) · drag to rotate
⟳ Loading 3D scene…
Low V
High V
Greedy Policy π(s)
arrow = best action · corner = V(s)
Convergence — Max |ΔV| per iteration
Key Takeaways
  • Value Iteration applies the Bellman optimality update repeatedly until values converge
  • Convergence is guaranteed — the Bellman operator is a contraction mapping
  • Tile height shows value: Goal area rises, Hole area sinks, path cells gradually lift
  • Higher γ makes the agent more far-sighted and may require more iterations
  • Once V* is found, the optimal policy π* is extracted in a single pass
Policy Iteration

Policy Iteration (PI) alternates between two steps: Policy Evaluation — computing the exact value of the current policy — and Policy Improvement — making the policy greedy with respect to those values. The process terminates when the policy no longer changes, guaranteeing the optimal policy π*.

Step 1 — Policy Evaluation (solve for Vπ, fixed policy)
Vπ(s) = Σs′ P(s′|s,π(s)) [ R(s,π(s),s′) + γ · Vπ(s′) ]
Step 2 — Policy Improvement (greedy update)
π′(s) = argmaxa Σs′ P(s′|s,a) [ R(s,a,s′) + γ · Vπ(s′) ]
Repeat until π′ = π everywhere → optimal policy found
The Two-Phase Cycle
Policy Evaluation
Compute Vπ by solving
the Bellman expectation
equation for fixed π
Policy Improvement
Make π greedy w.r.t. Vπ
π_new(s) = argmaxa Q(s,a)
Convergence
If π_new = π everywhere
→ Optimal policy found!
Policy Evaluation (Bellman Expectation)
Vπ(s) = Σs′ P(s′|s,π(s)) [ R(s,π(s),s′) + γ · Vπ(s′) ]
Algorithm Flow
VI
↻  Bellman sweeps until max|ΔV| < θ
V₀
V₁
···
V*
π*
update value function · extract policy once at end
PI
π₀
↻  Eval → Improve until policy stable
Vπ
π′
π*
alternate full evaluation & greedy improvement
Interactive Visualisation
γ discount 0.90 Higher γ = more far-sighted · may need more iterations
Iteration
Cells Changed
V(start)
Status
3D Policy Map cone = π(s) · drag to rotate
⟳ Loading 3D scene…
▲ UP ▼ DOWN ◀ LEFT ▶ RIGHT
Policy πk(s)
arrow = action · corner = Vπ(s)
Iteration Log
Press "Next →" to step through iterations…
VI vs PI Comparison
PropertyValue IterationPolicy Iteration
Outer iterations
V*(start)
Inner loopOne Bellman sweep per iterFull policy eval to convergence
Needs model?Yes (T and R)Yes (T and R)
ConvergenceGuaranteed (contraction)Guaranteed (finite policies)
Typical speedMore outer itersFewer outer iters
Key Takeaways
  • Policy Iteration alternates between full evaluation and greedy improvement
  • It always converges — the policy space is finite so improvements must terminate
  • Both VI and PI find the exact same optimal policy π*
  • PI typically needs fewer outer iterations because evaluation is exact each time
  • Both algorithms require knowing the model T and R — they are model-based planners
Grid Designer Sandbox
Build your own MDP · tune probabilities & discount · solve with VI or PI · evaluate the policy
Grid Editor
Size ×
Paint
Step cost per non-terminal step
Algorithm Settings
Algorithm
Metric
γ discount 0.90
Transition Probabilities
Intended 80%
Slip Left 10%
Slip Right 10%
Sum: 100% ✓
Presets
S = Start   · = Normal (step cost)   = Wall
G = Goal (positive terminal)   H = Hole (negative terminal)
$ = Custom terminal reward   Right-click = remove