A Markov Decision Process (MDP) is a formal framework for modelling non-deterministic sequential decision problems.
It is widely used across many fields, including robotics, automatic control, economics, and manufacturing.
◆The environment is fully observable — the agent sees the complete state at every moment
◆The environment is stochastic — the outcome of an action is not certain
◆Time-invariant — the dynamics do not change over time (stationary)
◆The problem is ongoing — it can run indefinitely until an external termination signal is received
◆The agent receives a reward on every action it takes — the goal is to maximise the total discounted reward over time
◆The agent maximises expected return — rather than simply reaching a goal state, it strives to accumulate as much reward as possible along the entire trajectory
The 5 Components
S
States
States
All situations the agent can be in.
Example: every (row, col) in a grid.
A
Actions
Actions
Decisions the agent can make in a state.
Example: {UP, DOWN, LEFT, RIGHT}.
T
Transitions
Transition Model
P(s′|s,a) — probability of landing in s′ after action a.
Captures stochasticity: wind, slipping, noise.
R
Rewards
Reward Function
R(s,a,s′) — immediate signal after a transition.
Encodes the goal: what the agent should maximise.
γ
Gamma
Discount Factor
0 ≤ γ < 1 — how much future rewards are valued.
γ=0: only immediate. γ≈1: far-sighted.
What we know
P(s′|s,a) — transition probabilities R(s,a,s′) — reward function
⟹
What we want
π*(s) — the optimal policy the action at each state that maximises expected discounted return
An MDP formalises sequential decision-making: an Agent observes a state, takes an action, and the Environment returns a new state and a reward.
The Markov Property
s₀→s₁→s₂→s₃→s₄
The future depends only on the present state — not the past.
P(s₅ | s₄, s₃, s₂, s₁, s₀) = P(s₅ | s₄)
Once you know s₄, the entire history s₀…s₃ is irrelevant. This simplification makes MDPs tractable.
Key Takeaways
An MDP is defined by five components: S, A, T, R, and γ
The environment is fully observable, stochastic, and time-invariant
The agent selects actions; the environment transitions and emits a reward
The Markov property means only the current state matters — not history
The goal is to find the optimal policy π* that maximises expected discounted return
The 4×4 Grid World Environment
⟳ Loading 3D scene…
Steps
—
Reward
—
Outcome
—
Click "Run Random Policy" to start…
Grid Layout
Start (0,0)
Goal (3,3) +10
Hole (2,3) −10
Wall (1,1)
Each step costs −1. Terminal states: Goal and Hole.
Stochastic Transitions — from START, action RIGHT
80%
(0,1) intended
10%
(0,0) slip UP
10%
(1,0) slip DOWN
Transition Model
P(s′|s,a):
80% intended direction
10% slip left of a
10% slip right of a
Key Takeaways
The Grid World is a finite MDP with 16 states and 4 actions per state
Stochastic transitions (80/10/10) make the agent's path uncertain
Terminal states (Goal, Hole) end the episode immediately
A random policy will rarely reach the Goal — it needs to learn or plan
The environment is fully observable: agent always knows its (row, col)
Key Equations
Policy Evaluation
Given a fixed policy π, compute the value of every state:
Vπ(s) = R(s) + γΣs′P(s′ | s, π(s)) · Vπ(s′)
Optimal Policy
π* is the policy that produces the highest utility — choose the action that maximises expected future utility:
π*(s) = maxaΣs′P(s, a, s′) V(s′)
V(s′) = Value of state s′
Bellman Equation
Expresses the value of a state recursively. Two equivalent forms:
V*(s) = R(s) + γ · maxaΣs′P(s′ | s, a) · V*(s′)
When reward depends on the transition R(s, a, s′):
V*(s) = maxaΣs′P(s, a, s′) [ R(s, a, s′) + γV*(s′) ]
Value Iteration
Value Iteration (VI) finds the optimal value function V* by repeatedly applying the Bellman optimality update to every state until convergence. Once V* is known, the optimal policy π* is extracted in a single greedy pass — no model-free interaction needed.
Bellman Optimality Update (repeated until max|ΔV| < θ)
update value function · extract policy once at end
PI
π₀
→
↻ Eval → Improve until policy stable
Vπ
⇄
π′
⟹
π*
alternate full evaluation & greedy improvement
Policy Iteration
Policy Iteration (PI) alternates between two steps: Policy Evaluation — computing the exact value of the current policy — and Policy Improvement — making the policy greedy with respect to those values. The process terminates when the policy no longer changes, guaranteeing the optimal policy π*.
Step 1 — Policy Evaluation (solve for Vπ, fixed policy)
Repeat until π′ = π everywhere → optimal policy found
Algorithm Flow
VI
↻ Bellman sweeps until max|ΔV| < θ
V₀
→
V₁
→
···
→
V*
⟹
π*
update value function · extract policy once at end
PI
π₀
→
↻ Eval → Improve until policy stable
Vπ
⇄
π′
⟹
π*
alternate full evaluation & greedy improvement
Finding π*
Algorithm Flow
VI
↻ Bellman sweeps until max|ΔV| < θ
V₀
→
V₁
→
···
→
V*
⟹
π*
update value function · extract policy once at end
PI
π₀
→
↻ Eval → Improve until policy stable
Vπ
⇄
π′
⟹
π*
alternate full evaluation & greedy improvement
Value Iteration
Value Iteration (VI) finds the optimal value function V* by repeatedly applying the Bellman optimality update to every state until convergence. Once V* is known, the optimal policy π* is extracted in a single greedy pass — no model-free interaction needed.
Bellman Optimality Update (repeated until max|ΔV| < θ)
update value function · extract policy once at end
PI
π₀
→
↻ Eval → Improve until policy stable
Vπ
⇄
π′
⟹
π*
alternate full evaluation & greedy improvement
Interactive Visualisation
γ discount0.90Higher γ = more far-sighted · may need more iterations
Convergence metricStandard stopping criterion · guaranteed convergence to V*
Iteration
—
Max |ΔV|
—
V(start)
—
of N iters
—
3D Value Maptile height = V(s) · drag to rotate
⟳ Loading 3D scene…
Low VHigh V
Greedy Policy π(s)
arrow = best action · corner = V(s)
Convergence — Max |ΔV| per iteration
Key Takeaways
Value Iteration applies the Bellman optimality update repeatedly until values converge
Convergence is guaranteed — the Bellman operator is a contraction mapping
Tile height shows value: Goal area rises, Hole area sinks, path cells gradually lift
Higher γ makes the agent more far-sighted and may require more iterations
Once V* is found, the optimal policy π* is extracted in a single pass
Policy Iteration
Policy Iteration (PI) alternates between two steps: Policy Evaluation — computing the exact value of the current policy — and Policy Improvement — making the policy greedy with respect to those values. The process terminates when the policy no longer changes, guaranteeing the optimal policy π*.
Step 1 — Policy Evaluation (solve for Vπ, fixed policy)