Home
Lab · CartPole · PyTorch

Deep Reinforcement Learning

Authored by Zvika Berger.

Q-learning with a table

Pick a small environment, allocate one row per state and one column per action, and fill every cell with a real number — the agent's current estimate of the future return for taking that action in that state. Each transition only touches one cell: there is no generalisation between states.

Environment · 4-state corridor
s₀ s₁ s₂ s₃ start agent ● · goal · r=+1 LEFT RIGHT
4 states · 2 actions ⇒ Q[s][a] is a 4 × 2 table of floats. Step RIGHT in s₃ gives r=+1 and ends the episode; all other transitions give r=0.
Q-table · 4 × 2
stateLEFTRIGHT
s₀+0.00+0.59
s₁+0.53+0.66 ← update target
s₂+0.59+0.81
s₃+0.00+0.00
Every cell is one float. Greedy policy reads each row: π(s) = argmaxa Q[s][a].
Worked update · agent in s₁, takes RIGHT, lands in s₂ (r = 0)
# step 1 — compute target with the next state's best Q (off-policy max) y = r + γ · maxa′ Q(s₂, a′) = 0 + 0.9 · 0.81 = 0.729 # step 2 — TD error: how surprised are we? δ = y − Q(s₁, RIGHT) = 0.729 − 0.66 = +0.069 # step 3 — nudge the single cell Q(s₁, RIGHT) by α·δ (α = 0.1) Q(s₁, RIGHT) ← 0.66 + 0.1 · 0.069 = 0.667
Q-Learning agent — textbook pseudocode
function Q-Learning-Agent(percept) returns an action
inputs: percept, indicating the current state s and reward signal r static: Q, a table of action values indexed by state and action s, a, r, the previous state, action, and reward, initially null s′, a′, the next state and action Initialize Q(s, a) arbitrarily Repeat (for each episode): Initialize s to start state Repeat (for each step of episode): Choose action a given s using policy derived from Q ← e.g. ε-greedy Take action a, observe reward r and next state s′ Q(s, a)Q(s, a) + α[ r + γ maxa′ Q(s′, a′)Q(s, a) ] ss′ until s is terminal
The Q-learning update rule
Bellman target & one-step TD update
# target y = r + γ · maxa′ Q(s′, a′) # temporal-difference (TD) error δ = y − Q(s, a) # tabular update (single cell, no other state changes) Q(s, a) ← Q(s, a) + α · δ
Plain English
"If the reward I got plus the value of where I ended up is bigger than what I expected, nudge Q(s,a) up by α-times the surprise. If smaller, nudge it down."
ε-greedy exploration
Pick an action — explore vs exploit
a = { random action with prob ε argmaxa Q(s, a) with prob 1 − ε }
Why we need exploration
Without it, the agent locks onto the first decent action it finds and never tries others. Common practice: start with ε=1.0 and decay to ~0.05 over episodes.
From tables to deep nets

Tabular Q-learning stores one cell Q(s,a) for every state–action pair. That breaks for two reasons:

Problem 1 — the table blows up
Real environments have astronomically many states (an Atari frame is 84×84 grayscale → 2567056 possible images). You cannot store, let alone visit, a row per state. And because every cell is independent, learning a good policy means visiting every (s, a) pair often enough to update each one of those |S|·|A| entries — a hopeless amount of experience.
Problem 2 — continuous values
Sensor readings — cart position, pole angle, joint velocities — are continuous real numbers. The state space is therefore uncountably infinite: between any two values there are infinitely many more. A table needs discrete keys, so any discretisation either loses precision (collapses similar states into the same row, throwing away the signal) or explodes the row count again as you refine the grid.
Example

Tabular Q-learning stores one number per (state, action) pair. That is fine for Grid World, but it crumbles the moment the state space becomes large or continuous.

Atari — huge discrete state
One state = one frame. With Atari's 210 × 160 pixels  ×  3 (RGB) the state vector is 210·160·3 = 100 800 bytes. The number of possible states is 256100 800 — astronomically many table rows.
CartPole — continuous state
The state includes the pole angle θ ∈ ℝ (and its velocity, cart position, cart velocity). A real-valued angle has distinct values — no table can index them without lossy discretisation.

DQN replaces the table with a deep neural network Qθ(s,a). The net takes the raw (continuous) state as input and generalises across nearby states, so we never need to enumerate them.

Tabular ⟶ Function Approximation
TABULAR a₀ a₁ a₂ s₀ s₁ s₂ 0.42 0.18 0.31 0.07 0.55 0.22 0.91 0.13 0.48 parameterise DEEP NET Qθ(s) state s Q(s, a₀) Q(s, a₁) Q(s, a₂)

The table has m×n independent floats. The network has only θ ≈ a few thousand floats — and they are shared across all states.

Generalisation
A deep net updates all similar states when it learns from one transition, because they share weights. That's the magic — and the danger: a bad gradient hurts everything at once.

A parametric Q-function Qθ(s, a) implemented as a small deep network — raw state in, one Q-value out per discrete action.

Loss
q(s, a) − q(s, a) = loss 𝔼[ Rt+1 + γ · maxa′ q(s′, a′) ] − 𝔼[ k=0 γk Rt+k+1 ] = loss

That target term q(s′, a′) is also produced by the same deep network — we just feed it the next state s′ instead of s and read off the Q-values for each action.

OpenAI Gym environments

Pick an environment. Each one is a little world with its own state, actions, and reward — the contract every RL algorithm has to satisfy. CartPole is the canonical sanity check; the others stress different parts of the algorithm.

env
x0.00
θ°0.00
return0
statusidle
State / Action / Reward
Episode ends when
Why CartPole?
Deep Q-Network (DQN)
Problem 1 — Non-i.i.d.
The data isn't independent & identically distributed — consecutive transitions are tightly correlated in time, so each gradient step pulls the net toward whatever just happened next.
Trick 1 — Experience Replay
Store every transition (s,a,r,s′) in a circular buffer D. Train on random mini-batches sampled from D. Breaks temporal correlation, reuses old data, gives i.i.d.-ish gradients.
Problem 2 — Non-stationarity
The target keeps changing its value — the same weights θ we're updating also define the target y = r + γ·maxa′ Qθ(s′,a′), so the net is chasing a moving goal.
Trick 2 — Target Network
Keep a frozen copy θ̄. Compute targets with it. Hard-copy θ → θ̄ every C steps. Stops the target from chasing itself.

Crucially, the network doesn't learn from each fresh step. At every environment step we just append the new transition (s, a, r, s′) to the buffer D. Then, every update_every steps (a hyperparameter), we draw a random mini-batch of size batch_size from D and run one gradient step on that batch — decoupling data collection from parameter updates.

Replay buffer (animation) — click Sample mini-batch to highlight a random subset
DQN algorithm
1. Initialize replay memory capacity. 2. Initialize the network with random weights. 3. For each episode: 1. Initialize the starting state. 2. For each time step: 1. Select an action. • Via exploration or exploitation 2. Execute selected action in an emulator. 3. Observe reward and next state. 4. Store experience in replay memory. 5. Sample random batch from replay memory. 6. Preprocess states from batch. 7. Pass batch of preprocessed states to policy network. 8. Calculate loss between output Q-values and target Q-values. • Requires a second pass to the network for the next state 9. Gradient descent updates weights in the policy network to minimize loss.
Lab · build & run your own DQN

Pick hyperparameters, simulate training, then run the resulting policy on CartPole. Sliders re-evaluate the curve instantly — but Run Policy uses whatever has been trained so far, so hit Simulate first.

episode
0
step
0
sub-step
0
in current episode
replay buffer · each cell shows one stored transition (st → st+1 · action arrow · reward) · blue = LEFT · purple = RIGHT · gold ring = sampled this step
step 0 · 0 / 10k stored (replay capacity from slider) · viz shows newest 30 · last batch sampled:
Qθ · online network — values update every 1 step(s)
θ ← θ − α·∇θL   step intensity scales with α
θ̄ · target network — frozen between flashes
frozen · θ̄ ← θ in 200 online updates
live hparams ▶ α=1e-3 γ=0.99 hidden=64 |D|=10k online_C=1 target_C=200 upd
Prioritised Experience Replay (PER)
The idea
  • Some experiences may be more important than others (but might occur less frequently).
  • While randomly & uniformly selecting batches of experiences, experiences that occur rarely have practically small chance to be selected.
  • PER tries to change the sampling distribution by using a criterion to define the priority of each tuple of experience.

Vanilla DQN samples a mini-batch uniformly from the replay buffer D. PER instead samples each transition i in proportion to its TD-error magnitude |δᵢ| — surprising transitions get replayed more often because they carry the most learning signal, while well-learned transitions are skipped most of the time.

What gets priority?
Take in priority experiences where there is a big difference between our prediction and the TD target, since it means we have a lot to learn about that transition. We use the absolute magnitude of the TD error |δᵢ|:
replay buffer:   ( st, at, rt, st+1, t| )
Each stored tuple now carries its TD-error magnitude alongside the usual (s, a, r, s′), so the sampler can read off priorities in O(log N) using a sum-tree.
Replay buffer with priorities — click Sample mini-batch to draw weighted by |δ|
PER sampling rule + bias correction
# TD-error magnitude — the "surprise" of transition i δi = r + γ · maxa′ θ̄(s′, a′) − Qθ(s, a) # Priority (ε keeps |δ|=0 transitions still reachable) pi = ( |δi| + ε )α # α=0 ⇒ uniform · α=1 ⇒ pure greedy # Sample mini-batch with probability ∝ priority P(i) = pi / k pk # Importance-sampling weight — corrects the bias from non-uniform sampling wi = ( N · P(i) )β # β anneals 0.4 → 1.0 over training
Why the importance-sampling weight?
Sampling with priority means high-|δ| transitions show up far more than they would under uniform sampling — so the gradient is biased toward them. Multiplying each transition's loss by wi down-weights those over-sampled transitions and recovers an unbiased gradient. β starts small (so we lean into prioritisation early) and anneals to 1 by the end of training.
Overestimation problem in Q-learning
Problem — maximisation bias breaks Q-learning
  • At the beginning of training, Q-values are noisy — we are never sure whether the action with the maximum estimated Q-value is really the best one.
  • The agent tends to take a non-optimal action in a given state simply because it happens to have the maximum (noisy) Q-value.
  • This is called the overestimation of action values — the max in the Bellman target systematically picks the most positively-biased estimate.
  • As a consequence, the learning process becomes very messy and slow: targets keep pulling the network toward inflated values that don't reflect the true return.
Concretely, the DQN target yDQN = r + γ · maxa′θ̄(s′, a′) uses the same noisy net to both pick the action and evaluate it — so any positive noise gets selected and propagated.
Double DQN (DDQN)

Vanilla DQN's target uses a single maxa′ over the same network that proposes the action. Whenever θ̄ has noisy estimates, that max systematically picks the most over-estimated action — so the targets are biased upward and the network learns inflated Q-values. Van Hasselt et al. (2015) fixed this by decoupling action selection from action evaluation.

Solution — use two different networks
Use the online net Qθ to select the next action, but the target net Q̂θ̄ to evaluate it. Their noise is (mostly) independent — noise that fools one net rarely fools the other, so the cross-check cancels much of the overestimation.
Online net  Qθ
s′ Qθ(s′,a)
selects  a* = argmaxa′ Qθ(s′, a′)
Target net  Q̂θ̄
s′ θ̄(s′, a*)
evaluates  y = r + γ · Q̂θ̄(s′, a*)
DQN vs Double DQN target
# Vanilla DQN — single net does both select & evaluate y = r + γ · maxa′ θ̄(s′, a′) # Double DQN — online net selects, target net evaluates a* = argmaxa′ Qθ(s′, a′) ← online net θ y = r + γ · θ̄(s′, a*) ← target net θ̄

Everything else — replay buffer, target-net copy schedule, ε-greedy exploration, gradient-step rule — is identical to DQN. The only change is which network produces a* inside the target, and that single line is enough to substantially reduce overestimation in practice.

Why it works — intuition
Bias from max only survives when the same estimator is used to pick and to score. Two different (but related) estimators with independent noise will disagree about which action is best on noisy frames, so the cross-evaluation lands closer to the true value. In the D3QN tab you can toggle Double DQN on/off in the lab and watch the learning curve change.
Dueling Double DQN — D3QN

D3QN (Dueling Double DQN) stacks two orthogonal fixes on top of vanilla DQN: a target rule from Double DQN that kills the maximisation bias, and a network architecture from Dueling DQN that learns the value of a state separately from how much each action improves on it.

Ingredient 1 — Double DQN (target)
online net picks a* = argmaxa′ Qθ(s′,a′); target net evaluates θ̄(s′, a*). Covered in tab 04 DDQN.
Ingredient 2 — Dueling head (architecture)
Replace the single-stream Q-head with two: a scalar V(s) and per-action A(s,a). Recombine into Q(s,a). Many states care little about which action is taken — sharing V(s) accelerates learning.
Dueling architecture

A shared trunk processes the state, then the network splits into V(s) and A(s,a) heads. They recombine via a mean-subtraction so the decomposition is identifiable.

VANILLA DQN DUELING ARCHITECTURE Q(s, a₁) Q(s, a₂) Q(s, a₃) Q(s, a₄) s ∈ ℝ⁴ FC · ReLU FC · ReLU Q-HEAD INPUT HIDDEN 1 HIDDEN 2 Q(s, a) Q(s, a) = head( trunk(s) ) ↓ argmax over outputs a* = best action V(s) ← scalar "how good is s" A(s, a₁) A(s, a₂) A(s, a₃) A(s, a₄) how much better is each action than average Q(s, a₁) Q(s, a₂) Q(s, a₃) Q(s, a₄) s ∈ ℝ⁴ FC · ReLU FC · ReLU VALUE HEAD Q-HEAD ADVANTAGE HEAD INPUT SHARED 1 SHARED 2 Q(s, a) = V(s) + ( A(s, a) − mean A(s, ·) ) ↓ argmax over outputs a* = best action
From basic decomposition to the stable Dueling combine
# Basic decomposition — value + advantage Q(s, a) = V(s) + A(s, a) # Definition of the advantage A(s, a) = Q(s, a) − V(s) # Dueling combine — mean-subtraction stabilises the decomposition Q(s, a) = V(s) + ( A(s, a) − meana′ A(s, a′) )
Why split V and A?

The key insight behind this architecture is that for many states it is unnecessary to estimate the value of each action choice.

Take the Atari game Enduro as an example. Knowing whether to move left or right only matters when a collision is imminent. In some states it is of paramount importance to know which action to take, but in many other states the choice of action has no repercussion on what happens.

For bootstrapping-based algorithms, however, the estimation of state value is of great importance for every state. Dueling lets the network learn V(s) directly — shared across all actions in that state — instead of waiting for each action's Q-value to converge independently.

D3QN vs the others — training curves
Episode return — DQN family
Same env, same hyper-params. D3QN (Double + Dueling) is more stable and converges faster than either ingredient on its own.
Lab · toggle Double & Dueling

Turn each ingredient on/off and watch the curve shift. Both flags on ⇒ D3QN. With Dueling enabled, the architecture diagram splits into V-head and A-head.

Playground · pick an algorithm, an env, or play yourself

Each algorithm is shown at its typical trained quality on CartPole — switch the env to see how the same agent fares elsewhere. Pick Manual to drive the env yourself with the keyboard.

All curves on one axis
Episode return — one curve per lab tab (03 → 06)
DQN → +PER → DDQN → DDDQN: each ingredient stacks more reward on top. Toggle datasets in the legend to declutter.
Hyper-parameters at a glance
Hyper-parameterSymbol
Learning rateα
Discountγ
Hidden widthh
Replay buffer size|D|
Mini-batch sizeB
Online update everyConline
Target update everyC
Algorithms covered in this lab
03 · DQN
Q-net + replay + target net. The breakthrough that made Deep RL practical.
value · off-policy · discrete actions
PER · prioritised replay
Prioritised Experience Replay — sample transitions in proportion to |δᵢ| with an importance-sampling weight to keep the gradient unbiased. Drops into any DQN-family loop.
value · off-policy · replay rule
DDQN · Double target
Decouples action selection (online net) from action evaluation (target net) in the Bellman target. Cures the maximisation bias of vanilla DQN.
value · off-policy · target rule
Dueling · V + A split
Architectural change only: the head splits into a scalar V(s) and per-action A(s,a), recombined as Q = V + (A − mean A). Speeds up learning when many states have action-independent value.
value · off-policy · architecture
DDDQN / D3QN · stack
DDQN's target rule + Dueling's head, stacked. The strongest lab variant — and with PER added on top it becomes Rainbow-lite.
value · off-policy · combined