Deep RL · Policy Optimisation

Actor-Critic & Policy Gradients

The natural sequel to the Deep RL lab. DQN learns a value and acts greedily — but what if the action space is continuous, or the optimal policy is stochastic? Meet REINFORCE, A2C, A3C, PPO, DDPG and SAC.

DQN — a value network: state in, one Q-value per action out

state s → hidden · ReLU → Q(s,a) for every action action = argmax_a Q

This is the network the Deep RL lab ended on. The input (highlighted) is the state s; each output neuron is the estimated value Q(s,a) of taking that action. To act, DQN is deterministic: it simply takes the argmax — the action with the highest Q-value (gold ring). No sampling, no distribution — one greedy choice. That single design decision is exactly what breaks down on the next page.

DQN was great — until it wasn't

The Deep RL lab ended at DDDQN — a Q-network that learns Q(s,a) for every action and picks argmax_a Q(s,a). That works beautifully on Atari (18 discrete buttons) and Cartpole (push left / push right). But the moment we touch a robot arm, a self-driving car, or anything with a steering wheel and a throttle, DQN breaks.

The trouble is the argmax. With a discrete action set you can enumerate every action and pick the largest Q. With a continuous action — a torque in [-2, +2] Newton-metres, or a steering angle in [-30°, +30°] — there are infinitely many actions. You cannot enumerate them.

Inside π_θ — the policy network, end to end

state s → hidden · ReLU → π_θ(a|s) · softmax the policy network

It's the same MLP you used for Q(s,a) in the DQN lab — only the head changes. The input (highlighted) is the state s; the last layer emits raw logits that a softmax squeezes into a probability distribution where each output neuron is one π_θ(a_i|s) and they all sum to 1. We don't take an argmax — we sample an action from this distribution, which is exactly what lets the policy stay stochastic and differentiable.
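The softmax-then-sample step can be sketched in a few lines of plain Python. This is only an illustration: the logits are made-up numbers standing in for the last layer's output, and `softmax` and `sample_action` are hypothetical helper names, not part of any library.

```python
import math
import random

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1 (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng=random):
    """Sample an action index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]   # assumed output of the final layer for one state
probs = softmax(logits)
greedy = max(range(len(logits)), key=lambda i: logits[i])  # what DQN would do
sampled = sample_action(probs)                             # what the policy does
print(probs, greedy, sampled)
```

The greedy index is always the same for a given state, while the sampled index varies from call to call in proportion to the probabilities.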

Three places DQN cannot reach

∞

Continuous actions

Steering · torque · throttle

argmax over an infinite set is not tractable. You'd need a continuous optimiser inside every action call.

π(·|s)

Stochastic Policy

Rock-paper-scissors

A deterministic policy gets exploited. The optimal solution is a 1/3, 1/3, 1/3 mix — DQN cannot represent that.

∂π/∂θ

Direct optimisation

Skip the value

If you only care about acting well, why fit a whole Q-function? Differentiate the expected return with respect to the policy parameters and walk uphill.

π(·|s) 1 · Stochastic Policy — when two states look identical

Frozen Lake · perceptual aliasing

Suppose the agent can't see its absolute position — it only senses its neighbouring tiles. On this lake the two purple tiles have the exact same view (ice on the left, ice on the right): they are aliased. To the network they are the same state.

A deterministic policy maps that one view to one action. But the goal is to the right of the left tile and to the left of the right tile — opposite directions! Whichever it commits to, one twin skates straight into a hole, doomed with 100% certainty.

A stochastic policy goes left/right 50/50 from that view, so neither start is doomed — exploration carries it to the goal from both. Two bonuses fall out for free:

We get an exploration / exploitation trade-off baked into the policy — no ε-greedy schedule to hand-tune.
We get rid of the perceptual-aliasing problem: identical observations can still map to a distribution of actions.

Two agents start on the aliased (purple) tiles — identical view.

✊ ✋ ✌ A second face — Rock · Paper · Scissors

Stochastic Policy · the ⅓-⅓-⅓ Nash mix

Switch domains entirely: you're playing repeated Rock-Paper-Scissors against an opponent who watches your habits and best-responds. Any deterministic policy — “always ✊” — is instantly exploited: the opponent just plays ✋ forever and you lose every round.

The only unexploitable play is the mixed policy ⅓ ✊, ⅓ ✋, ⅓ ✌ — the game's Nash equilibrium. There is no single best action to argmax; the optimum is a distribution. A value-greedy DQN literally cannot represent that — a stochastic policy gets it for free.
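To make "unexploitable" concrete, the sketch below computes how much a best-responding opponent can win per round against a given mix. The payoff matrix is the standard win = +1, loss = −1, draw = 0 scoring; `best_response_value` is a hypothetical helper name.

```python
# Payoff to the player choosing the ROW, against the player choosing the COLUMN.
# Order: rock, paper, scissors.
PAYOFF = [
    [0, -1, 1],    # rock     vs rock / paper / scissors
    [1, 0, -1],    # paper
    [-1, 1, 0],    # scissors
]

def best_response_value(policy):
    """Expected per-round payoff of an opponent who best-responds to `policy`."""
    values = []
    for opp in range(3):
        # Opponent's expected payoff when it always plays `opp` against our mix.
        values.append(sum(policy[me] * PAYOFF[opp][me] for me in range(3)))
    return max(values)

always_rock = [1.0, 0.0, 0.0]
uniform = [1 / 3, 1 / 3, 1 / 3]
print(best_response_value(always_rock))  # opponent wins every round
print(best_response_value(uniform))      # opponent cannot gain anything
```

Any deviation from the uniform mix gives the opponent a strictly positive best-response value, which is why the optimum here is a distribution rather than a single action.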

Opponent best-responds to your move frequencies.

∂π/∂θ 2 · Direct optimisation — walk uphill on the return

Skip the value · gradient ascent on J(θ)

Value methods fit the whole Q-surface and then act greedily — a lot of machinery if all you really want is to act well. Policy gradient differentiates the expected return through the policy and simply steps: θ ← θ + α ∇_θ J(θ).

The heat-map is J(θ) over two policy parameters — brighter is higher return. The ball follows the gradient straight uphill to a peak. The honest catch (it's in the table just below): it converges to a local optimum, not always the global one. Hit new start θ a few times and watch it sometimes settle on the smaller hill.
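The local-optimum catch is easy to reproduce on a toy surface. The sketch below uses a made-up one-dimensional J(θ) with a tall hill at θ = +2 and a smaller one at θ = −2 (an assumption for illustration, not the surface shown in the heat-map) and runs plain gradient ascent from two starting points.

```python
import math

def J(theta):
    """Hypothetical 1-D return surface: tall hill at +2, smaller hill at -2."""
    return math.exp(-(theta - 2.0) ** 2) + 0.5 * math.exp(-(theta + 2.0) ** 2)

def grad_J(theta):
    """Analytic derivative of J with respect to theta."""
    return (-2.0 * (theta - 2.0) * math.exp(-(theta - 2.0) ** 2)
            - (theta + 2.0) * math.exp(-(theta + 2.0) ** 2))

def ascend(theta, alpha=0.1, steps=500):
    """Plain gradient ascent: theta <- theta + alpha * dJ/dtheta."""
    for _ in range(steps):
        theta += alpha * grad_J(theta)
    return theta

left = ascend(-1.0)    # starts on the slope of the smaller hill
right = ascend(1.0)    # starts on the slope of the taller hill
print(left, J(left))
print(right, J(right))
```

Both runs stop where the gradient vanishes, but only the second one reaches the higher peak: the update rule has no way to know a better hill exists elsewhere.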

∞ 3 · Continuous actions — argmax has nothing to enumerate

Steering · torque · throttle

DQN acts by argmax_a Q(s,a): score every action, take the largest bar. Fine for 18 Atari buttons. But the front-wheel angle of a car lives in [−30°, +30°] — an infinite set. There is nothing finite to loop over.

Discretising into bins is the usual hack, and you can drag the slider to feel the trap: coarse bins miss the true optimum (red ring), while fine bins find it but explode, because k bins over d action dimensions need kᵈ outputs (10 bins per dimension on a 7-joint arm is already 10⁷). A policy instead emits the parameters of a distribution (a Gaussian's μ, σ) and samples the action from it directly, with no search over actions at all.

🧠 The fix — one neuron that outputs the angle directly

Deterministic continuous policy · a = μ(s) ∈ [−30°, +30°]

No bins, no argmax. The network keeps the same body but ends in a single output neuron; a tanh squashes it to [−1, 1], then we scale by 30 to get the front-wheel angle in [−30°, +30°]. The continuous action is read straight off the head — and because it's a smooth function of θ, we can push gradients through it.
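A minimal sketch of that output head, assuming the raw network output is a single unbounded number. The 30° limit comes from the text; the function names are hypothetical.

```python
import math

MAX_ANGLE_DEG = 30.0  # steering limit taken from the text

def steering_angle(pre_activation):
    """Map an unbounded network output to a front-wheel angle in degrees."""
    return MAX_ANGLE_DEG * math.tanh(pre_activation)

def d_angle_d_pre(pre_activation):
    """Derivative of the angle w.r.t. the raw output: 30 * (1 - tanh^2)."""
    return MAX_ANGLE_DEG * (1.0 - math.tanh(pre_activation) ** 2)

for z in (-5.0, -0.5, 0.0, 0.5, 5.0):
    print(z, steering_angle(z), d_angle_d_pre(z))
```

The derivative is largest near zero and shrinks toward the limits, so gradients still flow through the head but become small once the output saturates.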

Value-based vs. policy-based — at a glance

Property	Value-based (DQN family)	Policy-based (this lab)
Learns	Q(s,a)	π(a|s)
Picks action via	argmax over Q	Sample from π
Discrete actions?	✅ Native	✅ Categorical π
Continuous actions?	❌ argmax intractable	✅ Gaussian / Tanh π
Stochastic policies?	❌ Always greedy	✅ Built-in

Policy Gradient (PG) — the general recipe

Policy Gradient (PG) is the whole family this lab is built on, resting on one idea: raise the log-probability of actions that paid off, lower the ones that didn't. Let's first pin down the general estimator and the training loop every method here inherits — then build its simplest concrete instance, REINFORCE, and its variance-reducing upgrade, the baseline, further down this page.

The new objective: maximise expected return

Policy-gradient objective

J(θ) = E [ R(τ | π_θ) ]   // expected return under the policy
∇_θ J(θ) = E_τ [ ∇_θ log π_θ(a|s) · R(τ) ]   // Policy Gradient Theorem
θ ← θ + α · ∇_θ J(θ)   // just gradient ascent (α = learning rate)

Instead of fitting a value table, we parameterise the policy directly as π_θ(a|s) — a neural network whose output is a probability over actions (or, for continuous control, the parameters of a distribution). We then compute the gradient of the expected return with respect to θ and walk uphill.

The magic is the log-likelihood trick: ∇ E[R] = E[∇ log π · R]. The right-hand side is a sample-able expectation — we just roll out the policy, multiply each log-prob by the return that followed, and that's an unbiased gradient estimate. No bootstrapping, no targets, no replay buffer.

The Policy Gradient Theorem — where the log comes from

We can't differentiate J(θ) directly, because θ sits inside the distribution we're averaging over, not inside the thing being averaged. The fix is one line of calculus — the log-derivative (score-function) trick:

Deriving the theorem

J(θ) = E [ R(τ | π_θ) ] = Σ_τ P(τ;θ) R(τ)   // expected return = a P-weighted sum
∇_θ J(θ) = ∇_θ Σ_τ P(τ;θ) R(τ)   // sum over trajectories τ
         = Σ_τ ∇_θ P(τ;θ) R(τ)   // gradient of a sum
         = Σ_τ P(τ;θ) / P(τ;θ) · ∇_θ P(τ;θ) R(τ)   // ×1 trick: multiply & divide by P
         = Σ_τ P(τ;θ) · ∇_θ P(τ;θ) / P(τ;θ) · R(τ)   // regroup
         = Σ_τ P(τ;θ) ∇_θ log P(τ;θ) R(τ)   // since ∇log P = ∇P / P
         = E_τ[ ∇_θ log P(τ;θ) · R(τ) ]   // a P-weighted sum = an expectation
and log P(τ;θ) = log p(s₀) + Σ_t [ log π_θ(a_t|s_t) + log p(s_t+1|s_t,a_t) ]
so ∇_θ log P(τ;θ) = Σ_t ∇_θ log π_θ(a_t|s_t)   // dynamics have no θ → they vanish!
⇒ ∇_θ J(θ) = E_τ[ Σ_t ∇_θ log π_θ(a_t|s_t) · R(τ) ]   // Policy Gradient Theorem

Two things make this theorem so useful. First, the environment dynamics p(s_t+1|s_t,a_t) don't depend on θ, so when we take the log-gradient they drop out completely — we never need a model of the world. Second, the result is an expectation, so we estimate it by simply rolling out the policy and averaging ∇_θ log π_θ(a|s) weighted by the return R(τ).

Reading it intuitively: ∇_θ log π_θ(a|s) points in the direction that makes action a more likely; multiplying by R(τ) means good trajectories push their actions up and bad ones push theirs down. That is the entire idea behind every algorithm in this lab — REINFORCE is just this estimator with R(τ) = the Monte-Carlo return.
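The identity can be checked numerically on the smallest possible case: a one-step problem with three actions and a softmax policy over logits θ. The rewards and logits below are made-up numbers, and the expectation is computed exactly by enumerating actions, so the comparison with a finite-difference gradient involves no sampling noise.

```python
import math

def softmax(theta):
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [x / z for x in e]

def expected_return(theta, rewards):
    """J(theta) = sum_a pi(a) * R(a) for a one-step episode."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))

def score_function_gradient(theta, rewards):
    """E_a[ grad log pi(a) * R(a) ], using grad log pi(a) = one_hot(a) - pi."""
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for a, (p, r) in enumerate(zip(pi, rewards)):
        for i in range(len(theta)):
            grad_log_pi = (1.0 if i == a else 0.0) - pi[i]
            grad[i] += p * grad_log_pi * r
    return grad

def finite_difference_gradient(theta, rewards, h=1e-6):
    """Central-difference estimate of dJ/dtheta, used only as a reference."""
    grad = []
    for i in range(len(theta)):
        up = list(theta); up[i] += h
        down = list(theta); down[i] -= h
        grad.append((expected_return(up, rewards)
                     - expected_return(down, rewards)) / (2 * h))
    return grad

theta = [0.2, -0.4, 1.0]     # assumed logits
rewards = [1.0, 0.0, 2.0]    # assumed reward per action
print(score_function_gradient(theta, rewards))
print(finite_difference_gradient(theta, rewards))
```

The two printed vectors agree to numerical precision, which is the log-derivative trick at work: the gradient of an expectation has been rewritten as an expectation that can be sampled.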

REINFORCE — Monte-Carlo policy gradient (Williams, 1992)

REINFORCE is the simplest possible algorithm that learns by walking up the policy-gradient hill. The recipe is just three lines:

📄 Original paper — Williams (1992), Simple statistical gradient-following algorithms for connectionist reinforcement learning.

REINFORCE — one episode

1. Roll out a whole episode with π_θ: τ = (s₀, a₀, r₀, s₁, a₁, r₁, …, s_T)
2. For every step t, compute the remaining return: G_t = Σ_k=t..T γ^(k−t) · r_k
3. Take one gradient step: θ ← θ + α · Σ_t ∇_θ log π_θ(a_t|s_t) · G_t

Pseudocode · the full training loop

# ── INITIALISE ──────────────────────────────────────────
initialize policy network πθ                  # random parameters θ
for iteration = 1, 2, … :                     # …and repeat with the updated params
    # ── PLAY n GAMES (rollouts): sample, don't argmax ───
    D ← [ ]
    for game in 1 … n:
        τ ← [ ] ; s ← env.reset()
        while not done:
            a ~ πθ(·|s)                       # sample an action from the policy
            s′, r, done ← env.step(a)
            τ.append( (s, a, r) ) ; s ← s′
        D.append(τ)
    # ── LABEL every decision good / bad by its outcome ──
    for τ in D:
        for t in τ:
            G_t ← Σ_k≥t γ^(k−t) · r_k         # return after t = the "label"
    # ── UPDATE: win → raise prob (↑), lose → lower (↓) ──
    J ← (1/n) Σ_τ Σ_t log πθ(a_t|s_t) · G_t   # weight each log-prob by its return
    θ ← θ + α · ∇θ J                          # one gradient-ascent step on J(θ)

Legend: πθ used (sample / score action) · θ updated (gradient ascent) · ▲ new in REINFORCE: weight ∇log π by the Monte-Carlo return G_t
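As a runnable sketch of the loop, here is REINFORCE on the simplest possible case: one-step episodes with a fixed reward per action, so G_0 is just that reward. The environment, the reward values, the learning rate and the name `train_reinforce` are all assumptions made for illustration; a real task would roll out full episodes and use a neural network instead of a table of logits.

```python
import math
import random

def softmax(theta):
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [x / z for x in e]

def train_reinforce(rewards, n_episodes=2000, alpha=0.1, seed=0):
    """REINFORCE with a softmax policy over logits, one-step episodes."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)          # start from a uniform policy
    for _ in range(n_episodes):
        pi = softmax(theta)
        # PLAY: sample an action from the policy (never argmax).
        a = rng.choices(range(len(pi)), weights=pi, k=1)[0]
        # LABEL: in a one-step episode the return-to-go is the reward itself.
        G = rewards[a]
        # UPDATE: theta += alpha * grad log pi(a) * G, with
        # grad log pi(a) = one_hot(a) - pi for a softmax over logits.
        for i in range(len(theta)):
            grad_log_pi = (1.0 if i == a else 0.0) - pi[i]
            theta[i] += alpha * grad_log_pi * G
    return softmax(theta)

final_policy = train_reinforce([1.0, 0.0])   # action 0 pays 1, action 1 pays 0
print(final_policy)
```

With these rewards only the paying action ever triggers an update, so its probability climbs steadily toward 1.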

So what is G_t? It's the return-to-go (a.k.a. remaining return): the total discounted reward collected from step t to the end of the episode. It's the score we hang on the action a_t — "given that I took this action here, how did the rest of the episode actually turn out?"

G_t = r_t + γ r_t+1 + γ² r_t+2 + … + γ^(T−t) r_T

Two things to notice. The discount γ ∈ [0, 1) makes reward that arrives sooner count for more than far-future reward. And G_t sums only rewards from t onward — an action is never credited for reward that was already banked before it was taken (that's the "reward-to-go" idea).

In the update, G_t is simply the weight on each action's log-prob gradient: a big positive G_t shoves π_θ(a_t|s_t) up, a negative one shoves it down. Because it's the actual sampled return, not a bootstrapped estimate, it's unbiased, but it swallows every random event for the rest of the episode, which is exactly why it's so noisy (the variance problem below).
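Computing every G_t in one backward pass avoids re-summing the tail of the episode for each step. A minimal sketch (the function name `returns_to_go` is hypothetical):

```python
def returns_to_go(rewards, gamma):
    """Return [G_0, ..., G_T] with G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each G_t reuses the already-computed G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(returns_to_go([1.0, 1.0, 1.0], gamma=0.5))
```

The recursion G_t = r_t + γ·G_{t+1} is the same sum as the expanded formula above, just evaluated from the end of the episode.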

Learning curve — and why it's noisy

This is a simulated REINFORCE training run on Lunar Lander — return per episode (light) and a 50-episode running mean (bold). Notice the massive episode-to-episode variance: a single unlucky rollout can drop the return by 200. That noise is precisely what the next tab fights.

REINFORCE-with-baseline — the algorithm

Two networks: actor & baseline

actor: π_θ(a|s)   // outputs action probabilities
baseline: V_φ(s)   // scalar value estimate
1. Roll out an episode.
2. Compute G_t for every step.
3. Compute advantage: A_t = G_t − V_φ(s_t)
4. Update actor: θ ← θ + α_π · ∇_θ log π(a_t|s_t) · A_t
5. Update baseline: φ ← φ − α_V · ∇_φ ( G_t − V_φ(s_t) )²   // gradient descent on the MSE to G_t

The baseline trick: no added bias, much less variance

Here is a beautiful mathematical fact: for any function b(s) that depends only on the state (not the action), subtracting it from the return inside the gradient is free — the expected gradient is unchanged.

Why the baseline is unbiased

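E_a [ ∇ log π(a|s) · b(s) ] = b(s) · Σ_a π(a|s) ∇ log π(a|s)   // b(s) does not depend on a
                            = b(s) · Σ_a ∇ π(a|s)   // since π ∇ log π = ∇ π
                            = b(s) · ∇ Σ_a π(a|s) = b(s) · ∇ 1 = 0   // probabilities sum to 1
⇒ ∇ J = E [ ∇ log π(a|s) · ( G_t − b(s_t) ) ]   // same gradient, lower variance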

So we can subtract anything that's a function of state. The standard choice is the state-value function V(s): it is close to the variance-minimising baseline and easy to learn. That gives us the advantage.
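A two-action example shows both halves of the claim at once. The sketch below computes the exact mean and variance of the sampled gradient (with respect to the first logit of a softmax policy) with and without a baseline. The returns 10 and 12 are made-up numbers, chosen so that both are large and positive.

```python
def gradient_stats(pi, returns, baseline):
    """Exact mean and variance of the sampled gradient w.r.t. logit 0."""
    samples = []
    for a, G in enumerate(returns):
        score = (1.0 if a == 0 else 0.0) - pi[0]   # d log pi(a) / d theta_0
        samples.append(score * (G - baseline))
    mean = sum(p * g for p, g in zip(pi, samples))
    var = sum(p * (g - mean) ** 2 for p, g in zip(pi, samples))
    return mean, var

pi = [0.5, 0.5]
returns = [10.0, 12.0]                          # assumed return for each action
V = sum(p * G for p, G in zip(pi, returns))     # state value = 11.0

print(gradient_stats(pi, returns, baseline=0.0))   # no baseline
print(gradient_stats(pi, returns, baseline=V))     # baseline = V(s)
```

The mean gradient is identical in both cases, while the variance drops from 30.25 to 0 in this tiny example. Without the baseline, both actions get pushed hard simply because all returns are positive; with it, only the difference between them matters.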

Before vs. after — same task, less noise

Red = vanilla REINFORCE. Green = REINFORCE-with-baseline. Same final return, same number of episodes — but the green curve is dramatically smoother. That smoothness translates directly into being able to use a larger learning rate, which translates into faster real-world learning.

From REINFORCE to Actor-Critic

The policy gradient is the same one REINFORCE used — every Actor-Critic method just changes the weight on ∇_θ log π. REINFORCE weights it by the noisy Monte-Carlo return G_t; Actor-Critic swaps that for a bootstrapped estimate built from the critic's value V(s).

The one change Actor-Critic makes

REINFORCE: ∇_θ J(θ) = E_τ [ ∇_θ log π_θ(a|s) · G_t ]   // full Monte-Carlo return
G_t ──swap──▶ r + γ V(s′)   // one real reward, then the critic's value of the next state
Actor-Critic: ∇_θ J(θ) ≈ E_τ [ ∇_θ log π_θ(a|s) · ( r + γ V(s′) ) ]   // bootstrapped from the critic

Combining policy and value learning

Actor π_θ(a|s)

Outputs probabilities over actions (or a distribution mean/std for continuous). Updated by the policy-gradient with the critic's advantage.

Critic V_φ(s)

Outputs a single scalar — the expected return from this state. Updated by TD regression to r + γ V(s').

The actor picks the action. The critic tells the actor whether the action was better or worse than expected. The actor takes a tiny gradient step accordingly. Round trip: ~5 ms. No full episode required.

Actor-Critic stitches the two families together: the actor is a policy-based learner, the critic is a value-based learner, and each one patches the other's weakness.

The actor is the policy π_θ(a|s) — it selects actions and is what we ultimately deploy. The critic learns a value function V_φ(s) (or Q_φ(s,a)) — it never picks actions; it just critiques the actor's choices with evaluative feedback.

The critic's signal: the TD error

δ_t = r_t + γ V_φ(s_t+1) − V_φ(s_t)   // TD error: how wrong the critic was
actor: θ ← θ + α ∇_θ log π_θ(a_t|s_t) · ( r_t + γ V_φ(s_t+1) )   // nudge policy by the critique
critic: φ ← φ + α δ_t ∇_φ V_φ(s_t)   // shrink its own TD error (δ²)

Instead of waiting for the noisy Monte-Carlo return G_t, the actor reads the critic's bootstrapped estimate r + γV(s′) (one real reward plus the critic's value of the next state) and nudges its policy with that critique instead of a whole episode's return. Meanwhile the critic keeps fitting V(s) more accurately, so the signal the actor learns from gets sharper as training goes on.

That bargain buys three things at once: lower variance (the critic's bootstrapped estimate is far steadier than a whole-episode return), faster learning (we update every step instead of every episode), and a natural fit for continuous action spaces — exactly the limits of purely policy-based or purely value-based methods.

A2C (Advantage Actor-Critic)

📄 Original actor-critic paper: Konda & Tsitsiklis (2000), Actor-Critic Algorithms.

The one change A2C makes

Actor-Critic: ∇_θ J(θ) ≈ E_τ [ ∇_θ log π_θ(a|s) · ( r + γ V(s′) ) ]   // the critic's bootstrapped estimate of Q(s,a)
A_t = Q(s,a) − V(s)   // the advantage
with Q(s,a) ──swap──▶ r + γ V(s′)   // bootstrap the action-value
A2C: ∇_θ J(θ) = E_τ [ ∇_θ log π_θ(a|s) · A_t ]   // A_t = r + γ V(s′) − V(s)

A2C makes one more upgrade: it subtracts the state value V(s) as a baseline, turning the bootstrapped value r + γV(s′) ≈ Q(s,a) into the advantage A_t = Q(s,a) − V(s). Where the raw value only says "how good was the outcome?", A_t asks the sharper question: "how much better (or worse) was this action than the critic's average expectation for the state?"

That centring is what makes the gradient informative — actions that beat the baseline V(s) get pushed up, actions that fall short get pushed down — while keeping variance low.

Actor-Critic — the leap to bootstrapping

TD-style advantage

TD advantage: A_t = r_t + γ · V(s_t+1) − V(s_t)
actor: θ ← θ + α_π · ∇_θ log π(a_t|s_t) · A_t
critic: φ ← φ − α_V · ∇_φ ( r_t + γ V(s_t+1) − V(s_t) )²   // descend the squared TD error, treating the target r_t + γ V(s_t+1) as a constant

One-step Actor-Critic — pseudocode

The two updates happen every transition. The critic lines fit the value function; the actor line nudges the policy using the advantage A_t — which, for one-step Actor-Critic, is the TD error r + γV(s′) − V(s).

initialize actor πθ and critic Vφ                  # random θ, φ
for each episode:
    s ← env.reset()
    while not done:
        a ~ πθ(·|s)                                # ACTOR selects an action
        s′, r, done ← env.step(a)
        # ── CRITIC: evaluate → advantage (TD error) ──
        A_t ← r + γ Vφ(s′) − Vφ(s)                 # advantage = TD error (take Vφ(s′) = 0 if s′ is terminal)
        φ ← φ + α_V · A_t ∇φ Vφ(s)                 # ← UPDATE CRITIC (fit value)
        # ── ACTOR: improve policy with the critique ──
        θ ← θ + α_π · ∇θ log πθ(a|s) · A_t         # ← UPDATE ACTOR (raise/lower π)
        s ← s′                                     # online: one step per transition

Legend: network used (forward) · actor update θ (the policy) · critic update φ (the value) · ▲ new in AC: bootstrapped TD advantage
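The inner-loop body can be sketched for the tabular case, where the critic is a list of state values and the actor is a table of action preferences passed through a softmax. The function name, the default step sizes and the tabular setting are assumptions for illustration; with neural networks the two updates become optimiser steps on the corresponding losses.

```python
import math

def softmax(prefs):
    m = max(prefs)
    e = [math.exp(p - m) for p in prefs]
    z = sum(e)
    return [x / z for x in e]

def actor_critic_step(theta, V, s, a, r, s_next, done,
                      gamma=0.99, alpha_pi=0.1, alpha_v=0.1):
    """Apply one online actor-critic update in place; return the TD error.

    theta[s] is a list of action preferences, V[s] is a state value.
    """
    bootstrap = 0.0 if done else V[s_next]     # no value beyond a terminal state
    td_error = r + gamma * bootstrap - V[s]    # advantage estimate A_t
    pi = softmax(theta[s])                     # policy before the update
    V[s] += alpha_v * td_error                 # critic: for a table, grad V(s) = 1
    for i in range(len(theta[s])):
        grad_log_pi = (1.0 if i == a else 0.0) - pi[i]
        theta[s][i] += alpha_pi * grad_log_pi * td_error   # actor
    return td_error

theta = [[0.0, 0.0], [0.0, 0.0]]   # two states, two actions
V = [0.0, 0.0]
delta = actor_critic_step(theta, V, s=0, a=1, r=1.0, s_next=1, done=False)
print(delta, V, theta)
```

A positive TD error means the outcome beat the critic's expectation, so the chosen action's preference rises and the state's value estimate moves up; a negative one does the opposite.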

Why a stochastic policy wins — games a deterministic policy can't

A2C's actor is a stochastic policy π(a|s): it outputs a distribution over actions and samples — unlike a value-greedy or deterministic policy a = μ(s) that always picks one fixed action. Why does that matter? When the best behaviour is to be unpredictable, or when many states look identical, a deterministic policy can be read, exploited, or trapped. Each game below pits a deterministic policy against a stochastic one — watch deterministic lose. Representing a distribution over actions is exactly the superpower a policy-gradient method like A2C gives you.

Now you play against the model. Pick the Deterministic opponent — it always throws ✊, so just play ✋ every time and you win every round. A fixed policy is a habit, and habits get exploited.

Switch to the Stochastic opponent (⅓ ✊, ⅓ ✋, ⅓ ✌ — the Nash equilibrium) and try again: there's no pattern to punish, so however you play your score just hovers around zero — unexploitable. The optimum here is a distribution, something a = μ(s) can't represent. (▶ Auto-play me plays the best counter for you.)

A3C — Asynchronous Advantage Actor-Critic

📄 Original paper — Mnih et al. (2016), Asynchronous Methods for Deep Reinforcement Learning.

A3C came first: A2C is its synchronous variant. A3C uses the same loss, but each worker runs its own copy of the env on its own CPU thread and pushes gradients to a shared parameter server without waiting. The result feels like SGD on a chaotic mini-batch, and somehow, it works.

One worker · one inner loop

loop:
    θ' ← global.θ                       // ⇣ pull: local net updated from global
    collect n steps with π_θ'
    compute A_t for each step
    compute ∇L (actor + critic + entropy)
    // no lock! no average! just shoot it at the server:
    global.θ ← global.θ − α · ∇L        // L is a loss, so step downhill

Legend: local net used (π_θ' rollout) · local net updated (θ' ← global.θ, pull) · ▲ new in A3C: asynchronous, lock-free push of ∇L into the shared global net

Why no locks doesn't break

Two workers pushing gradients at the same time will sometimes overwrite each other. This is called Hogwild! updating. Its convergence guarantees were proved for sparse updates on convex problems; for deep networks, the evidence that it works is empirical. In practice the workers are also de-synchronised (they're at different points in different episodes), so their gradients are diverse, which helps reduce correlation.

The big advantage of A3C over single-thread methods at the time was not faster gradients per second — it was that the diversity of trajectories acted like a replay buffer would in DQN, decorrelating updates and removing the need for one.

How A3C learns — one brain, many hands

A3C is built around a single global network holding the shared parameters (θ for the actor, φ for the critic). Around it run many independent workers, each with its own copy of the networks and its own environment instance — so exploration happens in parallel across CPU threads.

What each worker computes (n-step)

n-step return: R_t = Σ_k=0..n−1 γ^k r_t+k + γⁿ V_φ(s_t+n)
advantage: A_t = R_t − V_φ(s_t)
actor loss: L_π = − log π_θ(a_t|s_t) · A_t − β · H[π]   // entropy bonus β
critic loss: L_V = ( R_t − V_φ(s_t) )²
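A sketch of how a worker might turn one n-step rollout into targets and advantages. It assumes rewards are indexed so that `rewards[k]` is the reward received after acting at step t+k (the same convention as G_t earlier), and that each step in the rollout bootstraps from the value of the state reached at the end of the rollout. The function name is hypothetical.

```python
def n_step_targets(rewards, values, bootstrap_value, gamma):
    """Compute bootstrapped returns and advantages for one rollout.

    rewards[k] = r_{t+k}, values[k] = V(s_{t+k}) for k = 0..n-1.
    bootstrap_value = V(s_{t+n}), or 0.0 if the rollout ended in a terminal state.
    """
    targets = [0.0] * len(rewards)
    running = bootstrap_value
    # Backward pass: R_k = r_k + gamma * R_{k+1}, seeded with the critic's value.
    for k in reversed(range(len(rewards))):
        running = rewards[k] + gamma * running
        targets[k] = running
    advantages = [R - v for R, v in zip(targets, values)]
    return targets, advantages

targets, advantages = n_step_targets(
    rewards=[1.0, 1.0], values=[0.5, 0.5], bootstrap_value=2.0, gamma=0.5)
print(targets, advantages)
```

The first step of the rollout gets the full n-step return, while later steps get progressively shorter ones, all anchored on the same bootstrap value.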

① Pull

each worker copies the latest global θ, φ into its local nets, then rolls out n steps in its own env.

② Push (async)

it computes ∇L locally and fires it straight at the global network through a shared optimiser (RMSProp/Adam) — no lock, no waiting (Hogwild!).

③ Decorrelate

workers sit at slightly different policy versions, so their experience is diverse — this replaces the replay buffer and stabilises training.

Adapted from APXML · Asynchronous Advantage Actor-Critic (A3C).

Global ⇄ workers — actor & critic nets with live neuron values

Each component holds two tiny networks: a purple actor π_θ (state → action probs) and a teal critic V_φ (state → value). The number on every neuron is its live activation from a real forward pass of the same input state s (top-left). Top = the GLOBAL nets; each worker has its own local copy. Use ▷ Step to advance one rollout tick at a time — each click, every worker takes a step (its bar fills); when a bar completes the worker pushes its gradient ⇡ into the global net (neurons shift and flash gold), bumps its episode counter, and syncs ⇣ the fresh global weights back. Each worker's progress bar shows its episode count and rollout %; the workers run desynchronised, so they complete at different ticks. Hit ▶ Play to auto-step.

PPO — the workhorse (Schulman et al., 2017)

📄 Original paper — Schulman et al. (2017), Proximal Policy Optimization Algorithms.

Vanilla policy gradient has a vicious failure mode. If a single update is too large, the policy can shift so far that the next batch of rollouts comes from a totally different distribution — and the gradient estimate from that batch becomes useless. The policy collapses, and you have to start over.

Trust-Region Policy Optimisation (TRPO) solved this with a hard constraint on the KL divergence between old and new policies, enforced with second-order optimisation. Beautiful, but heavy. PPO pursues the same idea, but is simple enough to fit on a napkin.

PPO-clip objective

ratio: r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t)
surrogate: L₁ = r_t(θ) · A_t
           L₂ = clip( r_t(θ), 1−ε, 1+ε ) · A_t
PPO-clip: L^CLIP = E [ min( L₁, L₂ ) ]
ε ≈ 0.2 ⇒ "the objective stops rewarding any move of the ratio beyond 20%"

Two ideas, that's the whole trick:

ratio r_t — instead of A2C's raw log π, PPO measures how much the new policy differs from the old one that collected the batch: r_t = π_θ / π_{θ_old}. r_t = 1 means "unchanged"; r_t > 1 means the new policy makes that action more likely.
clip — the surrogate r_t·A_t would happily push r_t far from 1 on a single good batch and blow up the policy. So PPO also computes the clipped version (r_t pinned to [1−ε, 1+ε]) and keeps the min of the two. Once a step has moved the ratio past ±ε, the clip flattens the gradient — there's no more reward for moving further. That's the trust region, enforced with one min.
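For a single sample, the clipped objective is two lines of arithmetic. The sketch below is a direct transcription of the formula, with ε = 0.2 as the default; the function name is hypothetical.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample PPO-clip objective: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Good action (A > 0): credit is capped once the ratio passes 1 + eps.
print(ppo_clip_objective(1.1, 1.0), ppo_clip_objective(1.5, 1.0))
# Bad action (A < 0): no extra credit for pushing the ratio below 1 - eps.
print(ppo_clip_objective(0.9, -1.0), ppo_clip_objective(0.5, -1.0))
# Bad action made MORE likely: not clipped, the full penalty applies.
print(ppo_clip_objective(1.5, -1.0))
```

The last case is the reason for the `min`: clipping only ever removes incentive, it never hides a mistake. If the new policy has made a bad action more likely, the unclipped term is the smaller one and its gradient still pulls the ratio back.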

What PPO changes from A2C

PPO keeps A2C's whole backbone — a stochastic actor, a value critic, GAE advantages, parallel envs, the entropy bonus. It changes three things, and those three are what turn A2C into the on-policy default.

r_t

① Probability ratio

Importance weight

A2C: ∇log π·A on fresh data. PPO uses r_t = π_θ/π_{θ_old} against the policy that collected the batch — so it can train on slightly-old data.

clip

② Clipped trust region

Bounded step

A2C's one unbounded step can collapse. PPO clips r_t to [1−ε, 1+ε] and keeps the min, so the objective stops rewarding moves beyond ~ε. TRPO's trust-region idea, with first-order machinery only.

×10

③ Multi-epoch reuse

Reuse the batch

A2C: 1 update/batch, then discard. PPO: ~4–10 epochs of minibatch SGD per batch, which stays safe because the clip keeps π_θ near π_{θ_old}. Several times more learning per env step.

PPO's two networks — what goes in, what comes out

PPO is an actor-critic method: one network to act, one to judge. In code they usually share a torso and split into two heads, but conceptually they are two distinct functions — here is the name, input and output of each.

Actor — π_θ(a|s)

the policy we deploy

INPUT: state s, the observation vector (e.g. the 8 Lunar-Lander sensors)

OUTPUT: a distribution over actions, i.e. softmax π(a|s) for discrete, or Gaussian μ(s), σ(s) for continuous. We sample the action from it (never argmax).

Critic — V_φ(s)

the value baseline for the advantage

INPUT: state s, the same observation vector the actor sees

OUTPUT: one scalar V(s), the expected return from s. Feeds the GAE advantage Â_t, which weights every policy update.

Clip Noise

Heads-up — a different clip. PPO clips the probability ratio r = π/π_old; a sibling continuous-control method, TD3, clips Gaussian noise instead — same word, a very different job. Here is that other clip, up close. Target policy smoothing replaces the single target action with a small noisy neighbourhood around it: ã = μ_θ'(s′) + clip( 𝒩(0, σ̃), −c, +c ). Two ideas stacked: a Gaussian for the noise, and a clip to bound it. The graph shows both — drag the sliders to see how the clip folds the Gaussian's long tails back onto ±c.

The blue bell is the raw Gaussian noise; the teal area is the part kept unchanged; the coral spikes at ±c are the long tails folded back onto the boundary by the clip. Shrink c or grow σ̃ and watch more probability pile up at the edges.

Why add noise at all?

The critic is trained to fit Q at exactly the actor's target action. If that Q-surface has a sharp, spurious peak (function-approximation error), the deterministic actor steers straight into it. Sampling a little noise around the target action and averaging turns the target into a smooth local average of Q — a SARSA-like regulariser, so the policy can't exploit a one-pixel spike.

Why clip it?

A raw Gaussian has unbounded tails — every so often it would throw the target action far from μ(s′), into a totally different action regime and poison the target with nonsense. Clipping to [−c, +c] keeps the smoothing local and controlled: average over a small neighbourhood, never over wild outliers. Smooth — but only a little.
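The smoothing formula ã = μ_θ'(s′) + clip(𝒩(0, σ̃), −c, +c) can be sketched for a one-dimensional action. The defaults σ̃ = 0.2 and c = 0.5 match the slider values above; the function name is hypothetical, and a full implementation would typically also clip the result to the valid action range, which is left out here to mirror the formula as written.

```python
import random

def smoothed_target_action(mu_next, sigma=0.2, c=0.5, rng=random):
    """Target action plus clipped Gaussian noise (1-D sketch)."""
    noise = rng.gauss(0.0, sigma)
    clipped = max(-c, min(noise, c))   # bound the noise, not the probability ratio
    return mu_next + clipped

rng = random.Random(0)
samples = [smoothed_target_action(0.3, rng=rng) for _ in range(5)]
print(samples)
```

Every sample stays within c of the target action, so averaging Q over these samples smooths the target locally without ever reaching a distant action regime.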

What does clipping actually do?

The horizontal axis is the ratio r = π/π_old. The vertical axis is the PPO objective for one (s, a, A) sample.

When A > 0 (good action): the objective rises with r until r = 1+ε, then flattens. Even if the network could make the action much more probable, PPO gives no extra credit for it.

When A < 0 (bad action): the objective improves as r shrinks until r = 1−ε, then flattens. PPO gives no extra credit for pushing the action probability below (1−ε)·π_old.

The min ensures that the pessimistic bound applies — if the network tried to be optimistic, PPO clips it. The net effect: every update step is contained in a "trust region" of ratio space.

Multiple epochs per batch — the second PPO superpower

PPO training loop

1. Roll out N steps with π_{θ_old} on K parallel envs.
2. Compute advantages (typically with GAE).
3. For 4–10 epochs, mini-batch the rollout and step the loss.
4. π_{θ_old} ← π_θ. Resample.

A2C does one gradient step per batch of rollouts. PPO does many: 4–10 epochs of mini-batch steps over the same batch. Because the clip keeps the new π close to π_old, the rollouts stay roughly valid for the next epoch, and we squeeze several times more learning out of every batch of expensive environment interactions. This is a large part of why PPO is the modern default.

PPO — full pseudocode

Pseudocode · collect → GAE → optimise for E epochs

# ── INITIALISE ──────────────────────────────────────────
initialize actor πθ and critic Vφ                 # random θ, φ (often a shared torso + 2 heads)
πθ_old ← πθ
for iteration = 1, 2, … :
    # ── COLLECT: roll out the frozen πθ_old on K parallel envs ──
    D ← [ ]
    for N steps (× K envs):
        a ~ πθ_old(·|s)                           # sample; store logπ_old(a|s) and Vφ(s)
        s′, r, done ← env.step(a)
        D.append( (s, a, r, logπ_old, Vφ(s)) ) ; s ← s′
    # ── ADVANTAGE: GAE-λ straight from the critic ──────────
    δ_t ← r_t + γ·Vφ(s_t+1) − Vφ(s_t)             # one-step TD error
    Â_t ← Σ_l≥0 (γλ)^l · δ_t+l                    # generalised advantage estimate
    Ĝ_t ← Â_t + Vφ(s_t)                           # regression target for the critic
    # ── OPTIMISE: reuse the SAME batch for 4–10 epochs ─────
    for epoch = 1 … E:                            # ▲ new in PPO: reuse one batch E times
        for minibatch in D:
            r_t(θ) ← πθ(a_t|s_t) / πθ_old(a_t|s_t)            # probability ratio
            L_clip ← min( r_t·Â_t, clip(r_t,1−ε,1+ε)·Â_t )    # ▲ new in PPO: clipped trust region
            L_V ← ( Vφ(s_t) − Ĝ_t )²                          # critic MSE (↓)
            L ← −L_clip + c₁·L_V − c₂·H[πθ(·|s_t)]            # +entropy bonus for exploration
            θ, φ ← Adam step on ∇L
    πθ_old ← πθ                                   # freeze a fresh copy → resample next batch

Legend: nets used (πθ, πθ_old, Vφ, forward) · θ, φ updated (Adam on ∇L) · ▲ new in PPO: clipped surrogate + multi-epoch batch reuse

Two loops make PPO PPO. The outer loop collects a fresh batch and re-freezes πθ_old. The inner loop reuses that one batch for several epochs, which is safe only because the clip removes the incentive for πθ to drift far from πθ_old, so the off-policy ratio r_t stays well-behaved. The actor climbs L^CLIP, the critic regresses to Ĝ_t, and the entropy term keeps the distribution from collapsing too early.
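The advantage step is usually implemented as a single backward pass over the rollout. The sketch below is one common way to write it, under these assumptions: `values[t]` holds Vφ(s_t), `last_value` holds Vφ of the state after the final step, and `dones[t]` marks transitions that ended an episode (so nothing is bootstrapped across them). The function name and default γ, λ are illustrative choices.

```python
def gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalised advantage estimates and critic targets for one rollout."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; the bootstrap term is dropped at episode ends.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # A_t = delta_t + (gamma * lam) * A_{t+1}, reset at episode ends.
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]   # targets for the critic
    return advantages, returns

adv, ret = gae(rewards=[1.0, 1.0], values=[0.0, 0.0], last_value=0.0,
               dones=[False, True], gamma=1.0, lam=1.0)
print(adv, ret)
```

With λ = 1 the estimate reduces to the Monte-Carlo return minus the value, and with λ = 0 it reduces to the one-step TD error, so λ trades variance against bias between those two extremes.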

PPO vs A2C — learning curves

Simulated learning curves on a Lunar-Lander-class env. Both algorithms reach the same final return — but PPO does it in ~⅓ the environment steps. The clip prevents catastrophic updates, the multiple epochs squeeze more juice from each batch, and the entropy bonus + GAE keep exploration alive throughout.

PPO in ChatGPT — the RL behind RLHF

PPO isn't just for robots and games — it's the algorithm that aligned ChatGPT. After a language model is pretrained and supervised-fine-tuned, the final polish is RLHF — Reinforcement Learning from Human Feedback — and the optimiser at its heart is PPO (OpenAI's InstructGPT → ChatGPT recipe).

The three-stage RLHF pipeline

1 · SFT

Supervised fine-tune the pretrained LM on human-written demonstrations of good answers. This becomes the starting policy π_ref.

2 · Reward model

Humans rank several model answers to the same prompt; train a reward model r_φ to predict which answer a human prefers — a learned, automatic judge.

3 · PPO

Fine-tune the LM with PPO to maximise that reward model's score — pushing the policy toward answers humans rate highly.

How generating text becomes an RL problem

RL concept	…in ChatGPT's RLHF
Policy π_θ (actor)	the language model itself — it "acts" by emitting tokens
State s	the prompt + every token generated so far
Action a	the next token sampled from the vocabulary (a ~50k-way discrete action)
Episode / trajectory	generating one full response, token by token
Reward r	the reward model's score of the finished answer, minus a per-token KL penalty
Critic V_φ	a value head on the LM estimating expected reward (feeds the advantage / GAE)

RLHF objective — PPO on a KL leash

maximise E [ r_φ(prompt, answer) ] − β · KL( π_θ ‖ π_ref )
   first term: the reward model's score · second term: stay close to the SFT model
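One common way to feed this objective to PPO is to fold the KL term into the per-token reward: every token pays a penalty proportional to log π_θ − log π_ref, and the reward model's score is added on the final token. The sketch below assumes that shaping and uses made-up log-probabilities; the function name and the value of β are illustrative, not taken from any specific implementation.

```python
def rlhf_token_rewards(reward_model_score, logp_policy, logp_ref, beta=0.1):
    """Per-token rewards: KL penalty on every token, score on the last one.

    logp_policy[t] and logp_ref[t] are the log-probabilities that the current
    policy and the frozen reference model assign to the sampled token t.
    """
    # Sample-based KL estimate per token: log pi_theta - log pi_ref.
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += reward_model_score   # the judge scores only the finished answer
    return rewards

rewards = rlhf_token_rewards(
    reward_model_score=2.0,
    logp_policy=[-1.0, -2.0],   # assumed values for a two-token answer
    logp_ref=[-1.5, -2.0],
    beta=0.5,
)
print(rewards)
```

A token the policy now prefers more strongly than the reference model did costs reward, so the policy only drifts from the SFT model where the reward model's score makes it worthwhile.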

Why PPO here — and exactly when it runs

When: the final training stage, after pretraining and SFT. This is the alignment step that turns a raw next-token predictor into an assistant that follows instructions, refuses harmful requests, and sounds helpful.

Why PPO and not vanilla policy gradient: the clip plus the explicit KL penalty keep the fine-tuned model from drifting far from the SFT model — without that leash the policy "reward-hacks" the reward model and collapses into repetitive gibberish that scores high but reads terribly. PPO's multi-epoch reuse also squeezes maximum learning from each batch of expensive reward-model-scored generations, and it stays stable at billion-parameter scale.

Note: newer alignment methods (e.g. DPO) skip the explicit PPO loop by optimising the preference data directly — but PPO was the original recipe behind InstructGPT and ChatGPT, and is still widely used for RLHF.

DDPG — Deep Deterministic Policy Gradient (Lillicrap et al., 2015)

📄 Original paper — Lillicrap et al. (2015), Continuous Control with Deep Reinforcement Learning.

Everything so far was on-policy: roll out, learn, throw away. That's expensive when each environment step costs minutes (real robots, simulators). For continuous action spaces, can we go back to the DQN-style sample-efficient world of replay buffers and target networks?

DDPG says yes. The trick is to learn a deterministic policy μ_θ(s) = a — no distribution, just a function from state to a chosen action. Then the Q-function's gradient w.r.t. the action becomes the policy's gradient.

DDPG's two networks — what goes in, what comes out

DDPG is actor-critic for continuous control. The crucial change from a stochastic policy: the actor is deterministic, so it outputs one action, not a distribution. Like DQN, the critic learns an action-value Q(s,a); unlike DQN, which emits one Q-value per discrete action, it scores a single state-action pair, so its input includes the action.

Actor — μ_θ(s)

deterministic policy

INPUT: state s, the observation vector

OUTPUT: one deterministic action a = μ(s) (e.g. a tanh-scaled torque). No distribution; exploration noise is added on top at act-time.

Critic — Q_φ(s,a)

action-value, DQN-style

INPUT: state s and action a, concatenated

OUTPUT: one scalar Q(s,a). Its gradient ∇_aQ is what pushes the actor (chain rule) toward better actions.

Deterministic policy gradient

networks:
    μ_θ(s) → a            // deterministic actor
    Q_φ(s, a) → ℝ         // critic

critic loss:      L_Q = E [ ( r + γ Q_φ'(s', μ_θ'(s')) − Q_φ(s,a) )² ]        // DQN-style
actor gradient:   ∇_θ J = E [ ∇_θ μ_θ(s) · ∇_a Q_φ(s,a) |_a=μ_θ(s) ]         // chain rule!
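To make the chain rule concrete, here is a toy 1-D sketch. It assumes a linear actor μ_θ(s) = θ·s and a hand-written, already-correct critic Q(s,a) = −(a − 2s)², so the best action is a = 2s and the actor should converge to θ = 2. Both functions and all numbers are hypothetical; in DDPG the critic is learned, not given.

```python
def actor(theta, s):
    """Deterministic linear policy: mu_theta(s) = theta * s."""
    return theta * s

def dq_da(s, a):
    """Gradient of the toy critic Q(s, a) = -(a - 2s)^2 with respect to a."""
    return -2.0 * (a - 2.0 * s)

def dpg_step(theta, states, lr=0.1):
    """One ascent step on E[ d(mu)/d(theta) * dQ/da evaluated at a = mu(s) ]."""
    # For the linear actor, d(mu)/d(theta) = s.
    grad = sum(s * dq_da(s, actor(theta, s)) for s in states) / len(states)
    return theta + lr * grad

theta = 0.0
states = [-1.0, 0.5, 1.0]
for _ in range(200):
    theta = dpg_step(theta, states)
print(theta)  # approaches 2.0
```

The actor never sees a reward directly: it only follows the slope of Q with respect to the action, pushed back through its own parameters.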

DDPG's three methods

Because the policy is deterministic and the updates use bootstrapped Q-targets, the data doesn't have to come from the current policy — so DDPG can borrow three off-policy stabilisers. Tap each card for how it works.

↺

Replay buffer

Reuse old transitions

Keep a giant buffer of past (s, a, r, s′) and sample random mini-batches — off-policy, so every transition trains the nets many times.

τ·θ

Target networks

Slow-moving copies

Slowly-updated copies of both actor and critic give a stable regression target. Polyak averaging, not hard copies — τ ≈ 0.005 per step.

𝒩(0,σ)

Exploration noise

a = μθ(s) + 𝒩(0,σ)

A deterministic policy has no built-in randomness, so DDPG adds external noise (Ornstein-Uhlenbeck or Gaussian) — during training only; at eval, use the deterministic action.
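The three stabilisers above are small pieces of code in practice. The following is a minimal standard-library sketch, with target-network weights represented as flat lists of floats; the class and function names are hypothetical, and clipping the noisy action to the valid range is an added assumption rather than something the cards specify.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of past transitions; the oldest entries fall out first."""
    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)

    def add(self, transition):
        self.data.append(transition)

    def sample(self, batch_size, rng=random):
        # Uniform random minibatch, which breaks temporal correlation.
        return rng.sample(list(self.data), batch_size)

def polyak_update(target, online, tau=0.005):
    """Soft target update: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * wt for w, wt in zip(online, target)]

def noisy_action(mu, sigma, low, high, rng=random):
    """Deterministic action plus Gaussian exploration noise, clipped to bounds."""
    return max(low, min(high, mu + rng.gauss(0.0, sigma)))

buf = ReplayBuffer(capacity=3)
for i in range(5):
    buf.add(i)               # integers stand in for (s, a, r, s', done) tuples
print(list(buf.data))        # only the 3 most recent entries remain
```

With τ = 0.005 a target weight moves only 0.5% of the way toward the online weight per step, which is what keeps the regression target slow-moving.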

DDPG — full pseudocode

Pseudocode · off-policy, one update per environment step

# ── INITIALISE ──────────────────────────────────────────
initialize actor μθ and critic Q_φ                # random θ, φ
θ′ ← θ ; φ′ ← φ                                   # target nets = exact copies
replay buffer 𝓓 ← ∅

for each environment step:
    # ── ACT with exploration noise & STORE ─────────────
    a ← μθ(s) + 𝒩(0, σ)                           # ▲ new in DDPG: deterministic action + Gaussian exploration noise
    s′, r, done ← env.step(a)
    𝓓.append( (s, a, r, s′, done) ) ; s ← s′      # reset the env when done

    # ── LEARN from a random minibatch ──────────────────
    sample B = {(s, a, r, s′, done)} ~ 𝓓
    y ← r + γ · (1 − done) · Q_φ′(s′, μθ′(s′))    # bootstrapped target (uses target nets); no bootstrap past a terminal state
    L_Q ← ( Q_φ(s,a) − y )²                       # critic regression (↓)
    φ ← Adam(∇L_Q)

    # actor: push the action toward higher Q (chain rule)
    L_μ ← −Q_φ(s, μθ(s))                          # ▲ new in DDPG: deterministic policy gradient ∇θμ·∇aQ
    θ ← Adam(∇L_μ)

    # ── Polyak soft-update of BOTH target nets ─────────
    φ′ ← τ·φ + (1−τ)·φ′ ; θ′ ← τ·θ + (1−τ)·θ′     # τ ≈ 0.005

nets used: μθ, Q_φ + targets (forward)
θ, φ updated: Adam · Polyak targets
▲ new in DDPG: deterministic actor trained through the critic (chain rule) + replay/target reuse

The whole loop runs off-policy: every environment step appends one transition to the replay buffer and does one update sampled from the entire history. The critic regresses to a bootstrapped target built from the target networks (the slow copies θ′, φ′), and the actor simply walks uphill on Q_φ(s, μθ(s)) — the chain rule turns ∇_aQ into a gradient on θ. The fragile part is the 𝒩(0,σ) exploration noise bolted onto the action — exactly what SAC replaces with built-in entropy.

Watch DDPG learn — continuous-control tasks

Continuous actions are everywhere: a spaceship's thrusters, a steering wheel + throttle, a push force, a joint torque. Pick a task in the tabs below and an episode budget, and watch the deterministic actor μ(s) go from flailing to fluent — the same algorithm, just a different continuous output.

DDPG's two networks, live this step: the actor μ(s) maps the state → the action; the critic scores the pair as Q(s,a), so its input is the state and the action. Inputs are colour-coded — blue = state s, gold = action a. Use ⏭ Step to advance one step and read the numbers off.

TD3 — Twin Delayed DDPG (Fujimoto et al., 2018)

📄 Original paper — Fujimoto, van Hoof & Meger (2018), Addressing Function Approximation Error in Actor-Critic Methods.

DDPG is powerful but brittle: its single critic systematically over-estimates Q-values, the actor then exploits those bogus peaks, and training collapses. TD3 keeps DDPG's deterministic-actor backbone and fixes it with three small, surgical tricks — it's "DDPG done right".

TD3's networks — what goes in, what comes out

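TD3 keeps DDPG's deterministic actor but doubles the critic: two Q-nets with the same architecture and independent weights (plus target copies of all three networks). The actor's input/output are unchanged from DDPG. The architectural fix lives in the critics, whose min curbs the overestimation bias; the other two tricks change the target and the update schedule, not the networks.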

Actor — μ_θ(s)

deterministic policy (same as DDPG)

INPUT: state s, the observation vector

OUTPUT: one deterministic action a = μ(s). Updated only every d steps (delayed); at the target, clipped noise is added for smoothing.

Twin Critics — Q_φ1, Q_φ2(s,a) ×2

two Q-nets; target uses their min

INPUT: state s and action a, concatenated

OUTPUT: one scalar Q(s,a) per net. The TD target takes min(Q₁, Q₂) to under-shoot and cancel optimism bias.

The TD3 target & updates

target action:   ã = μ_θ'(s') + clip(ε, −c, +c),   ε ~ 𝒩(0, σ)          // ① policy smoothing
target value:    y = r + γ · min( Q_φ1'(s', ã), Q_φ2'(s', ã) )          // ② clipped double-Q
critics:         minimise ( Q_φi(s,a) − y )²   for i = 1, 2
actor:           θ ← θ + α ∇_θ Q_φ1(s, μ_θ(s))   every d steps           // ③ delayed update
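The target computation can be sketched as a single function. This is a scalar-action illustration in which the target actor and target critics are passed in as plain callables. The terminal mask `(1 - done)` and the clamp of ã to the action range are additions not shown in the formula above, and `sigma = 0.2`, `c = 0.5` are assumed defaults for the example.

```python
import random

def td3_target(r, done, s_next, actor_targ, q1_targ, q2_targ,
               gamma=0.99, sigma=0.2, c=0.5, a_low=-1.0, a_high=1.0,
               rng=random):
    """TD3 target: clipped noise on the target action, then the twin-critic min."""
    eps = max(-c, min(c, rng.gauss(0.0, sigma)))                  # clip(eps, -c, +c)
    a_tilde = max(a_low, min(a_high, actor_targ(s_next) + eps))   # assumed range clamp
    q_min = min(q1_targ(s_next, a_tilde), q2_targ(s_next, a_tilde))
    return r + gamma * (1.0 - done) * q_min                       # assumed terminal mask

# Toy target networks: constant actor, two critics that disagree.
y = td3_target(1.0, 0.0, None,
               actor_targ=lambda s: 0.5,
               q1_targ=lambda s, a: 10.0,
               q2_targ=lambda s, a: 8.0)
print(y)  # uses the smaller critic estimate, 8.0
```

Both critics regress to the same y, so the pessimistic estimate propagates into both networks.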

Three tricks, one stable algorithm

𝒩 → ±c

① Target policy smoothing

Clipped noise on the target action

Add Gaussian noise, clipped to ±c, to the target action μ_θ'(s') inside the TD target. Evaluating Q at nearby actions keeps the critic from latching onto a sharp spurious peak. This is a regulariser on the target, separate from the exploration noise added when acting.

min(Q₁,Q₂)

② Twin critics

Clipped double-Q

Two critics; the TD target takes the smaller one. Under-shooting on purpose counters DDPG's optimism bias so the actor can't chase phantom value.

every d

③ Delayed updates

Let the value settle

Update actor & targets only every d ≈ 2 critic steps. ⚠ With twin critics and a target copy of every network, TD3 keeps 6 nets: actor + twin critics + a target copy of each (3 live + 3 targets).

TD3 — full pseudocode

Pseudocode · DDPG + the three tricks (① smoothing ② twin-min ③ delay)

# ── INITIALISE ──────────────────────────────────────────
initialize actor μθ, critics Q_φ1, Q_φ2           # random θ, φ1, φ2
θ′ ← θ ; φ1′ ← φ1 ; φ2′ ← φ2                      # target nets = exact copies
replay buffer 𝓓 ← ∅ ; t ← 0

for each environment step:
    a ← μθ(s) + 𝒩(0, σ)                           # explore
    s′, r, done ← env.step(a)
    𝓓.append( (s, a, r, s′, done) ) ; s ← s′ ; t ← t + 1
    sample B = {(s, a, r, s′, done)} ~ 𝓓

    # ① target policy smoothing: clipped noise on target action
    ã ← μθ′(s′) + clip( 𝒩(0, σ̃), −c, +c )         # ▲ trick ①

    # ② clipped double-Q: take the MIN of the two target critics
    y ← r + γ · (1 − done) · min( Q_φ1′(s′,ã), Q_φ2′(s′,ã) )   # ▲ trick ②; no bootstrap past a terminal state
    L_Qi ← ( Q_φi(s,a) − y )²   for i=1,2         # update BOTH critics EVERY step (↓)
    φ1, φ2 ← Adam(∇L_Qi)

    # ③ delayed updates: actor & targets only every d steps
    if t mod d == 0:                              # ▲ trick ③ (delay)
        L_μ ← −Q_φ1(s, μθ(s))                     # actor follows critic 1 only
        θ ← Adam(∇L_μ)
        φ1′ ← τ·φ1 + (1−τ)·φ1′                    # Polyak soft-update
        φ2′ ← τ·φ2 + (1−τ)·φ2′
        θ′ ← τ·θ + (1−τ)·θ′

nets used: μθ, Q_φ1, Q_φ2 + targets
θ, φ updated: Adam · Polyak
▲ new in TD3: ① target smoothing ② clipped double-Q (min) ③ delayed actor/target updates

Line for line it is DDPG — with three surgical additions. ① The target action gets clipped noise so the critic can't latch onto a sharp spurious peak. ② The TD target uses min(Q₁,Q₂), deliberately under-shooting to cancel DDPG's optimism. ③ The actor and all target nets update only every d (≈2) critic steps, letting the value estimate settle before the policy chases it. Everything else — replay buffer, bootstrapped targets, Polyak averaging — is inherited unchanged.

SAC — Soft Actor-Critic (Haarnoja et al., 2018)

📄 Original paper — Haarnoja et al. (2018), Soft Actor-Critic: Off-Policy Maximum Entropy Deep RL with a Stochastic Actor.

DDPG works but is famously unstable — a bad seed can ruin a run. The diagnosis: exploration is bolted on as external noise, fighting the deterministic policy. SAC takes a different approach: build exploration into the objective itself.

The maximum-entropy RL objective

classic:   J(π) = E [ Σ γ^t · r_t ]
max-ent:   J(π) = E [ Σ γ^t · ( r_t + α · H[π(·|s_t)] ) ]
                                  └──── entropy bonus, every step ────┘

"act as well as possible, and remain as random as possible while doing it"
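A small numeric sketch shows how the entropy bonus changes the return of one trajectory. It assumes a 1-D Gaussian policy, whose differential entropy is ½·log(2πe·σ²); the function names and the sample numbers are hypothetical.

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy of a 1-D Gaussian with standard deviation sigma."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

def soft_return(rewards, entropies, alpha, gamma):
    """Discounted sum of (r_t + alpha * H_t) along one trajectory."""
    return sum(gamma ** t * (r + alpha * h)
               for t, (r, h) in enumerate(zip(rewards, entropies)))

rewards = [1.0, 1.0]
wide = [gaussian_entropy(1.0)] * 2     # exploratory policy
narrow = [gaussian_entropy(0.1)] * 2   # nearly deterministic policy
print(soft_return(rewards, wide, alpha=0.2, gamma=0.99))
print(soft_return(rewards, narrow, alpha=0.2, gamma=0.99))
```

For identical rewards, the wider policy scores a higher max-ent return, which is the pressure that keeps σ from collapsing. Setting α = 0 recovers the classic objective.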

SAC's key ideas

α·H[π]

Entropy bonus

Exploration, by design

Objective = expected return + α·H[π]. Paying the agent for being random (high entropy) keeps it exploring and stops σ collapsing to 0. Exploration is built into the objective, not bolted on.

2 × Q

Twin critics

Take the min

Learn two Q-networks and use min(Q₁, Q₂) as the target. The smaller estimate under-shoots, killing the overestimation bias (borrowed from TD3).

α^*

Auto-tuned α

Learn the temperature

α isn't a hand-set knob; it's a learned parameter, gradient-descended to hold the policy's entropy at a target H_target = −|A|, where |A| is the number of action dimensions. Typically high early, low late.

μ,σ → a

Actor output → action

Squashed Gaussian

The actor outputs a mean μ(s) and std σ(s). Sample noise ε ∼ 𝒩(0,1), form n = μ + σ·ε, then squash: a = tanh(n) — into the valid action range. (Reparam trick lets ∇ flow through the sample.)
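The squashed-Gaussian card can be sketched for a single action dimension as follows. The log-probability includes the change-of-variables correction for tanh, −log(1 − a²), which the card does not spell out; it is added here because the actor and critic losses need log π(a|s). The function name and the small constant `1e-6` (for numerical safety) are assumptions.

```python
import math
import random

def sample_squashed(mu, sigma, rng=random):
    """Reparameterised sample a = tanh(mu + sigma * eps) and its log-density."""
    eps = rng.gauss(0.0, 1.0)        # noise independent of the parameters
    n = mu + sigma * eps             # pre-squash Gaussian sample
    a = math.tanh(n)                 # squashed into (-1, 1)
    # log N(n; mu, sigma^2), written in terms of eps
    log_normal = -0.5 * eps ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
    # tanh change of variables: subtract log|da/dn| = log(1 - tanh(n)^2)
    log_prob = log_normal - math.log(1.0 - a ** 2 + 1e-6)
    return a, log_prob

action, logp = sample_squashed(mu=0.3, sigma=0.5)
print(action, logp)
```

Because the randomness enters only through eps, the action is a differentiable function of μ and σ, which is what lets an autodiff framework pass gradients through the sample.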

SAC's networks — what goes in, what comes out

SAC trains three learnable networks: one stochastic actor and two identical critics (plus slow-moving target copies of the critics). Note the key difference from PPO: the SAC critic takes both the state and the action as input — it scores a specific (s, a) pair, not just a state.

Actor — π_θ(a|s)

stochastic Gaussian + tanh policy

INPUT: state s, the observation vector

OUTPUT: a Gaussian's mean μ(s) and std σ(s); the action is the reparametrised, squashed sample a = tanh(μ + σ·ε), ε ∼ 𝒩(0,1)

Twin Critics — Q_φ1, Q_φ2(s,a) ×2

two identical Q-nets; target uses their min

INPUT: state s and action a, concatenated; it scores a specific pair

OUTPUT: one scalar Q(s,a), the soft action-value. Two nets run in parallel; SAC trains on min(Q₁, Q₂) to fight overestimation.

Network architectures — SAC runs in discrete and continuous

SAC works in both the continuous-action and the discrete-action setting — the body of the nets is the same MLP, only the actor/critic heads change.

In the discrete-action setting

The actor takes a state and returns probabilities over actions p(a_i|s).
The critic takes a state and returns one Q-value per action Q(s, a_i).
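With a finite action set, expectations over the policy can be computed exactly instead of sampled. As a hedged sketch of that idea (an assumption about how a discrete variant would use the two heads above, not something this lab specifies), the soft state value is the policy-weighted sum of Q minus the α-scaled log-probability:

```python
import math

def soft_state_value(probs, q_values, alpha):
    """V(s) = sum_i pi(a_i|s) * ( Q(s, a_i) - alpha * log pi(a_i|s) )."""
    return sum(p * (q - alpha * math.log(p))
               for p, q in zip(probs, q_values) if p > 0.0)

probs = [0.5, 0.5]   # actor head: probabilities over two actions
q = [1.0, 3.0]       # critic head: one Q-value per action
print(soft_state_value(probs, q, alpha=0.0))  # plain expected Q
print(soft_state_value(probs, q, alpha=1.0))  # expected Q plus the entropy of pi
```

No reparameterisation is needed in this setting because nothing is sampled inside the expectation.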

In the continuous-action setting

The critic takes a state and an action vector → a scalar Q-value Q_φ(s, a).
The actor needs a distribution parameterisation. SAC uses a squashed Gaussian: a = tanh(n), n ∼ 𝒩(μ_θ, σ_θ²).
So the actor returns μ_θ and σ_θ.

Updates — the whole loop in one box

SAC inner loop

sample (s, a, r, s') from replay buffer

critic target:   y = r + γ · ( min( Q₁^tg(s', ã'), Q₂^tg(s', ã') ) − α · log π(ã'|s') ),   where ã' ∼ π_θ(·|s')
critic loss:     L_{Q_i} = E [ ( y − Q_i(s,a) )² ]
actor loss:      L_π = E [ α · log π(ã|s) − min( Q₁(s, ã), Q₂(s, ã) ) ],   with ã = tanh(μ + σ·ε), ε ∼ 𝒩(0,1)
α loss:          L_α = E [ −α · ( log π(ã|s) + H_target ) ]       // auto-tune temperature
targets:         φ_i^tg ← τ φ_i + (1−τ) φ_i^tg
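The temperature update is the least familiar line in the box, so here is a minimal sketch of one plain gradient step on L_α. It treats the sampled log-probabilities as constants, so that ∂L_α/∂α = −(log π + H_target), averaged over the batch. The function name, the learning rate, and the small positive floor on α are assumptions for illustration.

```python
def alpha_step(alpha, log_probs, h_target, lr=0.01):
    """One gradient-descent step on L_alpha = E[ -alpha * (log_pi + H_target) ]."""
    grad = -sum(lp + h_target for lp in log_probs) / len(log_probs)
    return max(alpha - lr * grad, 1e-8)   # keep the temperature positive

h_target = -1.0   # e.g. one action dimension
# Entropy estimate is -mean(log_pi). Here it is -2, below the target: alpha rises.
print(alpha_step(0.2, log_probs=[2.0, 2.0], h_target=h_target))
# Here it is 0, above the target: alpha falls.
print(alpha_step(0.2, log_probs=[0.0, 0.0], h_target=h_target))
```

When the policy is less random than the target, α grows and the entropy term weighs more in the actor loss; when it is more random than needed, α shrinks.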

SAC — full pseudocode

Pseudocode · off-policy, one update per environment step

# ── INITIALISE ──────────────────────────────────────────
initialize actor πθ, critics Q_φ1, Q_φ2           # random θ, φ1, φ2
φ1^tg ← φ1 ; φ2^tg ← φ2                           # target critics = exact copies
initialize temperature α                          # learnable; H_target = −|A| (|A| = action dimensions)
replay buffer 𝓓 ← ∅

for each environment step:
    # ── ACT & STORE (off-policy) ───────────────────────
    a ~ πθ(·|s)                                   # a = tanh(μ + σ·ε), ε ∼ 𝒩(0,1)
    s′, r, done ← env.step(a)
    𝓓.append( (s, a, r, s′, done) ) ; s ← s′

    # ── LEARN from a random minibatch ──────────────────
    sample B = {(s, a, r, s′, done)} ~ 𝓓
    ã′ ~ πθ(·|s′) ; logπ′ ← log πθ(ã′|s′)

    # critic target: twin-min minus entropy
    y ← r + γ·(1 − done)·( min(Q_φ1^tg(s′,ã′), Q_φ2^tg(s′,ã′)) − α·logπ′ )   # ▲ entropy term; no bootstrap past a terminal state
    L_Qi ← ( Q_φi(s,a) − y )²   for i=1,2         # regress BOTH critics (↓)

    # actor: maximise Q while staying random
    ã ~ πθ(·|s) ; logπ ← log πθ(ã|s)
    L_π ← α·logπ − min(Q_φ1(s,ã), Q_φ2(s,ã))      # ▲ entropy-regularised actor

    # temperature: hold entropy at the target
    L_α ← −α·( logπ + H_target )                  # ▲ new in SAC: auto-tuned α (logπ treated as a constant)

    φi ← Adam(∇L_Qi) ; θ ← Adam(∇L_π) ; α ← Adam(∇L_α)
    φi^tg ← τ·φi + (1−τ)·φi^tg                    # Polyak soft update, τ ≈ 0.005

nets used: πθ, Q_φ1, Q_φ2 + targets
updated: θ, φ1, φ2, α · Polyak
▲ new in SAC: max-entropy objective (α·log π) + auto-tuned temperature α + stochastic reparam actor

Unlike PPO's collect-then-optimise rhythm, SAC interleaves one gradient update per environment step and learns from a giant replay buffer of old transitions — that off-policy reuse is why it is so sample-efficient. Each step updates four things: both critics regress to the entropy-augmented target y, the actor climbs min(Q₁,Q₂) while keeping its entropy high, and the temperature α self-tunes to hold that entropy at H_target. The target critics trail behind by Polyak averaging.

SAC in action — watch entropy & α fall as it learns

The same continuous-control tasks, now under SAC. Drag the episodes slider: early on the policy is very random (high entropy H[π], high temperature α) — late in training the auto-tuned α drops and the policy concentrates on good actions, without ever fully collapsing. The two meters track the slider in real time.

Auto-tuned α holds entropy near a target H_target = −|A|; both fall as the episode count climbs — exploration → exploitation.

SAC's networks live this step — the stochastic actor π(a|s) outputs μ & σ (a squashed Gaussian, a = tanh(μ+σ·ε)); the twin critics Q₁,Q₂(s,a) score it (target = their min). blue = state, gold = action. σ shrinks as the episode slider climbs. Use ⏭ Step.

DDPG vs TD3 vs SAC at a glance

	DDPG	TD3	SAC
Policy	Deterministic μ(s)	Deterministic μ(s)	Stochastic Gaussian + tanh
Exploration	External noise (fragile)	External noise + target smoothing	Built into objective
Critics	1 Q-network	2 (clipped min)	2 (clipped min)
Overestimation fix	—	Twin-min + target smoothing	Twin-min + entropy
Actor update cadence	Every step	Delayed (every d steps)	Every step
Temperature	—	—	Auto-tuned α
Stability across seeds	Notoriously brittle	Much improved	Robust
Practical default for continuous control	Historical	Strong (deterministic)	Current default

The family tree — how every algorithm descends from one idea

Two trunks grow from a single root. The value-based line (from the DQN lab) learns Q(s,a) and acts greedily; the policy-gradient line (this lab) learns π(a|s) directly. They meet in the middle for continuous control, where DDPG borrows DQN's replay/target tricks and bolts them onto an actor — then TD3 and SAC harden it. Solid arrows = direct successor; dashed = an idea borrowed across families; gold = today's go-to default.

Legend: Value-based (DQN lab) · Policy gradient (this lab) · Continuous control (this lab) · Current default / SOTA · solid arrow = direct successor · dashed arrow = borrowed idea

All nine algorithms, side by side

📚 Further reading — OpenAI Spinning Up in Deep RL · Sutton & Barto, Reinforcement Learning: An Introduction.

Algorithm	Year	On / off-policy	Actor	Critic	Action space	Key idea
DQN	2013	Off	—	Q(s,a)	Discrete	Deep Q-learning + replay
REINFORCE	1992	On	π(a|s)	—	Discrete	Monte-Carlo policy gradient
REINFORCE+baseline	~1995	On	π(a|s)	V(s)	Discrete	Variance reduction by V baseline
A2C / Actor-Critic	2016	On	π	V	Discrete	n-step advantage, parallel envs
A3C	2016	On	π	V	Discrete	Async workers, lock-free
PPO	2017	On	π	V	Both	Clipped ratio, multi-epoch reuse
DDPG	2015	Off	μ(s) deterministic	Q(s,a)	Continuous	Chain-rule policy gradient
TD3	2018	Off	μ(s) deterministic	2 × Q	Continuous	Twin Q + delayed updates + target smoothing
SAC	2018	Off	Stochastic π	2 × Q	Continuous	Max-entropy + twin Q + auto α

Playground — same policy, two environments

Pick an algorithm and an environment. The policy used is hand-coded to mimic what that algorithm would actually learn (random → competent → optimal). Both envs are continuous-action: Lunar Lander uses a 2-D Gaussian (main + side thrust), Pendulum uses a 1-D Gaussian (torque). Every algorithm here can handle both — that's the whole point of policy gradients.

Drag the Train ep slider to see what the policy looks like at different stages of training. The earlier the episode, the more random the policy.
