AI-Lab
Deep RL · Policy Optimisation

Actor-Critic & Policy Gradients

The natural sequel to the Deep RL lab. DQN learns a value and acts greedily — but what if the action space is continuous, or the optimal policy is stochastic? Meet REINFORCE, A2C, A3C, PPO, DDPG and SAC.

Continues from Deep RL · DQN / PER / DDQN / DDDQN
DQN — a value network: state in, one Q-value per action out
state s  →  hidden · ReLU  →  Q(s,a) for every action  →  action = argmax_a Q(s,a)
This is the network the Deep RL lab ended on. The input (highlighted) is the state s; each output neuron is the estimated value Q(s,a) of taking that action. To act, DQN is deterministic: it simply takes the argmax — the action with the highest Q-value (gold ring). No sampling, no distribution — one greedy choice. That single design decision is exactly what breaks down on the next page.
DQN was great — until it wasn't

The Deep RL lab ended at DDDQN: a Q-network that learns Q(s,a) for every action and picks argmax_a Q(s,a). That works beautifully on Atari (18 discrete buttons) and Cartpole (push left / push right). But the moment we touch a robot arm, a self-driving car, or anything with a steering wheel and a throttle, DQN breaks.

The trouble is the argmax. With a discrete action set you can enumerate every action and pick the largest Q. With a continuous action — a torque in [-2, +2] Newton-metres, or a steering angle in [-30°, +30°] — there are infinitely many actions. You cannot enumerate them.

Inside πθ — the policy network, end to end
state s  →  hidden · ReLU  →  softmax  →  πθ(a|s)   (the policy network)
It's the same MLP you used for Q(s,a) in the DQN lab; only the head changes. The input (highlighted) is the state s; the last layer emits raw logits that a softmax squeezes into a probability distribution where each output neuron is one πθ(a_i|s) and they all sum to 1. We don't take an argmax. We sample an action from this distribution, which is exactly what lets the policy stay stochastic and differentiable.
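A minimal sketch of that softmax head, using only the Python standard library. The logits are made-up numbers standing in for the last layer's output, and `softmax` / `sample_action` are illustrative helper names, not part of any lab code.

```python
import math
import random

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1 (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng):
    """Draw one action index from the categorical distribution `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]      # hypothetical output of the last layer
probs = softmax(logits)       # one probability per action
rng = random.Random(0)        # fixed seed so the run is repeatable
counts = [0, 0, 0]
for _ in range(10_000):
    counts[sample_action(probs, rng)] += 1   # sample, never argmax
```

Every action is chosen some of the time, in proportion to its probability, whereas an argmax would pick action 0 on every call.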
Three places DQN cannot reach
Continuous actions
Steering · torque · throttle
argmax over an infinite set is not tractable. You'd need a continuous optimiser inside every action call.
π(·|s)
Stochastic Policy
Rock-paper-scissors
A deterministic policy gets exploited. The optimal solution is a 1/3, 1/3, 1/3 mix — DQN cannot represent that.
∂π/∂θ
Direct optimisation
Skip the value
If you only care about acting well, why fit a whole Q-table? Differentiate the return through the policy and walk uphill.

π(·|s) 1 · Stochastic Policy — when two states look identical

Frozen Lake · perceptual aliasing

Suppose the agent can't see its absolute position — it only senses its neighbouring tiles. On this lake the two purple tiles have the exact same view (ice on the left, ice on the right): they are aliased. To the network they are the same state.

A deterministic policy maps that one view to one action. But the goal is to the right of the left tile and to the left of the right tile — opposite directions! Whichever it commits to, one twin skates straight into a hole, doomed with 100% certainty.

A stochastic policy goes left/right 50/50 from that view, so neither start is doomed — exploration carries it to the goal from both. Two bonuses fall out for free:

  • We get an exploration / exploitation trade-off baked into the policy — no ε-greedy schedule to hand-tune.
  • We get rid of the perceptual-aliasing problem: identical observations can still map to a distribution of actions.
Two agents start on the aliased (purple) tiles — identical view.

✊ ✋ ✌  A second face — Rock · Paper · Scissors

Stochastic Policy · the ⅓-⅓-⅓ Nash mix

Switch domains entirely: you're playing repeated Rock-Paper-Scissors against an opponent who watches your habits and best-responds. Any deterministic policy — “always ✊” — is instantly exploited: the opponent just plays ✋ forever and you lose every round.

The only unexploitable play is the mixed policy ⅓ ✊, ⅓ ✋, ⅓ ✌ — the game's Nash equilibrium. There is no single best action to argmax; the optimum is a distribution. A value-greedy DQN literally cannot represent that — a stochastic policy gets it for free.
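The exploitability claim can be checked with a few lines of arithmetic. This sketch assumes the usual zero-sum scoring (+1 win, 0 draw, −1 loss); the payoff matrix and function name are illustrative.

```python
# Payoff to us for (our move, opponent move); 0 = rock, 1 = paper, 2 = scissors.
PAYOFF = [[0, -1, 1],
          [1, 0, -1],
          [-1, 1, 0]]

def best_response_value(our_policy):
    """Our expected payoff per round when the opponent plays its best counter."""
    values = []
    for opp in range(3):
        values.append(sum(p * PAYOFF[me][opp] for me, p in enumerate(our_policy)))
    return min(values)   # the opponent picks the move that hurts us most

always_rock = [1.0, 0.0, 0.0]
uniform = [1 / 3, 1 / 3, 1 / 3]

worst_case_deterministic = best_response_value(always_rock)
worst_case_uniform = best_response_value(uniform)
```

Under these assumptions the deterministic policy loses a full point per round against a best-responding opponent, while the uniform mix cannot be pushed below zero.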

Opponent best-responds to your move frequencies.

∂π/∂θ 2 · Direct optimisation — walk uphill on the return

Skip the value · gradient ascent on J(θ)

Value methods fit the whole Q-surface and then act greedily — a lot of machinery if all you really want is to act well. Policy gradient differentiates the expected return through the policy and simply steps: θ ← θ + α ∇θ J(θ).

The heat-map is J(θ) over two policy parameters — brighter is higher return. The ball follows the gradient straight uphill to a peak. The honest catch (it's in the table just below): it converges to a local optimum, not always the global one. Hit new start θ a few times and watch it sometimes settle on the smaller hill.
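A one-dimensional toy version of the same experiment, as a hedged sketch: the return surface `J` below is invented for illustration (a tall hill at θ = +2 and a smaller one at θ = −2), not the surface behind the heat-map.

```python
import math

def J(theta):
    """Toy return surface: tall hill at +2, smaller hill at -2 (illustrative)."""
    return math.exp(-(theta - 2.0) ** 2) + 0.5 * math.exp(-(theta + 2.0) ** 2)

def grad_J(theta):
    """Analytic derivative of J with respect to theta."""
    return (-2.0 * (theta - 2.0) * math.exp(-(theta - 2.0) ** 2)
            - (theta + 2.0) * math.exp(-(theta + 2.0) ** 2))

def ascend(theta, alpha=0.1, steps=2000):
    """Plain gradient ascent: theta <- theta + alpha * grad J(theta)."""
    for _ in range(steps):
        theta = theta + alpha * grad_J(theta)
    return theta

theta_right = ascend(1.0)    # starts on the slope of the tall hill
theta_left = ascend(-1.0)    # starts on the slope of the small hill
```

Both runs follow the gradient faithfully, yet the second one settles on the smaller hill: the starting point decides which local optimum is found.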

θ starts somewhere random — press Ascend.

3 · Continuous actions — argmax has nothing to enumerate

Steering · torque · throttle

DQN acts by argmax_a Q(s,a): score every action, take the largest bar. Fine for 18 Atari buttons. But the front-wheel angle of a car lives in [−30°, +30°], an infinite set. There is nothing finite to loop over.

Discretising into bins is the usual hack, and you can drag the slider to feel the trap: coarse bins miss the true optimum (red ring), fine bins find it but explode — k bins over d action-dims = kᵈ outputs. A policy instead emits the parameters of a distribution (a Gaussian's μ, σ) and reads the best action off directly — no search at all.
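The kᵈ blow-up is easy to put numbers on. The sketch below compares the output count of a discretised Q-head with that of a diagonal-Gaussian policy head (one μ and one σ per action dimension); the function names and the 11-bin, 7-dimension example are illustrative choices.

```python
def q_head_outputs(bins_per_dim, action_dims):
    """Output neurons a discretised Q-network needs: one per joint action."""
    return bins_per_dim ** action_dims

def gaussian_head_outputs(action_dims):
    """Outputs of a diagonal-Gaussian policy head: one mean and one std per dim."""
    return 2 * action_dims

for d in (1, 2, 7):
    print(d, q_head_outputs(11, d), gaussian_head_outputs(d))
```

With 11 bins per dimension, a 7-dimensional action already needs 19,487,171 Q-outputs, against 14 numbers for the Gaussian head.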

🧠 The fix — one neuron that outputs the angle directly

Deterministic continuous policy · a = μ(s) ∈ [−30°, +30°]

No bins, no argmax. The network keeps the same body but ends in a single output neuron; a tanh squashes it to [−1, 1], then we scale by 30 to get the front-wheel angle in [−30°, +30°]. The continuous action is read straight off the head — and because it's a smooth function of θ, we can push gradients through it.
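As a small sketch of that output head: `pre_activation` stands for the raw value of the single output neuron, and the names are illustrative.

```python
import math

MAX_ANGLE_DEG = 30.0   # steering range from the text: [-30°, +30°]

def steering_angle(pre_activation):
    """Squash the raw output to [-1, 1] with tanh, then scale to degrees."""
    return MAX_ANGLE_DEG * math.tanh(pre_activation)

angles = [steering_angle(z) for z in (-5.0, -0.5, 0.0, 0.5, 5.0)]
```

However large the raw output gets, the angle stays inside the physical range, and it changes smoothly with the network's parameters.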

Value-based vs. policy-based — at a glance
Property               Value-based (DQN family)    Policy-based (this lab)
Learns                 Q(s,a)                      π(a|s)
Picks action via       argmax over Q               Sample from π
Discrete actions?      ✅ Native                   ✅ Categorical π
Continuous actions?    ❌ argmax intractable       ✅ Gaussian / Tanh π
Stochastic policies?   ❌ Always greedy            ✅ Built-in
Policy Gradient (PG) — the general recipe

Policy Gradient (PG) is the whole family this lab is built on, resting on one idea: raise the log-probability of actions that paid off, lower the ones that didn't. Let's first pin down the general estimator and the training loop every method here inherits — then build its simplest concrete instance, REINFORCE, and its variance-reducing upgrade, the baseline, further down this page.

The new objective: maximise expected return
Policy-gradient objective
J(θ) = E[ R(τ | πθ) ]                          // expected return under the policy
∇θ J(θ) = E_τ[ ∇θ log πθ(a|s) · R(τ) ]         // Policy Gradient Theorem
θ ← θ + α · ∇θ J(θ)                            // just gradient ascent (α = learning rate)

Instead of fitting a value table, we parameterise the policy directly as πθ(a|s) — a neural network whose output is a probability over actions (or, for continuous control, the parameters of a distribution). We then compute the gradient of the expected return with respect to θ and walk uphill.

The magic is the log-likelihood trick: ∇θ E[R] = E[ ∇θ log π · R ]. The right-hand side is a sample-able expectation: we just roll out the policy, multiply each log-prob gradient by the return that followed, and that's an unbiased gradient estimate. No bootstrapping, no targets, no replay buffer.

The Policy Gradient Theorem — where the log comes from

We can't differentiate J(θ) directly, because θ sits inside the distribution we're averaging over, not inside the thing being averaged. The fix is one line of calculus — the log-derivative (score-function) trick:

Deriving the theorem
J(θ) = E[ R(τ | πθ) ] = Σ_τ P(τ;θ) R(τ)                   // expected return = a P-weighted sum
∇θ J(θ) = ∇θ Σ_τ P(τ;θ) R(τ)                              // sum over trajectories τ
        = Σ_τ ∇θ P(τ;θ) R(τ)                              // gradient of a sum
        = Σ_τ [ P(τ;θ) / P(τ;θ) ] · ∇θ P(τ;θ) R(τ)        // ×1 trick: multiply & divide by P
        = Σ_τ P(τ;θ) · [ ∇θ P(τ;θ) / P(τ;θ) ] · R(τ)      // regroup
        = Σ_τ P(τ;θ) ∇θ log P(τ;θ) R(τ)                   // since ∇log P = ∇P / P
        = E_τ[ ∇θ log P(τ;θ) · R(τ) ]                     // a P-weighted sum = an expectation

and  log P(τ;θ) = log p(s_0) + Σ_t [ log πθ(a_t|s_t) + log p(s_{t+1}|s_t,a_t) ]
so   ∇θ log P(τ;θ) = Σ_t ∇θ log πθ(a_t|s_t)               // dynamics have no θ → they vanish!
⇒    ∇θ J(θ) = E_τ[ Σ_t ∇θ log πθ(a_t|s_t) · R(τ) ]       // Policy Gradient Theorem

Two things make this theorem so useful. First, the environment dynamics p(s_{t+1}|s_t,a_t) don't depend on θ, so when we take the log-gradient they drop out completely: we never need a model of the world. Second, the result is an expectation, so we estimate it by simply rolling out the policy and averaging ∇θ log πθ(a|s) weighted by the return R(τ).

Reading it intuitively: ∇θ log πθ(a|s) points in the direction that makes action a more likely; multiplying by R(τ) means good trajectories push their actions up and bad ones push theirs down. That is the entire idea behind every algorithm in this lab. REINFORCE is just this estimator with R(τ) = the Monte-Carlo return.
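The theorem can be sanity-checked numerically on the smallest possible problem: a one-step task with a softmax policy over three actions. The sketch below computes the score-function form E[∇ log π · R] exactly by enumerating actions and compares it with a finite-difference derivative of J. The rewards, logits, and function names are illustrative.

```python
import math

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def expected_return(theta, rewards):
    """J(theta) = sum_a pi(a) * R(a) for a one-step episode."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))

def score_function_gradient(theta, rewards):
    """E_a[ grad log pi(a) * R(a) ], computed exactly by enumerating actions."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for a, (p_a, r_a) in enumerate(zip(probs, rewards)):
        for i in range(len(theta)):
            # d/d theta_i of log softmax(theta)[a] is 1[i == a] - pi(i)
            dlogp = (1.0 if i == a else 0.0) - probs[i]
            grad[i] += p_a * dlogp * r_a
    return grad

def finite_difference_gradient(theta, rewards, eps=1e-6):
    """Central-difference estimate of dJ/dtheta, used only as a reference."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((expected_return(up, rewards)
                     - expected_return(down, rewards)) / (2 * eps))
    return grad

theta = [0.5, -0.3, 0.1]      # illustrative logits
rewards = [1.0, 0.0, 2.0]     # illustrative per-action returns
pg = score_function_gradient(theta, rewards)
fd = finite_difference_gradient(theta, rewards)
```

The two gradients agree to within numerical error, even though the score-function version never differentiates the reward itself.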

REINFORCE — Monte-Carlo policy gradient (Williams, 1992)

REINFORCE is the simplest possible algorithm that learns by walking up the policy-gradient hill. The recipe is just three lines:

📄 Original paper — Williams (1992), Simple statistical gradient-following algorithms for connectionist reinforcement learning.

REINFORCE — one episode
1. Roll out a whole episode with πθ:   τ = (s_0, a_0, r_0, s_1, a_1, r_1, …, s_T)
2. For every step t, compute the remaining return:   G_t = Σ_{k=t..T} γ^(k−t) · r_k
3. Take one gradient step:   θ ← θ + α · Σ_t ∇θ log πθ(a_t|s_t) · G_t
Pseudocode · the full training loop
# ── INITIALISE ──────────────────────────────────────────
initialize policy network πθ                  # random parameters θ
for iteration = 1, 2, … :                     # …and repeat with the updated params
    # ── PLAY n GAMES (rollouts): sample, don't argmax ───
    D ← [ ]
    for game in 1 … n:
        τ ← [ ] ; s ← env.reset()
        while not done:
            a ~ πθ(·|s)                       # sample an action from the policy
            s′, r, done ← env.step(a)
            τ.append( (s, a, r) ) ; s ← s′
        D.append(τ)
    # ── LABEL every decision good / bad by its outcome ──
    for τ in D:
        for t in τ:
            G_t ← Σ_k≥t γ^(k−t) · r_k         # return after t = the "label"
    # ── UPDATE: win → raise prob (↑), lose → lower (↓) ──
    J ← (1/n) Σ_τ Σ_t log πθ(a_t|s_t) · G_t   # weight each log-prob by its return
    θ ← θ + α · ∇θ J                          # one gradient-ascent step on J(θ)
Legend: πθ used (sample / score action) · θ updated (gradient ascent) · ▲ new in REINFORCE: weight ∇log π by the Monte-Carlo return G_t
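For readers who want to run the loop, here is a hedged, standard-library sketch of REINFORCE on a deliberately tiny problem. The environment is invented for illustration (a single state, three steps per episode, action 1 pays reward 1 and action 0 pays 0), the policy is two softmax logits rather than a network, and one episode is used per update (n = 1).

```python
import math
import random

GAMMA = 0.9        # discount
ALPHA = 0.1        # learning rate
EPISODE_LEN = 3    # steps per episode in the toy task

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def run_episode(theta, rng):
    """Toy environment (hypothetical): action 1 pays 1, action 0 pays 0."""
    probs = softmax(theta)
    trajectory = []
    for _ in range(EPISODE_LEN):
        a = rng.choices([0, 1], weights=probs, k=1)[0]   # sample, don't argmax
        r = 1.0 if a == 1 else 0.0
        trajectory.append((a, r))
    return trajectory

def reinforce(num_episodes=1000, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]                       # one logit per action
    for _ in range(num_episodes):
        traj = run_episode(theta, rng)
        probs = softmax(theta)               # the policy that produced traj
        grad = [0.0, 0.0]
        g_t = 0.0
        for a, r in reversed(traj):          # walk backwards to build G_t
            g_t = r + GAMMA * g_t
            for i in range(2):
                # grad of log softmax: 1[i == a] - pi(i), weighted by G_t
                grad[i] += ((1.0 if i == a else 0.0) - probs[i]) * g_t
        theta = [t + ALPHA * g for t, g in zip(theta, grad)]   # ascent step
    return softmax(theta)

final_probs = reinforce()
```

After training, almost all probability mass sits on the rewarded action. A real task would replace the two logits with a network and the toy loop with an environment's reset/step calls.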

So what is G_t? It's the return-to-go (a.k.a. remaining return): the total discounted reward collected from step t to the end of the episode. It's the score we hang on the action a_t: "given that I took this action here, how did the rest of the episode actually turn out?"

G_t = r_t + γ r_{t+1} + γ² r_{t+2} + … + γ^(T−t) r_T

Two things to notice. The discount γ ∈ [0, 1) makes reward that arrives sooner count for more than far-future reward. And G_t sums only rewards from t onward: an action is never credited for reward that was already banked before it was taken (that's the "reward-to-go" idea).

In the update, G_t is simply the weight on each action's log-prob gradient: a big positive G_t shoves πθ(a_t|s_t) up, a negative one shoves it down. Because it's the actual sampled return, not an estimate, it's unbiased, but it swallows every random event for the rest of the episode, which is exactly why it's so noisy (the variance problem below).
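A short worked example of the return-to-go, computed right to left with the recursion G_t = r_t + γ · G_{t+1}. The reward sequence and γ = 0.5 are made-up numbers chosen to keep the arithmetic easy to follow.

```python
def returns_to_go(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed right to left in one pass."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

G = returns_to_go([1.0, 0.0, 2.0], gamma=0.5)
# G[2] = 2.0
# G[1] = 0.0 + 0.5 * 2.0 = 1.0
# G[0] = 1.0 + 0.5 * 1.0 = 1.5
```

The last action is credited only with its own reward, while the first is credited with everything that followed, discounted.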

Learning curve — and why it's noisy

This is a simulated REINFORCE training run on Lunar Lander — return per episode (light) and a 50-episode running mean (bold). Notice the massive episode-to-episode variance: a single unlucky rollout can drop the return by 200. That noise is precisely what the next tab fights.

REINFORCE-with-baseline — the algorithm
Two networks: actor & baseline
actor:     πθ(a|s)    // outputs action probabilities
baseline:  Vφ(s)      // scalar value estimate

1. Roll out an episode.
2. Compute G_t for every step.
3. Compute advantage:  A_t = G_t − Vφ(s_t)
4. Update actor:       θ ← θ + α_π · ∇θ log π(a_t|s_t) · A_t
5. Update baseline:    φ ← φ − α_V · ∇φ ( G_t − Vφ(s_t) )²    // MSE regression to G_t (gradient descent)
The baseline trick: add no bias, kill the variance

Here is a beautiful mathematical fact: for any function b(s) that depends only on the state (not the action), subtracting it from the return inside the gradient is free — the expected gradient is unchanged.

Why the baseline is unbiased
E_a[ ∇θ log π(a|s) · b(s) ] = b(s) · Σ_a π(a|s) ∇θ log π(a|s)
                            = b(s) · Σ_a ∇θ π(a|s)
                            = b(s) · ∇θ Σ_a π(a|s)
                            = b(s) · ∇θ 1 = 0
⇒ ∇θ J = E[ ∇θ log π(a|s) · ( G_t − b(s_t) ) ]    // same gradient, lower variance

So we can subtract anything that's a function of state. The standard choice, which removes most of the variance in practice, is the state-value function V(s). That gives us the advantage.
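Both halves of the claim (same mean, lower variance) can be verified exactly on a toy case. The sketch assumes a single state, a uniform softmax policy over three actions, and made-up rewards that are all large and close together, which is where a baseline helps most. It tracks one gradient component, the derivative with respect to the first logit.

```python
import math

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def grad_stats(theta, rewards, baseline):
    """Exact mean and variance of the single-sample estimator
    g = d/dtheta_0 log pi(a) * (R(a) - baseline), enumerating every action."""
    probs = softmax(theta)
    samples = []
    for a, (p_a, r_a) in enumerate(zip(probs, rewards)):
        dlogp0 = (1.0 if a == 0 else 0.0) - probs[0]
        samples.append((p_a, dlogp0 * (r_a - baseline)))
    mean = sum(p * g for p, g in samples)
    var = sum(p * (g - mean) ** 2 for p, g in samples)
    return mean, var

theta = [0.0, 0.0, 0.0]                  # uniform policy
rewards = [10.0, 11.0, 12.0]             # illustrative returns
value = sum(p * r for p, r in zip(softmax(theta), rewards))   # V(s) = 11

mean_plain, var_plain = grad_stats(theta, rewards, baseline=0.0)
mean_base, var_base = grad_stats(theta, rewards, baseline=value)
```

The expected gradient is identical with and without the baseline, while the variance drops by more than two orders of magnitude in this example.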

Before vs. after — same task, less noise

Red = vanilla REINFORCE. Green = REINFORCE-with-baseline. Same final return, same number of episodes — but the green curve is dramatically smoother. That smoothness translates directly into being able to use a larger learning rate, which translates into faster real-world learning.

From REINFORCE to Actor-Critic

The policy gradient is the same one REINFORCE used; every Actor-Critic method just changes the weight on ∇θ log π. REINFORCE weights it by the noisy Monte-Carlo return G_t; Actor-Critic swaps that for a bootstrapped estimate built from the critic's value V(s).

The one change Actor-Critic makes
REINFORCE:     ∇θ J(θ) = E_τ[ ∇θ log πθ(a|s) · G_t ]               // full Monte-Carlo return
               G_t  ──swap──▶  r + γ V(s′)                          // bootstrap from the critic's value
Actor-Critic:  ∇θ J(θ) = E_τ[ ∇θ log πθ(a|s) · ( r + γ V(s′) ) ]   // critic-based estimate of the return
Combining policy and value learning

Actor πθ(a|s)

state s  →  policy net θ  →  π(a₀), π(a₁), π(a₂)

Outputs probabilities over actions (or a distribution mean/std for continuous). Updated by the policy-gradient with the critic's advantage.

Critic Vφ(s)

state s  →  value net φ  →  V(s)

Outputs a single scalar — the expected return from this state. Updated by TD regression to r + γ V(s').

The actor picks the action. The critic tells the actor whether the action was better or worse than expected. The actor takes a tiny gradient step accordingly. Round trip: ~5 ms. No full episode required.

Actor-Critic stitches the two families together: the actor is a policy-based learner, the critic is a value-based learner, and each one patches the other's weakness.

The actor is the policy πθ(a|s) — it selects actions and is what we ultimately deploy. The critic learns a value function Vφ(s) (or Qφ(s,a)) — it never picks actions; it just critiques the actor's choices with evaluative feedback.

The critic's signal: the TD error
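TD error:  δ_t = r_t + γ Vφ(s_{t+1}) − Vφ(s_t)
actor:     θ ← θ + α ∇θ log πθ(a_t|s_t) · ( r_t + γ Vφ(s_{t+1}) )    // nudge policy by the critique
critic:    φ ← φ + α δ_t ∇φ Vφ(s_t)                                   // shrink its own TD error (δ²)

Instead of waiting for the noisy Monte-Carlo return G_t, the actor reads a bootstrapped estimate built from the critic's value: the reward it just received plus the discounted value γ V(s′) of the state it landed in. It nudges its policy with that critique instead of a whole episode's return. (The bare V(s) of the current state would not work as the weight: it does not depend on the action, so by the baseline identity its expected gradient is zero.) Meanwhile the critic keeps fitting V(s) more accurately, so the signal the actor learns from gets sharper as training goes on.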

That bargain buys three things at once: lower variance (the critic's bootstrapped estimate is far steadier than a whole-episode return), faster learning (we update every step instead of every episode), and a natural fit for continuous action spaces — exactly the limits of purely policy-based or purely value-based methods.

A2C (Advantage Actor-Critic)
The one change A2C makes
Actor-Critic:  ∇θ J(θ) = E_τ[ ∇θ log πθ(a|s) · ( r + γ V(s′) ) ]   // bootstrapped action-value, Q(s,a) ≈ r + γ V(s′)
               A_t = Q(s,a) − V(s) ≈ r + γ V(s′) − V(s)            // subtract the baseline V(s) to get the advantage
A2C:           ∇θ J(θ) = E_τ[ ∇θ log πθ(a|s) · A_t ]               // A_t = the value-based advantage above

A2C makes one more upgrade: it subtracts the baseline V(s) from the critic's action-value estimate, giving the advantage A_t = Q(s,a) − V(s). Where V(s) only says "how good is this state on average?", A_t asks the sharper question: "how much better (or worse) was this action than the critic's average expectation for the state?"

That centring is what makes the gradient informative — actions that beat the baseline V(s) get pushed up, actions that fall short get pushed down — while keeping variance low.

Actor-Critic — the leap to bootstrapping
TD-style advantage
TD advantage:  A_t = r_t + γ · V(s_{t+1}) − V(s_t)
actor:         θ ← θ + α_π · ∇θ log π(a_t|s_t) · A_t
critic:        φ ← φ − α_V · ∇φ ( r_t + γ V(s_{t+1}) − V(s_t) )²    // descent on the squared TD error, target held fixed
One-step Actor-Critic — pseudocode

The two updates happen every transition. The critic lines fit the value function; the actor line nudges the policy using the advantage A_t, which for one-step Actor-Critic is the TD error r + γV(s′) − V(s).

initialize actor πθ and critic Vφ              # random θ, φ
for each episode:
    s ← env.reset()
    while not done:
        a ~ πθ(·|s)                             # ACTOR selects an action
        s′, r, done ← env.step(a)
        # ── CRITIC: evaluate → advantage (TD error) ──
        A_t ← r + γ Vφ(s′) − Vφ(s)              # advantage = TD error (take Vφ(s′) = 0 if s′ is terminal)
        φ ← φ + α_V · A_t ∇φ Vφ(s)              # ← UPDATE CRITIC (fit value)
        # ── ACTOR: improve policy with the critique ──
        θ ← θ + α_π · ∇θ log πθ(a|s) · A_t      # ← UPDATE ACTOR (raise/lower π)
        s ← s′                                  # online: one step per transition
Legend: network used (forward) · actor update θ (the policy) · critic update φ (the value) · ▲ new in AC: bootstrapped TD advantage
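One transition of the loop above, written out as a hedged tabular sketch: the actor is a table of softmax logits per state and the critic a table of values, so ∇φ V(s) is simply 1 for the visited state. The state numbering, step sizes, and function name are illustrative assumptions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_step(theta, V, s, a, r, s_next, done,
                      gamma=0.99, alpha_pi=0.1, alpha_v=0.1):
    """One online update for a tabular softmax actor and a tabular critic.

    theta[s] is a list of action logits, V[s] a scalar value estimate.
    Returns the TD error that was used as the advantage."""
    v_next = 0.0 if done else V[s_next]       # no bootstrap past a terminal state
    advantage = r + gamma * v_next - V[s]     # A_t = TD error
    probs = softmax(theta[s])                 # policy before the update
    V[s] += alpha_v * advantage               # critic: move V(s) toward the target
    for i in range(len(theta[s])):
        dlogp = (1.0 if i == a else 0.0) - probs[i]
        theta[s][i] += alpha_pi * dlogp * advantage   # actor: raise or lower pi
    return advantage

theta = {0: [0.0, 0.0], 1: [0.0, 0.0]}   # two states, two actions
V = {0: 0.0, 1: 0.5}
before = softmax(theta[0])[1]
delta = actor_critic_step(theta, V, s=0, a=1, r=1.0, s_next=1, done=False)
after = softmax(theta[0])[1]
```

The transition turned out better than the critic expected (positive TD error), so the probability of the action taken goes up and V(0) moves toward the bootstrapped target.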
Why a stochastic policy wins — games a deterministic policy can't

A2C's actor is a stochastic policy π(a|s): it outputs a distribution over actions and samples — unlike a value-greedy or deterministic policy a = μ(s) that always picks one fixed action. Why does that matter? When the best behaviour is to be unpredictable, or when many states look identical, a deterministic policy can be read, exploited, or trapped. Each game below pits a deterministic policy against a stochastic one — watch deterministic lose. Representing a distribution over actions is exactly the superpower a policy-gradient method like A2C gives you.

Now you play against the model. Pick the Deterministic opponent — it always throws ✊, so just play ✋ every time and you win every round. A fixed policy is a habit, and habits get exploited.

Switch to the Stochastic opponent (⅓ ✊, ⅓ ✋, ⅓ ✌ — the Nash equilibrium) and try again: there's no pattern to punish, so however you play your score just hovers around zero — unexploitable. The optimum here is a distribution, something a = μ(s) can't represent. (▶ Auto-play me plays the best counter for you.)

A3C — Asynchronous Advantage Actor-Critic

A3C (Mnih et al., 2016) actually came first, and A2C is its synchronous variant. The loss is the same, but each A3C worker runs its own copy of the env on its own CPU thread and pushes gradients to a shared parameter server without waiting. The result feels like SGD on a chaotic mini-batch, and in practice it works.

One worker · one inner loop
loop:
    θ' ← global.θ                           // ⇣ pull: local net updated from global
    collect n steps with πθ'
    compute A_t for each step
    compute ∇L (actor + critic + entropy)
    // no lock! no average! just shoot it at the server:
    global.θ ← global.θ − α · ∇L            // L is a loss, so step downhill
Legend: local net used (πθ' rollout) · local net updated (θ' ← global.θ, pull) · ▲ new in A3C: asynchronous, lock-free push of ∇L into the shared global net
Why no locks doesn't break

Two workers pushing gradients at the same time will sometimes overwrite each other. This is called Hogwild! updating. Its convergence proof assumes the updates are sparse enough; for dense neural-network gradients the justification is mostly empirical. In practice the workers are also de-synchronised (they're at different points in different episodes), so their gradients are diverse, which helps reduce correlation.

The big advantage of A3C over single-thread methods at the time was not faster gradients per second — it was that the diversity of trajectories acted like a replay buffer would in DQN, decorrelating updates and removing the need for one.

How A3C learns — one brain, many hands

A3C is built around a single global network holding the shared parameters (θ for the actor, φ for the critic). Around it run many independent workers, each with its own copy of the networks and its own environment instance — so exploration happens in parallel across CPU threads.

What each worker computes (n-step)
n-step return:  R_t = Σ_{k=0..n−1} γ^k r_{t+k} + γ^n Vφ(s_{t+n})
advantage:      A_t = R_t − Vφ(s_t)
actor loss:     L_π = − log πθ(a_t|s_t) · A_t − β · H[π]    // entropy bonus β
critic loss:    L_V = ( R_t − Vφ(s_t) )²
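The four quantities above, evaluated for one step with made-up numbers. This is a sketch with scalar inputs rather than networks; in an autodiff framework the advantage would normally be treated as a constant inside the actor loss. The function names and example values are illustrative.

```python
import math

def n_step_return(rewards, bootstrap_value, gamma):
    """R_t = sum_k gamma^k * r_{t+k} + gamma^n * V(s_{t+n}), n = len(rewards)."""
    ret = bootstrap_value
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

def entropy(probs):
    """Shannon entropy H[pi] of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def worker_losses(rewards, v_start, v_end, probs, action, gamma=0.99, beta=0.01):
    """Actor and critic losses for the first step of an n-step rollout."""
    R = n_step_return(rewards, v_end, gamma)
    advantage = R - v_start
    actor_loss = -math.log(probs[action]) * advantage - beta * entropy(probs)
    critic_loss = (R - v_start) ** 2
    return R, advantage, actor_loss, critic_loss

# Three rewards, then bootstrap from V(s_{t+3}) = 10 with gamma = 0.5:
# R = 1 + 0.5*0 + 0.25*2 + 0.125*10 = 2.75
R, A, L_pi, L_V = worker_losses([1.0, 0.0, 2.0], v_start=2.0, v_end=10.0,
                                probs=[0.5, 0.5], action=0, gamma=0.5)
```

Each worker would compute these for every step of its rollout and push the summed gradient.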

① Pull

each worker copies the latest global θ, φ into its local nets, then rolls out n steps in its own env.

② Push (async)

it computes ∇L locally and fires it straight at the global network through a shared optimiser (RMSProp/Adam): no lock, no waiting (Hogwild!).

③ Decorrelate

workers sit at slightly different policy versions, so their experience is diverse — this replaces the replay buffer and stabilises training.

Adapted from APXML · Asynchronous Advantage Actor-Critic (A3C).

Global ⇄ workers — actor & critic nets with live neuron values
Each component holds two tiny networks: a purple actor πθ (state → action probs) and a teal critic Vφ (state → value). The number on every neuron is its live activation from a real forward pass of the same input state s (top-left). Top = the GLOBAL nets; each worker has its own local copy. Use ▷ Step to advance one rollout tick at a time — each click, every worker takes a step (its bar fills); when a bar completes the worker pushes its gradient ⇡ into the global net (neurons shift and flash gold), bumps its episode counter, and syncs ⇣ the fresh global weights back. Each worker's progress bar shows its episode count and rollout %; the workers run desynchronised, so they complete at different ticks. Hit ▶ Play to auto-step.
PPO — the workhorse (Schulman et al., 2017)

Vanilla policy gradient has a vicious failure mode. If a single update is too large, the policy can shift so far that the next batch of rollouts comes from a totally different distribution — and the gradient estimate from that batch becomes useless. The policy collapses, and you have to start over.

Trust-Region Policy Optimisation (TRPO) solved this with a second-order constraint on the KL divergence between old and new policies. Beautiful, but heavy. PPO is the same idea, but simple enough to fit on a napkin.

PPO-clip objective
ratio:      r_t(θ) = πθ(a_t|s_t) / πθ_old(a_t|s_t)
surrogate:  L1 = r_t(θ) · A_t
            L2 = clip( r_t(θ), 1−ε, 1+ε ) · A_t
PPO-clip:   L^CLIP = E[ min( L1, L2 ) ]
ε ≈ 0.2  ⇒  "no update is rewarded for moving the ratio more than 20%"

Two ideas, that's the whole trick:

  • ratio r_t: instead of A2C's raw log π, PPO measures how much the new policy differs from the old one that collected the batch: r_t = πθ / πθ_old. r_t = 1 means "unchanged"; r_t > 1 means the new policy makes that action more likely.
  • clip: the surrogate r_t·A_t would happily push r_t far from 1 on a single good batch and blow up the policy. So PPO also computes the clipped version (r_t pinned to [1−ε, 1+ε]) and keeps the min of the two. Once a step has moved the ratio past ±ε, the clip flattens the gradient, so there's no more reward for moving further. That's the trust region, enforced with one min.
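The two bullets translate almost word for word into code. A minimal sketch for a single (ratio, advantage) sample; the function name and test values are illustrative.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """min( r * A, clip(r, 1 - eps, 1 + eps) * A ) for one sample."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

good_small = ppo_clip_objective(1.1, +1.0)   # inside the band: plain r * A
good_large = ppo_clip_objective(1.5, +1.0)   # capped at (1 + eps) * A
bad_large = ppo_clip_objective(0.5, -1.0)    # capped at (1 - eps) * A
bad_wrong = ppo_clip_objective(1.5, -1.0)    # not capped: a bad action made likelier
```

The last case shows why the min matters: when the ratio moves in the harmful direction, the unclipped (more pessimistic) term is kept, so the penalty is never hidden by the clip.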
What PPO changes from A2C

PPO keeps A2C's whole backbone — a stochastic actor, a value critic, GAE advantages, parallel envs, the entropy bonus. It changes three things, and those three are what turn A2C into the on-policy default.

rt
① Probability ratio
Importance weight
A2C: ∇log π·A on fresh data. PPO uses r_t = πθ / πθ_old against the policy that collected the batch, so it can train on slightly-old data.
clip
② Clipped trust region
Bounded step
A2C's one unbounded step can collapse. PPO clips r_t to [1−ε, 1+ε] and keeps the min, so no update is rewarded for moving more than ~ε. TRPO's idea, with first-order optimisation only.
×10
③ Multi-epoch reuse
Reuse the batch
A2C: 1 update/batch, then discard. PPO: ~4–10 epochs of minibatch SGD per batch — safe because the clip keeps πθ near πθ_old. ~10× learning per env step.
PPO's two networks — what goes in, what comes out

PPO is an actor-critic method: one network to act, one to judge. In code they usually share a torso and split into two heads, but conceptually they are two distinct functions — here is the name, input and output of each.

Actor — πθ(a|s)

the policy we deploy
state s  →  policy net θ  →  π(a₀), π(a₁), π(a₂)
INPUT: state s, the observation vector (e.g. the 8 Lunar-Lander sensors)
OUTPUT: a distribution over actions: softmax π(a|s) for discrete, or Gaussian μ(s), σ(s) for continuous. We sample the action from it (never argmax).

Critic — Vφ(s)

the value baseline for the advantage
state s  →  value net φ  →  V(s)
INPUT: state s, the same observation vector the actor sees
OUTPUT: one scalar V(s), the expected return from s. Feeds the GAE advantage Â_t, which weights every policy update.
Clip Noise

Heads-up — a different clip. PPO clips the probability ratio r = π/πold; a sibling continuous-control method, TD3, clips Gaussian noise instead — same word, a very different job. Here is that other clip, up close. Target policy smoothing replaces the single target action with a small noisy neighbourhood around it: ã = μθ'(s′) + clip( 𝒩(0, σ̃), −c, +c ). Two ideas stacked: a Gaussian for the noise, and a clip to bound it. The graph shows both — drag the sliders to see how the clip folds the Gaussian's long tails back onto ±c.

Slider defaults: noise σ̃ = 0.20, clip c = 0.50 (noise clamped to ±c).
The blue bell is the raw Gaussian noise; the teal area is the part kept unchanged; the coral spikes at ±c are the long tails folded back onto the boundary by the clip. Shrink c or grow σ̃ and watch more probability pile up at the edges.

Why add noise at all?

The critic is trained to fit Q at exactly the actor's target action. If that Q-surface has a sharp, spurious peak (function-approximation error), the deterministic actor steers straight into it. Sampling a little noise around the target action and averaging turns the target into a smooth local average of Q — a SARSA-like regulariser, so the policy can't exploit a one-pixel spike.

Why clip it?

A raw Gaussian has unbounded tails — every so often it would throw the target action far from μ(s′), into a totally different action regime and poison the target with nonsense. Clipping to [−c, +c] keeps the smoothing local and controlled: average over a small neighbourhood, never over wild outliers. Smooth — but only a little.
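Target policy smoothing for a one-dimensional action, as a hedged sketch. The defaults σ̃ = 0.2 and c = 0.5 follow the slider values above; the final clamp of the action to [low, high] is an added assumption about the action bounds, not something stated in the text.

```python
import random

def smoothed_target_action(target_action, rng, sigma=0.2, clip=0.5,
                           low=-1.0, high=1.0):
    """a~ = mu'(s') + clip(N(0, sigma), -c, +c), kept inside the action range."""
    noise = rng.gauss(0.0, sigma)
    noise = max(-clip, min(noise, clip))                 # bound the noise to [-c, +c]
    return max(low, min(target_action + noise, high))    # assumed action bounds

rng = random.Random(0)
# Deliberately large sigma so the clip is visibly active.
samples = [smoothed_target_action(0.0, rng, sigma=2.0) for _ in range(1000)]
```

With a wide Gaussian, most draws land exactly on ±c, which is the pile-up at the edges described for the coral spikes.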

What does clipping actually do?

The horizontal axis is the ratio r = π/π_old. The vertical axis is the PPO objective for one (s, a, A) sample.

When A > 0 (good action): the objective rises with r until r = 1+ε, then flattens. Even if the network could make the action much more probable, PPO refuses to reward it.

When A < 0 (bad action): the objective rises as r falls, until r = 1−ε, then flattens. PPO stops rewarding any push of the action probability below (1−ε)·π_old.

The min ensures that the pessimistic bound applies — if the network tried to be optimistic, PPO clips it. The net effect: every update step is contained in a "trust region" of ratio space.

Multiple epochs per batch — the second PPO superpower
PPO training loop
1. Roll out N steps with πθ_old on K parallel envs.
2. Compute advantages (typically with GAE).
3. For 4–10 epochs, mini-batch the rollout and step the loss.
4. πθ_old ← πθ. Resample.

A2C does one gradient step per batch of rollouts. PPO runs several epochs of minibatch steps on it, typically 4–10. Because the clip keeps the new π close to π_old, the rollouts stay roughly valid for the next epoch, and we squeeze up to ~10× more learning out of every batch of expensive environment interactions. This is the real reason PPO is the modern default.

PPO — full pseudocode
Pseudocode · collect → GAE → optimise for E epochs
# ── INITIALISE ──────────────────────────────────────────
initialize actor πθ and critic Vφ            # random θ, φ (often a shared torso + 2 heads)
πθ_old ← πθ
for iteration = 1, 2, … :
    # ── COLLECT: roll out the frozen πθ_old on K parallel envs ──
    D ← [ ]
    for N steps (× K envs):
        a ~ πθ_old(·|s)                       # sample; store logπ_old(a|s) and Vφ(s)
        s′, r, done ← env.step(a)
        D.append( (s, a, r, logπ_old, Vφ(s)) ) ; s ← s′
    # ── ADVANTAGE: GAE-λ straight from the critic ──────────
    δ_t ← r_t + γ·Vφ(s_t+1) − Vφ(s_t)         # one-step TD error
    Â_t ← Σ_l≥0 (γλ)^l · δ_t+l                # generalised advantage estimate
    Ĝ_t ← Â_t + Vφ(s_t)                       # regression target for the critic
    # ── OPTIMISE: reuse the SAME batch for 4–10 epochs ─────
    for epoch = 1 … E:                        # ▲ new in PPO: reuse one batch E times
        for minibatch in D:
            r_t(θ) ← πθ(a_t|s_t) / πθ_old(a_t|s_t)            # probability ratio
            L_clip ← min( r_t·Â_t, clip(r_t,1−ε,1+ε)·Â_t )    # ▲ new in PPO: clipped trust region
            L_V ← ( Vφ(s_t) − Ĝ_t )²                          # critic MSE (↓)
            L ← −L_clip + c1·L_V − c2·H[πθ(·|s_t)]            # +entropy bonus for exploration
            θ, φ ← Adam step on ∇L
    πθ_old ← πθ                               # freeze a fresh copy → resample next batch
Legend: nets used (πθ, πθ_old, Vφ, forward) · θ, φ updated (Adam on ∇L) · ▲ new in PPO: clipped surrogate + multi-epoch batch reuse

Two loops make PPO PPO. The outer loop collects a fresh batch and re-freezes πθ_old. The inner loop reuses that one batch for several epochs. That is safe only because the clip keeps πθ from drifting far from πθ_old, so the off-policy ratio r_t stays well-behaved. The actor climbs L^CLIP, the critic regresses to Ĝ_t, and the entropy term keeps the distribution from collapsing too early.
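The GAE step of the pseudocode is usually computed with a single backward pass, since Â_t = δ_t + γλ · Â_{t+1}. The sketch below assumes one rollout segment with no episode boundary in the middle; `last_value` is the critic's estimate for the state after the final step (0 if that state is terminal). Names and example numbers are illustrative.

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalised advantage estimates and critic targets for one segment.

    values[t] = V(s_t); last_value = V(s_T) used to bootstrap the final step."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # one-step TD error
        running = delta + gamma * lam * running               # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = running
        next_value = values[t]
    targets = [a + v for a, v in zip(advantages, values)]     # G_t = A_t + V(s_t)
    return advantages, targets

# lam = 1 recovers Monte-Carlo return minus V; lam = 0 recovers the TD error.
adv_mc, tgt_mc = gae([1.0, 1.0], [0.5, 0.5], 0.0, gamma=1.0, lam=1.0)
adv_td, tgt_td = gae([1.0, 1.0], [0.5, 0.5], 0.0, gamma=1.0, lam=0.0)
```

The parameter λ slides between the two extremes met earlier in the lab: the low-bias, high-variance Monte-Carlo weight and the low-variance, bootstrapped TD weight.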

PPO vs A2C — learning curves

Simulated learning curves on a Lunar-Lander-class env. Both algorithms reach the same final return — but PPO does it in ~⅓ the environment steps. The clip prevents catastrophic updates, the multiple epochs squeeze more juice from each batch, and the entropy bonus + GAE keep exploration alive throughout.

PPO in ChatGPT — the RL behind RLHF

PPO isn't just for robots and games — it's the algorithm that aligned ChatGPT. After a language model is pretrained and supervised-fine-tuned, the final polish is RLHF — Reinforcement Learning from Human Feedback — and the optimiser at its heart is PPO (OpenAI's InstructGPT → ChatGPT recipe).

The three-stage RLHF pipeline

1 · SFT

Supervised fine-tune the pretrained LM on human-written demonstrations of good answers. This becomes the starting policy πref.

2 · Reward model

Humans rank several model answers to the same prompt; train a reward model rφ to predict which answer a human prefers — a learned, automatic judge.

3 · PPO

Fine-tune the LM with PPO to maximise that reward model's score — pushing the policy toward answers humans rate highly.

How generating text becomes an RL problem
RL concept             …in ChatGPT's RLHF
Policy πθ (actor)      the language model itself; it "acts" by emitting tokens
State s                the prompt + every token generated so far
Action a               the next token sampled from the vocabulary (a ~50k-way discrete action)
Episode / trajectory   generating one full response, token by token
Reward r               the reward model's score of the finished answer, minus a per-token KL penalty
Critic Vφ              a value head on the LM estimating expected reward (feeds the advantage / GAE)
RLHF objective — PPO on a KL leash
maximise   E[ r_φ(prompt, answer) ]   −   β · KL( πθ ‖ πref )
           └───── reward model ─────┘      └─ stay close to the SFT model ─┘
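One common way to feed this objective to PPO is to turn it into per-token rewards: every token pays a KL penalty, and the reward model's score is added on the final token. The sketch below assumes that convention and estimates the KL term per token as log πθ − log πref on the sampled token; the function name, β value, and log-probabilities are made up for illustration.

```python
def rlhf_token_rewards(rm_score, logp_policy, logp_ref, beta=0.1):
    """Per-token rewards: -beta * (log pi - log pi_ref) for every token,
    plus the reward-model score on the last token of the answer."""
    if len(logp_policy) != len(logp_ref) or not logp_policy:
        raise ValueError("need one log-prob per generated token from each model")
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score
    return rewards

# Two-token answer: the policy likes the first token more than the SFT model does.
rewards = rlhf_token_rewards(rm_score=2.0,
                             logp_policy=[-1.0, -0.5],
                             logp_ref=[-1.2, -0.5],
                             beta=0.1)
total = sum(rewards)   # reward-model score minus beta * summed log-ratio
```

Tokens where the policy has drifted away from the reference model are taxed individually, which is what gives the critic and GAE a per-step signal to work with.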

Why PPO here — and exactly when it runs

When: the final training stage, after pretraining and SFT. This is the alignment step that turns a raw next-token predictor into an assistant that follows instructions, refuses harmful requests, and sounds helpful.

Why PPO and not vanilla policy gradient: the clip plus the explicit KL penalty keep the fine-tuned model from drifting far from the SFT model — without that leash the policy "reward-hacks" the reward model and collapses into repetitive gibberish that scores high but reads terribly. PPO's multi-epoch reuse also squeezes maximum learning from each batch of expensive reward-model-scored generations, and it stays stable at billion-parameter scale.

Note: newer alignment methods (e.g. DPO) skip the explicit PPO loop by optimising the preference data directly — but PPO was the original recipe behind InstructGPT and ChatGPT, and is still widely used for RLHF.

DDPG — Deep Deterministic Policy Gradient (Lillicrap et al., 2015)

Everything so far was on-policy: roll out, learn, throw away. That's expensive when each environment step costs minutes (real robots, simulators). For continuous action spaces, can we go back to the DQN-style sample-efficient world of replay buffers and target networks?

DDPG says yes. The trick is to learn a deterministic policy μθ(s) = a: no distribution, just a function from state to a chosen action. Then the critic's gradient w.r.t. the action, chained through the actor's parameters, gives the policy's gradient.

DDPG's two networks — what goes in, what comes out

DDPG is actor-critic for continuous control. The crucial change from a stochastic policy: the actor is deterministic, so it outputs one action, not a distribution. Like DQN, the critic estimates an action-value Q(s,a); unlike DQN, which emits one Q-value per discrete action, it takes the action as an input alongside the state.

Actor — μθ(s)

deterministic policy
state s  →  policy net θ  →  a = μ(s)
INPUT: state s (the observation vector)
OUTPUT: one deterministic action a = μ(s) (e.g. a tanh-scaled torque). No distribution; exploration noise is added on top at act-time.

Critic — Qφ(s,a)

action-value, DQN-style
state s + action a  →  Q-net φ  →  Q(s,a)
INPUT: state s and action a, concatenated
OUTPUT: one scalar Q(s,a). Its gradient ∇aQ is what pushes the actor (chain rule) toward better actions.
Deterministic policy gradient
networks:        μθ(s) → a          // deterministic actor
                 Qφ(s, a) → ℝ       // critic
critic loss:     LQ = E[ ( r + γ · Qφ′(s′, μθ′(s′)) − Qφ(s, a) )² ]      // DQN-style
actor gradient:  ∇θ J = E[ ∇θ μθ(s) · ∇a Qφ(s, a) |a=μθ(s) ]             // chain rule!
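To see the chain rule at work, here is a toy sketch under loud assumptions: a one-parameter actor μθ(s) = θ·s and a hand-written, fixed critic Q(s,a) = −(a − 2s)², so the best action is a = 2s and the actor should learn θ = 2. Nothing here is learned except θ; the critic, learning rate and batch size are hypothetical.

```python
import random

def dq_da(s, a):
    # Hypothetical critic Q(s, a) = -(a - 2s)^2, so dQ/da = -2 (a - 2s)
    return -2.0 * (a - 2.0 * s)

def dpg_step(theta, states, lr=0.1):
    # Actor mu_theta(s) = theta * s, so d mu / d theta = s.
    # Deterministic policy gradient: mean over the batch of (d mu/d theta) * (dQ/da at a = mu(s)).
    grad = sum(s * dq_da(s, theta * s) for s in states) / len(states)
    return theta + lr * grad  # gradient ascent on Q

random.seed(0)
theta = 0.0
for _ in range(200):
    batch = [random.uniform(-1.0, 1.0) for _ in range(32)]
    theta = dpg_step(theta, batch)
```

The actor never sees a reward directly: it only receives the critic's slope with respect to the action, multiplied by how its own parameter moves the action.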
DDPG's three methods

Because the policy is deterministic and the updates use bootstrapped Q-targets, the data doesn't have to come from the current policy — so DDPG can borrow three off-policy stabilisers. Tap each card for how it works.

Replay buffer
Reuse old transitions
Keep a giant buffer of past (s, a, r, s′) and sample random mini-batches — off-policy, so every transition trains the nets many times.
τ·θ
Target networks
Slow-moving copies
Slowly-updated copies of both actor and critic give a stable regression target. Polyak averaging, not hard copies — τ ≈ 0.005 per step.
𝒩(0,σ)
Exploration noise
a = μθ(s) + 𝒩(0,σ)
A deterministic policy has no built-in randomness, so DDPG adds external noise (Ornstein-Uhlenbeck or Gaussian) — during training only; at eval, use the deterministic action.
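The three cards above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not a reference implementation: parameters are flat lists of floats, the class and function names are hypothetical, and clipping the noisy action to the valid range is an extra safeguard that the text above does not mention.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)  # oldest transitions fall out first

    def add(self, s, a, r, s_next, done):
        self.data.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform random minibatch, without replacement
        return random.sample(list(self.data), batch_size)

def polyak_update(target, online, tau=0.005):
    # target <- tau * online + (1 - tau) * target, element by element
    return [tau * w + (1.0 - tau) * wt for w, wt in zip(online, target)]

def noisy_action(mu, sigma, low, high):
    # a = mu(s) + N(0, sigma), then clipped to the action range (assumption)
    return max(low, min(high, mu + random.gauss(0.0, sigma)))

buf = ReplayBuffer(capacity=3)
for t in range(5):
    buf.add(t, 0.0, 1.0, t + 1, False)
batch = buf.sample(2)
target = polyak_update([0.0, 0.0], [1.0, 2.0], tau=0.005)
```

With τ = 0.005 the target moves only half a percent of the way toward the online weights per step, which is what makes the regression target slow-moving.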
DDPG — full pseudocode
Pseudocode · off-policy, one update per environment step
# ── INITIALISE ──────────────────────────────────────────
initialize actor μθ and critic Q_φ              # random θ, φ
θ′ ← θ ; φ′ ← φ                                 # target nets = exact copies
replay buffer 𝓓 ← ∅

for each environment step:
    # ── ACT with exploration noise & STORE ─────────────
    a ← μθ(s) + 𝒩(0, σ)                         # ▲ new in DDPG: deterministic action + Gaussian exploration noise
    s′, r, done ← env.step(a)
    𝓓.append( (s, a, r, s′, done) ) ; s ← s′

    # ── LEARN from a random minibatch ──────────────────
    sample B = {(s, a, r, s′)} ~ 𝓓
    y ← r + γ · Q_φ′(s′, μθ′(s′))               # bootstrapped target (uses target nets)
    L_Q ← ( Q_φ(s,a) − y )²                     # critic regression (↓)
    φ ← Adam(∇L_Q)

    # actor: push the action toward higher Q (chain rule)
    L_μ ← −Q_φ(s, μθ(s))                        # ▲ new in DDPG: deterministic policy gradient ∇θμ·∇aQ (minimising −Q maximises Q)
    θ ← Adam(∇L_μ)

    # ── Polyak soft-update of BOTH target nets ─────────
    φ′ ← τ·φ + (1−τ)·φ′ ; θ′ ← τ·θ + (1−τ)·θ′   # τ ≈ 0.005
nets used: μθ, Q_φ + targets (forward)
θ,φ updated: Adam · Polyak targets
▲ new in DDPG: deterministic actor trained through the critic (chain rule) + replay/target reuse

The whole loop runs off-policy: every environment step appends one transition to the replay buffer and does one update sampled from the entire history. The critic regresses to a bootstrapped target built from the target networks (the slow copies θ′, φ′), and the actor simply walks uphill on Q_φ(s, μθ(s)) — the chain rule turns ∇aQ into a gradient on θ. The fragile part is the 𝒩(0,σ) exploration noise bolted onto the action — exactly what SAC replaces with built-in entropy.

Watch DDPG learn — continuous-control tasks

Continuous actions are everywhere: a spaceship's thrusters, a steering wheel + throttle, a push force, a joint torque. Pick a task in the tabs below and an episode budget, and watch the deterministic actor μ(s) go from flailing to fluent — the same algorithm, just a different continuous output.

DDPG's two networks, live this step: the actor μ(s) maps the state → the action; the critic scores the pair as Q(s,a), so its input is the state and the action. Inputs are colour-coded — blue = state s, gold = action a. Use ⏭ Step to advance one step and read the numbers off.

TD3 — Twin Delayed DDPG (Fujimoto et al., 2018)

DDPG is powerful but brittle: its single critic systematically over-estimates Q-values, the actor then exploits those bogus peaks, and training collapses. TD3 keeps DDPG's deterministic-actor backbone and fixes it with three small, surgical tricks — it's "DDPG done right".

TD3's networks — what goes in, what comes out

TD3 keeps DDPG's deterministic actor but doubles the critic: two identical Q-nets (plus target copies of all three). The actor's input/output are unchanged from DDPG — the fix lives entirely in the critics, whose min kills the overestimation bias.

Actor — μθ(s)

deterministic policy (same as DDPG)
state s  →  policy net θ  →  a = μ(s)
INPUT: state s (the observation vector)
OUTPUT: one deterministic action a = μ(s). Updated only every d steps (delayed); at the target, clipped noise is added for smoothing.

Twin Critics — Qφ1, Qφ2(s,a) ×2

two Q-nets; target uses their min
state s + action a  →  Q-net φ  →  Q(s,a)
INPUT: state s and action a, concatenated
OUTPUT: one scalar Q(s,a) per net. The TD target takes min(Q1, Q2) to under-shoot and cancel optimism bias.
The TD3 target & updates
target action:  ã = μθ′(s′) + clip(ε, −c, +c),  ε ~ 𝒩(0, σ)        // ① policy smoothing
target value:   y = r + γ · min( Qφ1′(s′, ã), Qφ2′(s′, ã) )        // ② clipped double-Q
critics:        minimise ( Qφi(s, a) − y )²  for i = 1, 2
actor:          θ ← θ + α ∇θ Qφ1(s, μθ(s))  every d steps          // ③ delayed update
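A minimal sketch of tricks ① and ② for a single transition, assuming a scalar action and passing the target networks in as plain callables. The function name, the defaults σ = 0.2 and c = 0.5, the (1 − done) mask and the final clip to the action range are assumptions that are not in the formulas above.

```python
import random

def td3_target(r, s_next, done, actor_targ, q1_targ, q2_targ,
               gamma=0.99, sigma=0.2, c=0.5, a_low=-1.0, a_high=1.0):
    # (1) target policy smoothing: clipped Gaussian noise on the target action
    eps = max(-c, min(c, random.gauss(0.0, sigma)))
    a_tilde = max(a_low, min(a_high, actor_targ(s_next) + eps))
    # (2) clipped double-Q: bootstrap from the smaller of the two target critics
    q_min = min(q1_targ(s_next, a_tilde), q2_targ(s_next, a_tilde))
    # Assumption: no bootstrapping past a terminal state
    return r + gamma * (1.0 - float(done)) * q_min

# Hypothetical stand-ins for the three target networks
actor_targ = lambda s: 0.5 * s
q1_targ = lambda s, a: 1.0 + a
q2_targ = lambda s, a: 2.0 + a

y = td3_target(1.0, 1.0, False, actor_targ, q1_targ, q2_targ, sigma=0.0)
y_terminal = td3_target(1.0, 1.0, True, actor_targ, q1_targ, q2_targ)
```

Setting sigma=0.0 in the first call switches the smoothing noise off so the arithmetic can be checked by hand: the target action is 0.5, the critics say 1.5 and 2.5, and the smaller one is used.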
Three tricks, one stable algorithm
𝒩 → ±c
① Target smoothing
Clipped noise on the target action
Add Gaussian noise, clipped to ±c, to the target actor's action when building the TD target. Averaging Q over nearby actions stops the critic from latching onto a sharp, spurious peak. This is separate from the exploration noise added when acting.
min(Q₁,Q₂)
② Twin critics
Clipped double-Q
Two critics; the TD target takes the smaller one. Under-shooting on purpose cancels DDPG's optimism bias so the actor can't chase phantom value.
every d
③ Delayed updates
Let the value settle
Update actor & targets only every d ≈ 2 critic steps. TD3 keeps 6 nets in total: actor + twin critics + a target copy of each (3 live + 3 targets).
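Why taking a min helps can be checked with a toy simulation. This is not TD3 itself, just an illustration under assumptions: every action has a true value of 0, each critic sees those values through independent Gaussian noise, and the action is picked greedily with respect to critic 1. The function name and settings are hypothetical.

```python
import random

def overestimation_demo(num_actions=10, trials=5000, noise=1.0, seed=0):
    rng = random.Random(seed)
    single, twin = 0.0, 0.0
    for _ in range(trials):
        # True Q is 0 everywhere; each critic adds its own independent error.
        q1 = [rng.gauss(0.0, noise) for _ in range(num_actions)]
        q2 = [rng.gauss(0.0, noise) for _ in range(num_actions)]
        best = max(range(num_actions), key=lambda a: q1[a])  # greedy w.r.t. critic 1
        single += q1[best]               # one critic rates its own favourite
        twin += min(q1[best], q2[best])  # clipped double-Q rating of the same action
    return single / trials, twin / trials

single_bias, twin_bias = overestimation_demo()
```

The single-critic estimate comes out well above the true value of 0 because the greedy choice selects for positive noise, while the twin-min estimate sits far lower: the second critic's error is independent of the choice, so it does not share the inflated peak.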
TD3 — full pseudocode
Pseudocode · DDPG + the three tricks (① smoothing ② twin-min ③ delay)
# ── INITIALISE ──────────────────────────────────────────
initialize actor μθ, critics Q_φ1, Q_φ2         # random θ, φ1, φ2
θ′ ← θ ; φ1′ ← φ1 ; φ2′ ← φ2                    # target nets = exact copies
replay buffer 𝓓 ← ∅ ; t ← 0

for each environment step:
    a ← μθ(s) + 𝒩(0, σ)                         # explore
    s′, r, done ← env.step(a)
    𝓓.append( (s, a, r, s′, done) ) ; s ← s′ ; t ← t + 1
    sample B = {(s, a, r, s′)} ~ 𝓓

    # ① target policy smoothing: clipped noise on target action
    ã ← μθ′(s′) + clip( 𝒩(0, σ̃), −c, +c )        # ▲ trick ①

    # ② clipped double-Q: take the MIN of the two target critics
    y ← r + γ · min( Q_φ1′(s′,ã), Q_φ2′(s′,ã) )  # ▲ trick ②
    L_Qi ← ( Q_φi(s,a) − y )²  for i=1,2        # update BOTH critics EVERY step (↓)
    φ1, φ2 ← Adam(∇L_Qi)

    # ③ delayed updates: actor & targets only every d steps
    if t mod d == 0:                            # ▲ trick ③ (delay)
        L_μ ← −Q_φ1(s, μθ(s))                   # actor follows critic 1 only
        θ ← Adam(∇L_μ)
        φ1′ ← τ·φ1 + (1−τ)·φ1′                  # Polyak soft-update
        φ2′ ← τ·φ2 + (1−τ)·φ2′
        θ′ ← τ·θ + (1−τ)·θ′
nets used: μθ, Q_φ1, Q_φ2 + targets
θ,φ updated: Adam · Polyak
▲ new in TD3: ① target smoothing ② clipped double-Q (min) ③ delayed actor/target updates

Line for line it is DDPG — with three surgical additions. The target action gets clipped noise so the critic can't latch onto a sharp spurious peak. The TD target uses min(Q1,Q2), deliberately under-shooting to cancel DDPG's optimism. The actor and all target nets update only every d (≈2) critic steps, letting the value estimate settle before the policy chases it. Everything else — replay buffer, bootstrapped targets, Polyak averaging — is inherited unchanged.

SAC — Soft Actor-Critic (Haarnoja et al., 2018)

DDPG works but is famously unstable — a bad seed can ruin a run. The diagnosis: exploration is bolted on as external noise, fighting the deterministic policy. SAC takes a different approach: build exploration into the objective itself.

The maximum-entropy RL objective
classic:  J(π) = E[ Σₜ γᵗ · rₜ ]
max-ent:  J(π) = E[ Σₜ γᵗ · ( rₜ + α · H[π(·|sₜ)] ) ]      // entropy bonus, every step
"act as well as possible, and remain as random as possible while doing it"
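A small worked example of the max-ent return, assuming a 1-D Gaussian policy without tanh squashing so that the entropy has the closed form 0.5·ln(2πe·σ²). The function names, α = 0.2 and the toy rewards are hypothetical.

```python
import math

def gaussian_entropy(sigma):
    # Differential entropy of a 1-D Gaussian: 0.5 * ln(2 * pi * e * sigma^2)
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

def soft_return(rewards, sigmas, gamma=0.99, alpha=0.2):
    # sum_t gamma^t * (r_t + alpha * H[pi(.|s_t)])
    return sum(gamma ** t * (r + alpha * gaussian_entropy(sg))
               for t, (r, sg) in enumerate(zip(rewards, sigmas)))

rewards = [1.0, 1.0, 1.0]
wide = soft_return(rewards, sigmas=[1.0, 1.0, 1.0])    # stays random
narrow = soft_return(rewards, sigmas=[0.1, 0.1, 0.1])  # nearly deterministic
classic = soft_return(rewards, sigmas=[1.0, 1.0, 1.0], alpha=0.0)
```

Both policies collect the same rewards, but the wider one earns a higher max-ent return; with α = 0 the entropy term vanishes and the classic discounted return is recovered.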
SAC's key ideas
α·H[π]
Entropy bonus
Exploration, by design
Objective = return + α·H[π], added at every step. Paying the agent for being random (high entropy) keeps it exploring and stops σ collapsing to 0: exploration is built into the objective, not bolted on.
2 × Q
Twin critics
Take the min
Learn two Q-networks and use min(Q₁, Q₂) as the target. The smaller estimate under-shoots, killing the overestimation bias (borrowed from TD3).
α*
Auto-tuned α
Learn the temperature
α isn't a hand-set knob: it's a learned parameter, gradient-descended to hold the policy's entropy at a target Htarget = −|A|, where |A| is the number of action dimensions. High early, low late.
μ,σ → a
Actor output → action
Squashed Gaussian
The actor outputs a mean μ(s) and std σ(s). Sample noise ε ∼ 𝒩(0,1), form n = μ + σ·ε, then squash: a = tanh(n) — into the valid action range. (Reparam trick lets ∇ flow through the sample.)
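The squashed-Gaussian card can be sketched for a single action dimension as below. The log-probability needs a change-of-variables correction for the tanh, log π(a|s) = log 𝒩(n; μ, σ) − log(1 − tanh²(n)); the 1e-6 inside the log is a numerical guard and, like the function name, an assumption of this sketch.

```python
import math
import random

def sample_squashed_gaussian(mu, sigma, rng=random):
    eps = rng.gauss(0.0, 1.0)  # noise drawn independently of mu and sigma
    n = mu + sigma * eps       # reparameterised pre-squash sample
    a = math.tanh(n)           # squash into (-1, 1)
    # log N(n; mu, sigma), using (n - mu) / sigma == eps
    log_prob = -0.5 * eps ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
    # tanh correction: subtract log |da/dn| = log(1 - tanh(n)^2)
    log_prob -= math.log(1.0 - a ** 2 + 1e-6)
    return a, log_prob

action, logp = sample_squashed_gaussian(mu=0.3, sigma=0.5, rng=random.Random(0))
```

Because the randomness lives in eps rather than in μ or σ, an autodiff framework can differentiate the action and its log-probability with respect to the actor's outputs, which is what the reparameterisation trick buys.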
SAC's networks — what goes in, what comes out

SAC trains three learnable networks: one stochastic actor and two identical critics (plus slow-moving target copies of the critics). Note the key difference from PPO: the SAC critic takes both the state and the action as input — it scores a specific (s, a) pair, not just a state.

Actor — πθ(a|s)

stochastic Gaussian + tanh policy
state s  →  policy net θ  →  μ(s), σ(s)
INPUT: state s (the observation vector)
OUTPUT: a Gaussian's mean μ(s) and std σ(s); the action is the reparametrised, squashed sample a = tanh(μ + σ·ε), ε ∼ 𝒩(0,1)

Twin Critics — Qφ1, Qφ2(s,a) ×2

two identical Q-nets; target uses their min
state s + action a  →  Q-net φ  →  Q(s,a)
INPUT: state s and action a, concatenated; it scores a specific pair
OUTPUT: one scalar Q(s,a), the soft action-value. Two nets run in parallel; SAC trains on min(Q1, Q2) to fight overestimation.
Network architectures — SAC runs in discrete and continuous

SAC works in both the continuous-action and the discrete-action setting — the body of the nets is the same MLP, only the actor/critic heads change.

In the discrete-action setting

  • The actor takes a state and returns probabilities over actions  p(ai|s).
  • The critic takes a state and returns one Q-value per action  Q(s, ai).

In the continuous-action setting

  • The critic takes a state and an action vector → a scalar Q-value  Qφ(s, a).
  • The actor needs a distribution parameterisation. SAC uses a squashed Gaussian:  a = tanh(n),  n ∼ 𝒩(μθ, σθ²).
  • So the actor returns μθ and σθ.
Updates — the whole loop in one box
SAC inner loop
sample (s, a, r, s′) from replay buffer
critic target:  y = r + γ · ( min( Q1tg(s′, ã), Q2tg(s′, ã) ) − α · log π(ã|s′) )    where ã ~ πθ(·|s′)
critic loss:    LQi = E[ ( y − Qi(s, a) )² ]
actor loss:     Lπ = E[ α · log π(a|s) − min(Q1, Q2)(s, a) ]    with a = tanh(μ + σ·ε), ε ∼ 𝒩(0,1)
α loss:         Lα = E[ −α · ( log π(a|s) + Htarget ) ]         // auto-tune temperature
targets:        φi,tg ← τ φi + (1−τ) φi,tg                      // target critics only
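The critic target in that box is easy to verify by hand for one transition. A minimal sketch, assuming the two target-critic values and the log-probability of the freshly sampled next action are already computed; the function name, the (1 − done) mask and the numbers are hypothetical.

```python
def sac_critic_target(r, done, q1_targ, q2_targ, logp_next, gamma=0.99, alpha=0.2):
    # Soft value of the next state: twin-min minus alpha * log pi(a'|s')
    soft_v_next = min(q1_targ, q2_targ) - alpha * logp_next
    # Assumption: no bootstrapping past a terminal state
    return r + gamma * (1.0 - float(done)) * soft_v_next

y = sac_critic_target(r=1.0, done=False, q1_targ=5.0, q2_targ=4.0, logp_next=-1.5)
y_terminal = sac_critic_target(r=1.0, done=True, q1_targ=5.0, q2_targ=4.0, logp_next=-1.5)
```

Since log π is negative for an unlikely action, subtracting α·log π adds value: the target rewards next states in which the policy is still spread out.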
SAC — full pseudocode
Pseudocode · off-policy, one update per environment step
# ── INITIALISE ──────────────────────────────────────────
initialize actor πθ, critics Q_φ1, Q_φ2         # random θ, φ1, φ2
φ1tg ← φ1 ; φ2tg ← φ2                           # target critics = exact copies
initialize temperature α                        # learnable; H_target = −|A|
replay buffer 𝓓 ← ∅

for each environment step:
    # ── ACT & STORE (off-policy) ───────────────────────
    a ~ πθ(·|s)                                 # a = tanh(μ + σ·ε), ε ∼ 𝒩(0,1)
    s′, r, done ← env.step(a)
    𝓓.append( (s, a, r, s′, done) ) ; s ← s′

    # ── LEARN from a random minibatch ──────────────────
    sample B = {(s, a, r, s′)} ~ 𝓓
    ã′ ~ πθ(·|s′) ; logπ′ ← log πθ(ã′|s′)

    # critic target: twin-min plus the entropy bonus (−α·logπ′)
    y ← r + γ·( min(Q_φ1tg(s′,ã′), Q_φ2tg(s′,ã′)) − α·logπ′ )   # ▲ entropy term
    L_Qi ← ( Q_φi(s,a) − y )²  for i=1,2        # regress BOTH critics (↓)

    # actor: maximise Q while staying random
    ã ~ πθ(·|s) ; logπ ← log πθ(ã|s)
    L_π ← α·logπ − min(Q_φ1(s,ã), Q_φ2(s,ã))    # ▲ entropy-regularised actor

    # temperature: hold entropy at the target
    L_α ← −α·( logπ + H_target )                # ▲ new in SAC: auto-tuned α

    φi ← Adam(∇L_Qi) ; θ ← Adam(∇L_π) ; α ← Adam(∇L_α)
    φitg ← τ·φi + (1−τ)·φitg                    # Polyak soft update, τ ≈ 0.005
nets used: πθ, Q_φ1, Q_φ2 + targets
updated: θ, φ1, φ2, α · Polyak
▲ new in SAC: max-entropy objective (α·log π) + auto-tuned temperature α + stochastic reparam actor

Unlike PPO's collect-then-optimise rhythm, SAC interleaves one gradient update per environment step and learns from a giant replay buffer of old transitions — that off-policy reuse is why it is so sample-efficient. Each step updates four things: both critics regress to the entropy-augmented target y, the actor climbs min(Q₁,Q₂) while keeping its entropy high, and the temperature α self-tunes to hold that entropy at Htarget. The target critics trail behind by Polyak averaging.
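How the temperature "self-tunes" is easiest to see in one gradient step on Lα. The sketch below assumes the optimiser works on log α (a common way to keep α positive, not something this page specifies) and uses plain gradient descent where the pseudocode uses Adam; the function name, learning rate and numbers are hypothetical.

```python
import math

def alpha_step(log_alpha, logp_batch, target_entropy, lr=0.01):
    alpha = math.exp(log_alpha)
    mean_term = sum(lp + target_entropy for lp in logp_batch) / len(logp_batch)
    # L_alpha = -alpha * mean(log pi + H_target)
    # d L_alpha / d log_alpha = -alpha * mean(log pi + H_target)
    grad = -alpha * mean_term
    return log_alpha - lr * grad

# Policy too deterministic: entropy estimate -log pi = -2 is below the target -1
too_sharp = alpha_step(0.0, [2.0, 2.0], target_entropy=-1.0)
# Policy too random: entropy estimate 3 is above the target -1
too_random = alpha_step(0.0, [-3.0], target_entropy=-1.0)
```

When the policy is sharper than the target allows, α rises and the entropy bonus gets more weight; when it is more random than needed, α falls and the critic's value dominates.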

SAC in action — watch entropy & α fall as it learns

The same continuous-control tasks, now under SAC. Drag the episodes slider: early on the policy is very random (high entropy H[π], high temperature α) — late in training the auto-tuned α drops and the policy concentrates on good actions, without ever fully collapsing. The two meters track the slider in real time.

Auto-tuned α steers the policy's entropy toward the target Htarget = −|A|; entropy and α both fall as the episode count climbs (exploration → exploitation).

SAC's networks live this step — the stochastic actor π(a|s) outputs μ & σ (a squashed Gaussian, a = tanh(μ+σ·ε)); the twin critics Q₁,Q₂(s,a) score it (target = their min). blue = state, gold = action. σ shrinks as the episode slider climbs. Use ⏭ Step.

DDPG vs TD3 vs SAC at a glance
Property | DDPG | TD3 | SAC
Policy | Deterministic μ(s) | Deterministic μ(s) | Stochastic Gaussian + tanh
Exploration | External noise (fragile) | External noise + target smoothing | Built into objective
Critics | 1 Q-network | 2 (clipped min) | 2 (clipped min)
Overestimation fix | - | Twin-min + target smoothing | Twin-min + entropy
Actor update cadence | Every step | Delayed (every d steps) | Every step
Temperature | - | - | Auto-tuned α
Stability across seeds | Notoriously brittle | Much improved | Robust
Practical default for continuous control | Historical | Strong (deterministic) | Current default
The family tree — how every algorithm descends from one idea

Two trunks grow from a single root. The value-based line (from the DQN lab) learns Q(s,a) and acts greedily; the policy-gradient line (this lab) learns π(a|s) directly. They meet in the middle for continuous control, where DDPG borrows DQN's replay/target tricks and bolts them onto an actor — then TD3 and SAC harden it. Solid arrows = direct successor; dashed = an idea borrowed across families; gold = today's go-to default.

[Figure: family tree of the algorithms]
Value-based · DQN lab: Q-learning (1989, TD control), DQN (2013, value + replay), Double DQN (2015, decouple max), PER (2015, smart replay), Dueling DDQN (2016, V + A streams)
Policy gradient · this lab: REINFORCE (1992, Monte-Carlo PG), + Baseline (~1995, subtract V(s)), AC · A2C (2016, V-Network), A3C (2016, async workers), PPO (2017, clip · default)
Continuous control · this lab: DDPG (2015, deterministic μ(s)), TD3 (2018, twin + delay), SAC (2018, max-entropy)
Edge labels: replay · target · Q(s,a); actor-critic; double-Q; stochastic + entropy
Legend: solid arrow = direct successor; dashed arrow = borrowed idea; gold = current default / SOTA
All ten algorithms — side by side
Algorithm | Year | On / off-policy | Actor | Critic | Action space | Key idea
DQN | 2013 | Off | - | Q(s,a) | Discrete | Deep Q-learning + replay
REINFORCE | 1992 | On | π(a|s) | - | Discrete | Monte-Carlo policy gradient
REINFORCE+baseline | ~1995 | On | π(a|s) | V(s) | Discrete | Variance reduction by V baseline
A2C / Actor-Critic | 2016 | On | π | V | Discrete | n-step advantage, parallel envs
A3C | 2016 | On | π | V | Discrete | Async workers, lock-free
PPO | 2017 | On | π | V | Both | Clipped ratio, multi-epoch reuse
DDPG | 2015 | Off | μ(s) deterministic | Q(s,a) | Continuous | Chain-rule policy gradient
TD3 | 2018 | Off | μ(s) deterministic | 2 × Q | Continuous | Twin Q + delayed updates + target smoothing
SAC | 2018 | Off | Stochastic π | 2 × Q | Continuous | Max-entropy + twin Q + auto α
Playground — same policy, two environments

Pick an algorithm and an environment. The policy used is hand-coded to mimic what that algorithm would actually learn (random → competent → optimal). Both envs are continuous-action: Lunar Lander uses a 2-D Gaussian (main + side thrust), Pendulum uses a 1-D Gaussian (torque). Every algorithm here can handle both — that's the whole point of policy gradients.

Drag the Train ep slider to see what the policy looks like at different stages of training. The earlier the episode, the more random the policy.