The natural sequel to the Deep RL lab. DQN learns a value and acts greedily — but what if the action space is continuous, or the optimal policy is stochastic? Meet REINFORCE, A2C, A3C, PPO, DDPG and SAC.
The Deep RL lab ended at DDDQN, a Q-network that learns Q(s,a) for every action and picks argmax_a Q(s,a). That works beautifully on Atari (18 discrete buttons) and CartPole (push left / push right). But the moment we touch a robot arm, a self-driving car, or anything with a steering wheel and a throttle, DQN breaks.
The trouble is the argmax. With a discrete action set you can enumerate every action and pick the largest Q. With a continuous action — a torque in [-2, +2] Newton-metres, or a steering angle in [-30°, +30°] — there are infinitely many actions. You cannot enumerate them.
Suppose the agent can't see its absolute position — it only senses its neighbouring tiles. On this lake the two purple tiles have the exact same view (ice on the left, ice on the right): they are aliased. To the network they are the same state.
A deterministic policy maps that one view to one action. But the goal is to the right of the left tile and to the left of the right tile — opposite directions! Whichever it commits to, one twin skates straight into a hole, doomed with 100% certainty.
A stochastic policy goes left/right 50/50 from that view, so neither start is doomed — exploration carries it to the goal from both. Two bonuses fall out for free:
Switch domains entirely: you're playing repeated Rock-Paper-Scissors against an opponent who watches your habits and best-responds. Any deterministic policy — “always ✊” — is instantly exploited: the opponent just plays ✋ forever and you lose every round.
The only unexploitable play is the mixed policy ⅓ ✊, ⅓ ✋, ⅓ ✌ — the game's Nash equilibrium. There is no single best action to argmax; the optimum is a distribution. A value-greedy DQN literally cannot represent that — a stochastic policy gets it for free.
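A minimal sketch of that claim, assuming the usual +1 / 0 / −1 scoring for win / draw / loss (the function names are made up for illustration). It computes what each policy earns per round against an opponent who best-responds to it:

```python
MOVES = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def payoff(ours, theirs):
    """+1 if our move wins, 0 on a draw, -1 if it loses."""
    if ours == theirs:
        return 0
    return 1 if BEATS[ours] == theirs else -1

def worst_case_value(policy):
    """Expected payoff per round when the opponent picks the reply that hurts us most."""
    return min(
        sum(prob * payoff(ours, theirs) for ours, prob in policy.items())
        for theirs in MOVES
    )

always_rock = {"rock": 1.0, "paper": 0.0, "scissors": 0.0}
uniform = {move: 1 / 3 for move in MOVES}

print(worst_case_value(always_rock))  # -1.0: the opponent plays paper forever
print(worst_case_value(uniform))      # 0.0 (up to floating point): nothing to exploit
```

Any policy that leans towards one move has a negative worst case; only the uniform mix reaches zero.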
Value methods fit the whole Q-surface and then act greedily — a lot of machinery if all you really want is to act well. Policy gradient differentiates the expected return through the policy and simply steps: θ ← θ + α ∇θ J(θ).
The heat-map is J(θ) over two policy parameters — brighter is higher return. The ball follows the gradient straight uphill to a peak. The honest catch (it's in the table just below): it converges to a local optimum, not always the global one. Hit new start θ a few times and watch it sometimes settle on the smaller hill.
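The same behaviour can be reproduced on a toy surface. The sketch below assumes a made-up J(θ) with a tall hill and a smaller one (it is not the surface behind the heat-map) and uses a finite-difference gradient in place of a real policy-gradient estimate:

```python
import math

def J(theta):
    """Toy return surface: a tall hill at (2, 2) and a smaller one at (-2, -1)."""
    x, y = theta
    tall = 1.0 * math.exp(-((x - 2) ** 2 + (y - 2) ** 2))
    small = 0.6 * math.exp(-((x + 2) ** 2 + (y + 1) ** 2))
    return tall + small

def grad_J(theta, eps=1e-5):
    """Central finite-difference estimate of the gradient of J."""
    grads = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += eps
        down[i] -= eps
        grads.append((J(up) - J(down)) / (2 * eps))
    return grads

def ascend(theta, alpha=0.2, steps=500):
    """Plain gradient ascent: theta <- theta + alpha * grad J(theta)."""
    for _ in range(steps):
        theta = [t + alpha * g for t, g in zip(theta, grad_J(theta))]
    return theta

print(ascend([1.0, 1.0]))    # settles near the tall hill at (2, 2)
print(ascend([-1.0, -0.5]))  # settles near the smaller hill at (-2, -1)
```

Both runs follow the same rule; only the starting θ decides which peak they reach.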
DQN acts by argmax_a Q(s,a): score every action, take the largest bar. Fine for 18 Atari buttons. But the front-wheel angle of a car lives in [−30°, +30°], an infinite set. There is nothing finite to loop over.
Discretising into bins is the usual hack, and you can drag the slider to feel the trap: coarse bins miss the true optimum (red ring), while fine bins find it but explode, because k bins over d action dimensions means kᵈ outputs. A policy instead emits the parameters of a distribution (a Gaussian's μ, σ) and samples the action from it directly, with no search at all.
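To put numbers on the trap, here is a small sketch. The quadratic `q` with its optimum at 7.3° and the choice of 11 bins are assumptions picked purely for illustration:

```python
def discretised_outputs(bins_per_dim, action_dims):
    """Number of Q-outputs needed when every action dimension is cut into bins."""
    return bins_per_dim ** action_dims

def best_grid_action(k, q, low=-30.0, high=30.0):
    """Greedy action when only k evenly spaced actions in [low, high] are available."""
    actions = [low + (high - low) * i / (k - 1) for i in range(k)]
    return max(actions, key=q)

def q(angle):
    """Hypothetical Q-curve over the steering angle, peaking at 7.3 degrees."""
    return -(angle - 7.3) ** 2

for dims in (1, 2, 7):
    print(dims, discretised_outputs(11, dims))  # 11, 121, 19487171

print(best_grid_action(5, q))    # 0.0: the coarse grid misses the peak by 7.3 degrees
print(best_grid_action(601, q))  # about 7.3, at the price of 601 outputs per dimension
```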
No bins, no argmax. The network keeps the same body but ends in a single output neuron; a tanh squashes it to [−1, 1], then we scale by 30 to get the front-wheel angle in [−30°, +30°]. The continuous action is read straight off the head — and because it's a smooth function of θ, we can push gradients through it.
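The head itself is tiny. A sketch of just the last step, assuming the network body has already produced a single pre-activation value (the function names are illustrative):

```python
import math

MAX_ANGLE_DEG = 30.0

def steering_angle(pre_activation):
    """Squash the raw output to [-1, 1] with tanh, then scale to degrees."""
    return MAX_ANGLE_DEG * math.tanh(pre_activation)

def angle_gradient(pre_activation):
    """d(angle)/d(pre_activation): smooth everywhere, so gradients can flow back."""
    return MAX_ANGLE_DEG * (1.0 - math.tanh(pre_activation) ** 2)

print(steering_angle(0.0))   # 0.0: wheels straight
print(steering_angle(50.0))  # saturates at the +30 degree limit
print(angle_gradient(0.0))   # 30.0: steepest response around the centre
```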
| Property | Value-based (DQN family) | Policy-based (this lab) |
|---|---|---|
| Learns | Q(s,a) | π(a\|s) |
| Picks action via | argmax over Q | Sample from π |
| Discrete actions? | ✅ Native | ✅ Categorical π |
| Continuous actions? | ❌ argmax intractable | ✅ Gaussian / Tanh π |
| Stochastic policies? | ❌ Always greedy | ✅ Built-in |
Policy Gradient (PG) is the whole family this lab is built on, resting on one idea: raise the log-probability of actions that paid off, lower the ones that didn't. Let's first pin down the general estimator and the training loop every method here inherits — then build its simplest concrete instance, REINFORCE, and its variance-reducing upgrade, the baseline, further down this page.
Instead of fitting a value table, we parameterise the policy directly as πθ(a|s) — a neural network whose output is a probability over actions (or, for continuous control, the parameters of a distribution). We then compute the gradient of the expected return with respect to θ and walk uphill.
The magic is the log-likelihood trick: ∇ E[R] = E[∇ log π · R]. The right-hand side is a sample-able expectation — we just roll out the policy, multiply each log-prob by the return that followed, and that's an unbiased gradient estimate. No bootstrapping, no targets, no replay buffer.
We can't differentiate J(θ) directly, because θ sits inside the distribution we're averaging over, not inside the thing being averaged. The fix is one line of calculus — the log-derivative (score-function) trick:
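One standard way to write that step out, as a sketch: the symbol p_θ(τ) for the probability of a whole trajectory τ under the policy is notation assumed here.

```latex
\begin{align*}
J(\theta) &= \mathbb{E}_{\tau \sim p_\theta}\big[R(\tau)\big]
           = \int p_\theta(\tau)\, R(\tau)\, d\tau \\
\nabla_\theta J(\theta)
          &= \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau
           = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau \\
\log p_\theta(\tau)
          &= \log p(s_0) + \sum_{t=0}^{T} \Big[ \log \pi_\theta(a_t \mid s_t)
             + \log p(s_{t+1} \mid s_t, a_t) \Big] \\
\nabla_\theta J(\theta)
          &= \mathbb{E}_{\tau \sim p_\theta}\Big[ \sum_{t=0}^{T}
             \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \Big]
\end{align*}
```

The second line uses the identity ∇p = p ∇log p. In the third line only the log π terms contain θ, which is why the last line has no initial-state or dynamics term left.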
Two things make this theorem so useful. First, the environment dynamics p(s_{t+1} | s_t, a_t) don't depend on θ, so when we take the log-gradient they drop out completely, and we never need a model of the world. Second, the result is an expectation, so we estimate it by simply rolling out the policy and averaging ∇θ log πθ(a|s) weighted by the return R(τ).
Reading it intuitively: ∇θ log πθ(a|s) points in the direction that makes action a more likely; multiplying by R(τ) means good trajectories push their actions up and bad ones push theirs down. That is the entire idea behind every algorithm in this lab — REINFORCE is just this estimator with R(τ) = the Monte-Carlo return.
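To see the estimator do some work, here is a deliberately small sketch: a softmax policy on a two-armed bandit, where every "trajectory" is a single action and the return is one noisy reward. The bandit, the learning rate and the function names are all assumptions for illustration, not part of the lab.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_bandit(mean_rewards, steps=3000, alpha=0.1, seed=0):
    """Score-function policy gradient on a bandit; returns the final action probabilities."""
    rng = random.Random(seed)
    theta = [0.0] * len(mean_rewards)           # one logit per action
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs)[0]
        ret = mean_rewards[action] + rng.gauss(0.0, 1.0)   # noisy sampled return
        # For a softmax policy, d log pi(action) / d theta_k = 1[k == action] - probs[k].
        for k in range(len(theta)):
            indicator = 1.0 if k == action else 0.0
            theta[k] += alpha * (indicator - probs[k]) * ret
    return softmax(theta)

print(reinforce_bandit([0.0, 1.0]))  # probability mass should move to the better arm
```

No value function, no target and no buffer appear anywhere: the update only needs the log-probability gradient of the action taken and the return that followed.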
REINFORCE is the simplest possible algorithm that learns by walking up the policy-gradient hill. The recipe is just three lines:
📄 Original paper — Williams (1992), Simple statistical gradient-following algorithms for connectionist reinforcement learning.
So what is G_t? It's the return-to-go (a.k.a. remaining return): the total discounted reward collected from step t to the end of the episode. It's the score we hang on the action a_t: "given that I took this action here, how did the rest of the episode actually turn out?"
G_t = r_t + γ r_{t+1} + γ² r_{t+2} + … + γ^{T−t} r_T
Two things to notice. The discount γ ∈ [0, 1) makes reward that arrives sooner count for more than far-future reward. And G_t sums only rewards from t onward: an action is never credited for reward that was already banked before it was taken (that's the "reward-to-go" idea).
In the update, G_t is simply the weight on each action's log-prob gradient: a big positive G_t shoves πθ(a_t|s_t) up, a negative one shoves it down. Because it's the actual sampled return rather than a bootstrapped estimate, it's unbiased, but it swallows every random event for the rest of the episode, which is exactly why it's so noisy (the variance problem below).
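Computing the return-to-go for every step of an episode takes one backward pass over the rewards. A minimal sketch (the function name is illustrative):

```python
def returns_to_go(rewards, gamma=0.99):
    """Discounted return from each step t to the end of the episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # G_t = r_t + gamma * G_{t+1}
        returns[t] = running
    return returns

print(returns_to_go([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

Walking backwards reuses G_{t+1} at every step, so the whole episode costs O(T) instead of re-summing the tail for each t.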
This is a simulated REINFORCE training run on Lunar Lander — return per episode (light) and a 50-episode running mean (bold). Notice the massive episode-to-episode variance: a single unlucky rollout can drop the return by 200. That noise is precisely what the next tab fights.
Here is a beautiful mathematical fact: for any function b(s) that depends only on the state (not the action), subtracting it from the return inside the gradient is free — the expected gradient is unchanged.
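The reason fits in one line. A sketch for a discrete action set (for continuous actions the sum becomes an integral):

```latex
\begin{align*}
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big]
  &= b(s) \sum_{a} \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) \\
  &= b(s) \sum_{a} \nabla_\theta \pi_\theta(a \mid s)
   = b(s)\, \nabla_\theta \sum_{a} \pi_\theta(a \mid s)
   = b(s)\, \nabla_\theta 1 = 0
\end{align*}
```

The step that pulls b(s) out of the sum is exactly where "depends only on the state" is used; a baseline that depended on the action could not be factored out this way.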
So we can subtract anything that's a function of state. The standard choice is the state-value function V(s): it is not exactly the variance-minimising baseline, but it is close to it, easy to learn, and removes most of the noise. That gives us the advantage.
Red = vanilla REINFORCE. Green = REINFORCE-with-baseline. Same final return, same number of episodes — but the green curve is dramatically smoother. That smoothness translates directly into being able to use a larger learning rate, which translates into faster real-world learning.
The policy gradient is the same one REINFORCE used; every Actor-Critic method just changes the weight on ∇θ log π. REINFORCE weights it by the noisy Monte-Carlo return G_t; Actor-Critic swaps that for a bootstrapped estimate built from the critic's value V(s).
Outputs probabilities over actions (or a distribution mean/std for continuous). Updated by the policy-gradient with the critic's advantage.
Outputs a single scalar — the expected return from this state. Updated by TD regression to r + γ V(s').
The actor picks the action. The critic tells the actor whether the action was better or worse than expected. The actor takes a tiny gradient step accordingly. Round trip: ~5 ms. No full episode required.
Actor-Critic stitches the two families together: the actor is a policy-based learner, the critic is a value-based learner, and each one patches the other's weakness.
The actor is the policy πθ(a|s) — it selects actions and is what we ultimately deploy. The critic learns a value function Vφ(s) (or Qφ(s,a)) — it never picks actions; it just critiques the actor's choices with evaluative feedback.
Instead of waiting for the noisy Monte-Carlo return G_t, the actor reads the critic's value V(s), its estimate of how good the current state is, and nudges its policy with that critique instead of a whole episode's return. Meanwhile the critic keeps fitting V(s) more accurately, so the signal the actor learns from gets sharper as training goes on.
That bargain buys three things at once: lower variance (the critic's bootstrapped estimate is far steadier than a whole-episode return), faster learning (we update every step instead of every episode), and a natural fit for continuous action spaces — exactly the limits of purely policy-based or purely value-based methods.
A2C makes one more upgrade: it replaces the bare value V(s) with the advantage A_t = Q(s_t, a_t) − V(s_t). Where V(s) only says "how good is this state on average?", A_t asks the sharper question: "how much better (or worse) was this action than the critic's average expectation for the state?"
That centring is what makes the gradient informative — actions that beat the baseline V(s) get pushed up, actions that fall short get pushed down — while keeping variance low.
The two updates happen every transition. The critic lines fit the value function; the actor line nudges the policy using the advantage A_t, which for one-step Actor-Critic is the TD error r + γV(s′) − V(s).
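Those two updates can be sketched for a single transition. To keep it self-contained this version assumes a tabular critic (a dict of state values) and a softmax actor (a dict of per-state logits); the names and learning rates are illustrative, and a real A2C would use networks and an optimiser instead.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_step(V, logits, s, a, r, s_next, done,
                      gamma=0.99, lr_actor=0.1, lr_critic=0.5):
    """Apply one-step actor-critic updates in place and return the TD error."""
    target = r + (0.0 if done else gamma * V[s_next])
    td_error = target - V[s]          # one-step estimate of the advantage A_t
    V[s] += lr_critic * td_error      # critic: move V(s) towards r + gamma * V(s')
    probs = softmax(logits[s])
    for k in range(len(probs)):       # actor: grad log softmax = 1[k == a] - probs[k]
        indicator = 1.0 if k == a else 0.0
        logits[s][k] += lr_actor * td_error * (indicator - probs[k])
    return td_error

V = {0: 0.0, 1: 1.0}
logits = {0: [0.0, 0.0]}
print(actor_critic_step(V, logits, s=0, a=1, r=1.0, s_next=1, done=False, gamma=0.5))  # 1.5
print(V[0], logits[0])  # V(0) rose to 0.75 and action 1 is now preferred in state 0
```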
A2C's actor is a stochastic policy π(a|s): it outputs a distribution over actions and samples — unlike a value-greedy or deterministic policy a = μ(s) that always picks one fixed action. Why does that matter? When the best behaviour is to be unpredictable, or when many states look identical, a deterministic policy can be read, exploited, or trapped. Each game below pits a deterministic policy against a stochastic one — watch deterministic lose. Representing a distribution over actions is exactly the superpower a policy-gradient method like A2C gives you.
Now you play against the model. Pick the Deterministic opponent — it always throws ✊, so just play ✋ every time and you win every round. A fixed policy is a habit, and habits get exploited.
Switch to the Stochastic opponent (⅓ ✊, ⅓ ✋, ⅓ ✌ — the Nash equilibrium) and try again: there's no pattern to punish, so however you play your score just hovers around zero — unexploitable. The optimum here is a distribution, something a = μ(s) can't represent. (▶ Auto-play me plays the best counter for you.)
A2C is the synchronous descendant of A3C (Mnih et al., 2016). A3C uses the same loss, but each worker runs its own copy of the env on its own CPU thread and pushes gradients to a shared parameter server without waiting for the others. The result feels like SGD on a chaotic mini-batch, and in practice it works.
Two workers pushing gradients at the same time will sometimes overwrite each other. This lock-free style is known as Hogwild! updating. Its convergence guarantee was proved for sparse updates on convex problems; dense neural-network gradients have no such proof, but the scheme works well empirically. In practice the workers are also de-synchronised (they're at different points in different episodes), so their gradients are diverse, which helps reduce correlation.
The big advantage of A3C over single-thread methods at the time was not faster gradients per second — it was that the diversity of trajectories acted like a replay buffer would in DQN, decorrelating updates and removing the need for one.
A3C is built around a single global network holding the shared parameters (θ for the actor, φ for the critic). Around it run many independent workers, each with its own copy of the networks and its own environment instance — so exploration happens in parallel across CPU threads.
Each worker copies the latest global θ, φ into its local nets, then rolls out n steps in its own env.
It computes ∇L locally and fires it straight at the global network through a shared optimiser (RMSProp/Adam), with no lock and no waiting (Hogwild!).
Workers sit at slightly different policy versions, so their experience is diverse; this replaces the replay buffer and stabilises training.
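The copy / compute / push rhythm can be shown without any RL at all. The sketch below is a toy of the lock-free pattern only, not of A3C: each thread minimises a made-up quadratic loss on one shared parameter, and the noise term stands in for each worker seeing different experience.

```python
import random
import threading

TARGET = 3.0  # minimiser of the toy loss 0.5 * (theta - TARGET) ** 2

def run_async_workers(n_workers=4, steps=200, lr=0.05):
    """Run lock-free workers against one shared parameter and return its final value."""
    shared = {"theta": 0.0}                      # stands in for the global network

    def worker(worker_id):
        rng = random.Random(worker_id)           # each worker sees its own noise
        for _ in range(steps):
            local_theta = shared["theta"]        # 1. copy the latest global parameters
            grad = (local_theta - TARGET) + rng.gauss(0.0, 0.1)  # 2. noisy local gradient
            shared["theta"] -= lr * grad         # 3. apply to the global copy, no lock

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return shared["theta"]

print(run_async_workers())  # ends close to TARGET
```

Step 3 is a read-modify-write without a lock, so an update can occasionally be lost or computed from a stale copy. The toy still converges because every gradient, stale or not, points roughly towards the same minimum.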
Adapted from APXML · Asynchronous Advantage Actor-Critic (A3C).
Vanilla policy gradient has a vicious failure mode. If a single update is too large, the policy can shift so far that the next batch of rollouts comes from a totally different distribution — and the gradient estimate from that batch becomes useless. The policy collapses, and you have to start over.
Trust-Region Policy Optimisation (TRPO) solved this with a hard constraint on the KL divergence between old and new policies, enforced through a second-order (natural-gradient style) update. Beautiful, but heavy. PPO is the same idea, but simple enough to fit on a napkin.
Two ideas, that's the whole trick:
PPO keeps A2C's whole backbone — a stochastic actor, a value critic, GAE advantages, parallel envs, the entropy bonus. It changes three things, and those three are what turn A2C into the on-policy default.
PPO is an actor-critic method: one network to act, one to judge. In code they usually share a torso and split into two heads, but conceptually they are two distinct functions — here is the name, input and output of each.
Heads-up — a different clip. PPO clips the probability ratio r = π/πold; a sibling continuous-control method, TD3, clips Gaussian noise instead — same word, a very different job. Here is that other clip, up close. Target policy smoothing replaces the single target action with a small noisy neighbourhood around it: ã = μθ'(s′) + clip( 𝒩(0, σ̃), −c, +c ). Two ideas stacked: a Gaussian for the noise, and a clip to bound it. The graph shows both — drag the sliders to see how the clip folds the Gaussian's long tails back onto ±c.
The critic is trained to fit Q at exactly the actor's target action. If that Q-surface has a sharp, spurious peak (function-approximation error), the deterministic actor steers straight into it. Sampling a little noise around the target action and averaging turns the target into a smooth local average of Q — a SARSA-like regulariser, so the policy can't exploit a one-pixel spike.
A raw Gaussian has unbounded tails — every so often it would throw the target action far from μ(s′), into a totally different action regime and poison the target with nonsense. Clipping to [−c, +c] keeps the smoothing local and controlled: average over a small neighbourhood, never over wild outliers. Smooth — but only a little.
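As a sketch, target policy smoothing for a one-dimensional action is a couple of clamps. The default values for `sigma` and `noise_clip` and the final clamp to the valid action range [−1, 1] are assumptions for illustration:

```python
import random

def clamp(value, low, high):
    """Restrict value to the interval [low, high]."""
    return max(low, min(high, value))

def smoothed_target_action(mu_target, sigma=0.2, noise_clip=0.5,
                           action_low=-1.0, action_high=1.0, rng=random):
    """Target action plus clipped Gaussian noise, kept inside the valid action range."""
    noise = clamp(rng.gauss(0.0, sigma), -noise_clip, noise_clip)   # fold tails onto +/- c
    return clamp(mu_target + noise, action_low, action_high)

rng = random.Random(0)
samples = [smoothed_target_action(0.0, rng=rng) for _ in range(5)]
print(samples)  # every sample lies within 0.5 of the target action 0.0
```

Averaging the critic's target over many such samples is what turns a single point on the Q-surface into a small local average.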
The horizontal axis is the ratio r = π/π_old. The vertical axis is the PPO objective for one (s, a, A) sample.
When A > 0 (good action): the objective rises with r until r = 1+ε, then flattens. Even if the network could make the action much more probable, PPO gives it no extra credit for doing so.
When A < 0 (bad action): the objective improves as r shrinks, but only until r = 1−ε, where it flattens. PPO gives no further incentive to push the action probability below (1−ε)·π_old.
The min ensures that the pessimistic bound applies — if the network tried to be optimistic, PPO clips it. The net effect: every update step is contained in a "trust region" of ratio space.
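For one (s, a, A) sample the whole objective is two lines. A sketch, assuming ε = 0.2 as the default (a common choice; the function name is illustrative):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate for one sample: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

print(ppo_clip_objective(1.5, 1.0))   # 1.2: good action, gain capped at 1 + eps
print(ppo_clip_objective(0.5, -1.0))  # -0.8: bad action, flat below 1 - eps
print(ppo_clip_objective(1.5, -1.0))  # -1.5: moving the wrong way is never clipped
print(ppo_clip_objective(0.5, 1.0))   # 0.5: likewise for a good action made less likely
```

The last two cases are the pessimistic side of the min: the clip only removes rewards for going too far in the helpful direction, never the penalty for going in the harmful one.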
A2C does one gradient step per batch of rollouts. PPO runs several epochs of minibatch updates on the same batch (10 is a common setting). Because the clip keeps the new π close to π_old, the rollouts stay roughly valid from one epoch to the next, and we get several times more gradient updates out of every batch of expensive environment interactions. This sample reuse is a large part of why PPO became the modern default.
Two loops make PPO PPO. The outer loop collects a fresh batch and re-freezes πθ_old. The inner loop reuses that one batch for several epochs, which is safe only because the clip removes any incentive for πθ to drift far from πθ_old, so the probability ratio r_t stays well-behaved. The actor climbs L^CLIP, the critic regresses to the return target Ĝ_t, and the entropy term keeps the distribution from collapsing too early.
Simulated learning curves on a Lunar-Lander-class env. Both algorithms reach the same final return — but PPO does it in ~⅓ the environment steps. The clip prevents catastrophic updates, the multiple epochs squeeze more juice from each batch, and the entropy bonus + GAE keep exploration alive throughout.
PPO isn't just for robots and games — it's the algorithm that aligned ChatGPT. After a language model is pretrained and supervised-fine-tuned, the final polish is RLHF — Reinforcement Learning from Human Feedback — and the optimiser at its heart is PPO (OpenAI's InstructGPT → ChatGPT recipe).
Supervised fine-tune the pretrained LM on human-written demonstrations of good answers. This becomes the starting policy πref.
Humans rank several model answers to the same prompt; train a reward model rφ to predict which answer a human prefers — a learned, automatic judge.
Fine-tune the LM with PPO to maximise that reward model's score — pushing the policy toward answers humans rate highly.
| RL concept | …in ChatGPT's RLHF |
|---|---|
| Policy πθ (actor) | the language model itself — it "acts" by emitting tokens |
| State s | the prompt + every token generated so far |
| Action a | the next token sampled from the vocabulary (a ~50k-way discrete action) |
| Episode / trajectory | generating one full response, token by token |
| Reward r | the reward model's score of the finished answer, minus a per-token KL penalty |
| Critic Vφ | a value head on the LM estimating expected reward (feeds the advantage / GAE) |
When: the final training stage, after pretraining and SFT. This is the alignment step that turns a raw next-token predictor into an assistant that follows instructions, refuses harmful requests, and sounds helpful.
Why PPO and not vanilla policy gradient: the clip plus the explicit KL penalty keep the fine-tuned model from drifting far from the SFT model — without that leash the policy "reward-hacks" the reward model and collapses into repetitive gibberish that scores high but reads terribly. PPO's multi-epoch reuse also squeezes maximum learning from each batch of expensive reward-model-scored generations, and it stays stable at billion-parameter scale.
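One common way to assemble that reward, sketched under assumptions: the KL penalty is approximated per token by the log-probability difference between the policy and the SFT reference, the reward model's score is added on the final token only, and `beta` is a hypothetical penalty weight. Real pipelines differ in the details.

```python
def rlhf_token_rewards(logp_policy, logp_ref, reward_model_score, beta=0.1):
    """Per-token rewards for one generated response (lists are aligned by token)."""
    rewards = [
        -beta * (lp - lr)                 # penalise drifting away from the reference
        for lp, lr in zip(logp_policy, logp_ref)
    ]
    rewards[-1] += reward_model_score     # the learned judge scores the finished answer
    return rewards

print(rlhf_token_rewards(
    logp_policy=[-1.0, -2.0],
    logp_ref=[-1.5, -2.0],
    reward_model_score=3.0,
))  # first token is penalised for being more confident than the reference
```

These per-token rewards are what the critic and the advantage estimate are computed from, exactly as with environment rewards in the earlier sections.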
Note: newer alignment methods (e.g. DPO) skip the explicit PPO loop by optimising the preference data directly — but PPO was the original recipe behind InstructGPT and ChatGPT, and is still widely used for RLHF.
Everything so far was on-policy: roll out, learn, throw away. That's expensive when each environment step costs minutes (real robots, simulators). For continuous action spaces, can we go back to the DQN-style sample-efficient world of replay buffers and target networks?
DDPG says yes. The trick is to learn a deterministic policy μθ(s) = a — no distribution, just a function from state to a chosen action. Then the Q-function's gradient w.r.t. the action becomes the policy's gradient.
DDPG is actor-critic for continuous control. The crucial change from a stochastic policy: the actor is deterministic — it outputs one action, not a distribution. And like DQN, the critic scores a state-action pair, so its input includes the action.
Because the policy is deterministic and the updates use bootstrapped Q-targets, the data doesn't have to come from the current policy — so DDPG can borrow three off-policy stabilisers. Tap each card for how it works.
The whole loop runs off-policy: every environment step appends one transition to the replay buffer and does one update sampled from the entire history. The critic regresses to a bootstrapped target built from the target networks (the slow copies θ′, φ′), and the actor simply walks uphill on Q_φ(s, μθ(s)): the chain rule turns ∇_a Q into a gradient on θ. The fragile part is the 𝒩(0,σ) exploration noise bolted onto the action, which is exactly what SAC replaces with built-in entropy.
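The three moving parts of that loop reduce to a few lines each. A sketch with plain floats and lists standing in for networks; the function names and the `tau` default are illustrative assumptions:

```python
def ddpg_td_target(reward, done, q_target_next, gamma=0.99):
    """Critic target: r + gamma * Q'(s', mu'(s')), with no bootstrap on terminal steps."""
    return reward + gamma * (0.0 if done else q_target_next)

def polyak_update(target_params, online_params, tau=0.005):
    """Slow copy: target <- (1 - tau) * target + tau * online, per parameter."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

def actor_gradient(dq_da, da_dtheta):
    """Chain rule for a 1-D action: dQ/dtheta = dQ/da * da/dtheta, per parameter."""
    return [dq_da * g for g in da_dtheta]

print(ddpg_td_target(1.0, False, 2.0, gamma=0.5))  # 2.0
print(polyak_update([0.0], [1.0], tau=0.1))        # [0.1]
# Toy actor mu(s) = theta * s at s = 2, with Q(s, a) = -(a - 2)^2 evaluated at a = 1:
print(actor_gradient(dq_da=2.0, da_dtheta=[2.0]))  # [4.0], so theta should increase
```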
Continuous actions are everywhere: a spaceship's thrusters, a steering wheel + throttle, a push force, a joint torque. Pick a task in the tabs below and an episode budget, and watch the deterministic actor μ(s) go from flailing to fluent — the same algorithm, just a different continuous output.
DDPG is powerful but brittle: its single critic systematically over-estimates Q-values, the actor then exploits those bogus peaks, and training collapses. TD3 keeps DDPG's deterministic-actor backbone and fixes it with three small, surgical tricks — it's "DDPG done right".
TD3 keeps DDPG's deterministic actor but doubles the critic: two identical Q-nets (plus target copies of all three). The actor's input/output are unchanged from DDPG — the fix lives entirely in the critics, whose min kills the overestimation bias.
Line for line it is DDPG — with three surgical additions. ① The target action gets clipped noise so the critic can't latch onto a sharp spurious peak. ② The TD target uses min(Q1,Q2), deliberately under-shooting to cancel DDPG's optimism. ③ The actor and all target nets update only every d (≈2) critic steps, letting the value estimate settle before the policy chases it. Everything else — replay buffer, bootstrapped targets, Polyak averaging — is inherited unchanged.
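Additions ② and ③ are small enough to sketch directly (① is the clipped noise on the target action). Function names and the step-counting convention are assumptions for illustration:

```python
def td3_target(reward, done, q1_target_next, q2_target_next, gamma=0.99):
    """Twin-critic target: bootstrap from the smaller of the two target Q-values."""
    pessimistic_q = min(q1_target_next, q2_target_next)
    return reward + gamma * (0.0 if done else pessimistic_q)

def should_update_actor(critic_step, policy_delay=2):
    """Delayed updates: touch the actor and target nets only every policy_delay steps."""
    return critic_step % policy_delay == 0

print(td3_target(1.0, False, 5.0, 3.0, gamma=0.5))     # 2.5: uses Q2 = 3.0, not 5.0
print([should_update_actor(k) for k in range(1, 5)])   # [False, True, False, True]
```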
DDPG works but is famously unstable — a bad seed can ruin a run. The diagnosis: exploration is bolted on as external noise, fighting the deterministic policy. SAC takes a different approach: build exploration into the objective itself.
SAC trains three learnable networks: one stochastic actor and two identical critics (plus slow-moving target copies of the critics). Note the key difference from PPO: the SAC critic takes both the state and the action as input — it scores a specific (s, a) pair, not just a state.
SAC works in both the continuous-action and the discrete-action setting — the body of the nets is the same MLP, only the actor/critic heads change.
Unlike PPO's collect-then-optimise rhythm, SAC interleaves one gradient update per environment step and learns from a giant replay buffer of old transitions; that off-policy reuse is why it is so sample-efficient. Each step updates four things: both critics regress to the entropy-augmented target y, the actor climbs min(Q₁,Q₂) while keeping its entropy high, and the temperature α self-tunes to hold that entropy at H_target. The target critics trail behind by Polyak averaging.
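Two of those updates are easy to sketch with scalars. The target follows the usual soft-value form; the temperature step assumes the common parameterisation that learns log α with loss −log α · mean(log π + H_target), which is one of several variants in use. Names and learning rate are illustrative.

```python
def sac_target(reward, done, q1_next, q2_next, logp_next, alpha, gamma=0.99):
    """Entropy-augmented target: r + gamma * (min(Q1', Q2') - alpha * log pi(a'|s'))."""
    soft_value = min(q1_next, q2_next) - alpha * logp_next
    return reward + gamma * (0.0 if done else soft_value)

def temperature_step(log_alpha, logp_batch, target_entropy, lr=3e-4):
    """One gradient-descent step on log(alpha) for the loss named in the lead-in."""
    mean_gap = sum(lp + target_entropy for lp in logp_batch) / len(logp_batch)
    grad = -mean_gap                       # d loss / d log_alpha
    return log_alpha - lr * grad

print(sac_target(1.0, False, 2.0, 3.0, logp_next=-1.0, alpha=0.5, gamma=0.5))  # 2.25
# Policy entropy (-2.0) is below the target (-1.0), so the temperature goes up:
print(temperature_step(0.0, [2.0], target_entropy=-1.0, lr=0.1))  # 0.1
```

When the policy's entropy sits exactly at the target, the gap averages to zero and α stops moving; above the target, α shrinks and the actor is allowed to concentrate.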
The same continuous-control tasks, now under SAC. Drag the episodes slider: early on the policy is very random (high entropy H[π], high temperature α) — late in training the auto-tuned α drops and the policy concentrates on good actions, without ever fully collapsing. The two meters track the slider in real time.
Auto-tuned α holds entropy near a target H_target = −|A| (the negative of the action dimensionality); both the entropy and α fall as the episode count climbs: exploration → exploitation.
| | DDPG | TD3 | SAC |
|---|---|---|---|
| Policy | Deterministic μ(s) | Deterministic μ(s) | Stochastic Gaussian + tanh |
| Exploration | External noise (fragile) | External noise + target smoothing | Built into objective |
| Critics | 1 Q-network | 2 (clipped min) | 2 (clipped min) |
| Overestimation fix | — | Twin-min + target smoothing | Twin-min + entropy |
| Actor update cadence | Every step | Delayed (every d steps) | Every step |
| Temperature | — | — | Auto-tuned α |
| Stability across seeds | Notoriously brittle | Much improved | Robust |
| Practical default for continuous control | Historical | Strong (deterministic) | Current default |
Two trunks grow from a single root. The value-based line (from the DQN lab) learns Q(s,a) and acts greedily; the policy-gradient line (this lab) learns π(a|s) directly. They meet in the middle for continuous control, where DDPG borrows DQN's replay/target tricks and bolts them onto an actor — then TD3 and SAC harden it. Solid arrows = direct successor; dashed = an idea borrowed across families; gold = today's go-to default.
| Algorithm | Year | On / off-policy | Actor | Critic | Action space | Key idea |
|---|---|---|---|---|---|---|
| DQN | 2013 | Off | — | Q(s,a) | Discrete | Deep Q-learning + replay |
| REINFORCE | 1992 | On | π(a\|s) | — | Discrete | Monte-Carlo policy gradient |
| REINFORCE+baseline | ~1995 | On | π(a\|s) | V(s) | Discrete | Variance reduction by V baseline |
| A2C / Actor-Critic | 2016 | On | π | V | Discrete | n-step advantage, parallel envs |
| A3C | 2016 | On | π | V | Discrete | Async workers, lock-free |
| PPO | 2017 | On | π | V | Both | Clipped ratio, multi-epoch reuse |
| DDPG | 2015 | Off | μ(s) deterministic | Q(s,a) | Continuous | Chain-rule policy gradient |
| TD3 | 2018 | Off | μ(s) deterministic | 2 × Q | Continuous | Twin Q + delayed updates + target smoothing |
| SAC | 2018 | Off | Stochastic π | 2 × Q | Continuous | Max-entropy + twin Q + auto α |
Pick an algorithm and an environment. The policy used is hand-coded to mimic what that algorithm would actually learn (random → competent → optimal). Both envs are continuous-action: Lunar Lander uses a 2-D Gaussian (main + side thrust), Pendulum uses a 1-D Gaussian (torque). Every algorithm here can handle both — that's the whole point of policy gradients.
Drag the Train ep slider to see what the policy looks like at different stages of training. The earlier the episode, the more random the policy.