Backpropagation
Intuition
A network makes a prediction, the prediction is wrong by some amount, and now every weight needs to know one thing: "if I nudge myself a little, does the error go up or down, and how fast?" That number is the weight's gradient, and backpropagation is the bookkeeping trick that computes it for every weight in a single backward sweep. The key realization is that you don't have to re-run the whole network once per weight. The error at the output can be reused as it flows backward: each layer takes the error signal arriving from the layer in front of it, multiplies in its own local derivative, and passes the result on. Forward you compute activations left to right; backward you compute gradients right to left, one multiplication per layer. That's the whole idea.
How It Works
Backprop is the chain rule from calculus, applied layer by layer. For a
single weight w, the loss depends on w only through a chain — w affects a
pre-activation z, which affects an activation a, which (eventually) affects
the loss L. The chain rule says the gradient is the product of the local
derivatives along that chain:
∂L/∂w = ∂L/∂a · ∂a/∂z · ∂z/∂w
The clever part is grouping. Define each neuron's error signal
δ = ∂L/∂z — how sensitive the loss is to that neuron's pre-activation. Once
you have δ for a neuron, the gradient of any weight feeding into it is just
δ × (the input on that wire). And you get the δ of an earlier layer from the
δ of the later one by sending it back through the connecting weights and
multiplying by the activation's derivative σ'. In matrix form that one step is
δ_prev = (Wᵀ δ_next) ⊙ σ'(z) — the transpose of the forward weight matrix
routes each error back along the wire it came in on, and ⊙ is the
element-wise product. (In this single-output toy W₂ is one row and δ_o a
scalar, so the transpose-product is just scaling each weight by δ_o, which is
why the code can write it as W2 * d_out.) So the backward pass is a loop:
compute δ at the output, use it for the output weights, push δ back to the
hidden layer, use it for the hidden weights. Finally, every weight takes one
gradient-descent step, w ← w − η·∂L/∂w — the exact rule from the Gradient
Descent lesson, now with the gradient that backprop just produced.
Step By Step
The default run uses the toy net (2 → 2 → 1, sigmoid) with input
x = [0.50, 0.90] and target y = 1. The fixed parameters (shown on the
canvas too, so every number below is reproducible) are W₁ = [0.15, 0.20; 0.25, 0.30] (row-major — one row per hidden neuron, so the flattened ∂L/∂W₁ reads
[h1·x1, h1·x2, h2·x1, h2·x2]), b₁ = [0.35, 0.35], W₂ = [0.40, 0.45],
b₂ = 0.60. Recall σ'(s) = s·(1 − s).
- Forward · hidden. Each hidden neuron computes
a = σ(W₁·x + b₁), givinga = [0.647, 0.678]. - Forward · output.
ŷ = σ(W₂·a + b₂) = 0.762. - Loss.
L = ½(ŷ − y)² = ½(0.762 − 1)² = 0.028. The prediction is below the target, so there is error to fix. - δ at the output.
δ_o = (ŷ − y)·σ'(z_o) = (−0.238)·(0.762·0.238) = −0.043. It's negative because raisingŷwould lower the loss. - Output weights.
∂L/∂W₂ = δ_o · a = [−0.028, −0.029]— one gradient per hidden→output wire. - δ back to the hidden layer. Send
δ_oback throughW₂and multiply by each hidden neuron'sσ':δ_h = (W₂·δ_o) ⊙ a(1−a) = [−0.004, −0.004]. This is the chain rule doing its one job — the output error became the hidden error. - Input weights. The outer product of the hidden errors with the inputs
gives one gradient per input→hidden wire:
∂L/∂W₁ = δ_h ⊗ x = [−0.002, −0.004, −0.002, −0.004](the 2×2 matrix, shown row-major). - Update. Every weight steps downhill:
w ← w − η·∂L/∂wwithη = 0.5. Because the gradients are negative, the weights increase, which pushesŷup toward the target on the next forward pass. - Did it help? Re-running the forward pass with the updated weights gives
ŷ ≈ 0.766(up from0.762) andL ≈ 0.027(down from0.028). One step makes a small dent; training repeats this whole loop thousands of times.
Watch the colour sweep: amber lights the neurons left→right on the forward
beats, then orange lights them right→left on the backward beats, with each
node's δ appearing beneath it. The header flips from forward → to
← backward and back to forward → for the closing beat that re-runs the
network and shows the loss tick down — that reversal is the whole lesson.
Complexity
| Pass | Cost |
|---|---|
| Forward | O(number of weights) |
| Backward | O(number of weights) — same as forward |
| Memory | O(number of activations) — the forward values are cached for reuse |
The headline fact: backprop computes all the gradients for the price of roughly one extra forward pass, no matter how many weights there are. That efficiency is what makes training large networks feasible.
Edge Cases
- Already correct. If
ŷ = y, thenδ_o = 0, so every gradient is zero and no weight moves — you're at a minimum of the loss for this example. - Saturated neurons. Where a sigmoid is very close to 0 or 1,
σ'(a) = a(1−a)is nearly zero, so theδflowing through that neuron is throttled — the vanishing gradient. - Deeper networks. Each extra layer multiplies in another
σ'factor, so small derivatives compound and gradients can vanish (or, with large weights, explode) — a real obstacle this toy net is too shallow to show.
Common Mistakes
- Dropping the
σ'term. The gradient at a neuron is the incoming error times the activation's local derivative. Forgetσ'and you've broken the chain. - Using updated weights mid-backward. The backward pass must use the weights
from the forward pass. Update them only after every gradient is computed, or
later layers get the wrong
δ. - Sign error.
δ_o = (ŷ − y)·σ', and the update subtractsη·∂L/∂w. Flip a sign and you climb the loss instead of descending it. - Not caching activations. The backward pass reuses every forward activation
(
a,ŷ). Recomputing them per weight throws away the whole efficiency win.
A Note on Simplification
This is a deliberately tiny picture meant to make the chain rule visible — not a
production training loop. A real network has thousands to billions of weights,
not six; it backprops the average gradient over a mini-batch of examples,
not one; it almost always uses ReLU (and normalization) instead of sigmoid
to avoid the vanishing gradients noted above; and no one writes the backward
lines by hand — frameworks build a computation graph and run automatic
differentiation. We also hold the two biases fixed to keep the picture to
six weights — in a real step they move too, by the identical rule
∂L/∂b = δ (a bias is just a weight whose input is always 1). But the per-weight
move underneath all of that machinery is exactly what you see here: send the
error backward, multiply in one local derivative per layer, then take a
gradient-descent step.