Backpropagation

Intuition

A network makes a prediction, the prediction is wrong by some amount, and now every weight needs to know one thing: "if I nudge myself a little, does the error go up or down, and how fast?" That number is the weight's gradient, and backpropagation is the bookkeeping trick that computes it for every weight in a single backward sweep. The key realization is that you don't have to re-run the whole network once per weight. The error at the output can be reused as it flows backward: each layer takes the error signal arriving from the layer in front of it, multiplies in its own local derivative, and passes the result on. Forward you compute activations left to right; backward you compute gradients right to left, one multiplication per layer. That's the whole idea.

How It Works

Backprop is the chain rule from calculus, applied layer by layer. For a single weight w, the loss depends on w only through a chain — w affects a pre-activation z, which affects an activation a, which (eventually) affects the loss L. The chain rule says the gradient is the product of the local derivatives along that chain:

∂L/∂w = ∂L/∂a · ∂a/∂z · ∂z/∂w

The clever part is grouping. Define each neuron's error signal δ = ∂L/∂z — how sensitive the loss is to that neuron's pre-activation. Once you have δ for a neuron, the gradient of any weight feeding into it is just δ × (the input on that wire). And you get the δ of an earlier layer from the δ of the later one by sending it back through the connecting weights and multiplying by the activation's derivative σ'. In matrix form that one step is δ_prev = (Wᵀ δ_next) ⊙ σ'(z) — the transpose of the forward weight matrix routes each error back along the wire it came in on, and is the element-wise product. (In this single-output toy W₂ is one row and δ_o a scalar, so the transpose-product is just scaling each weight by δ_o, which is why the code can write it as W2 * d_out.) So the backward pass is a loop: compute δ at the output, use it for the output weights, push δ back to the hidden layer, use it for the hidden weights. Finally, every weight takes one gradient-descent step, w ← w − η·∂L/∂w — the exact rule from the Gradient Descent lesson, now with the gradient that backprop just produced.

Step By Step

The default run uses the toy net (2 → 2 → 1, sigmoid) with input x = [0.50, 0.90] and target y = 1. The fixed parameters (shown on the canvas too, so every number below is reproducible) are W₁ = [0.15, 0.20; 0.25, 0.30] (row-major — one row per hidden neuron, so the flattened ∂L/∂W₁ reads [h1·x1, h1·x2, h2·x1, h2·x2]), b₁ = [0.35, 0.35], W₂ = [0.40, 0.45], b₂ = 0.60. Recall σ'(s) = s·(1 − s).

  1. Forward · hidden. Each hidden neuron computes a = σ(W₁·x + b₁), giving a = [0.647, 0.678].
  2. Forward · output. ŷ = σ(W₂·a + b₂) = 0.762.
  3. Loss. L = ½(ŷ − y)² = ½(0.762 − 1)² = 0.028. The prediction is below the target, so there is error to fix.
  4. δ at the output. δ_o = (ŷ − y)·σ'(z_o) = (−0.238)·(0.762·0.238) = −0.043. It's negative because raising ŷ would lower the loss.
  5. Output weights. ∂L/∂W₂ = δ_o · a = [−0.028, −0.029] — one gradient per hidden→output wire.
  6. δ back to the hidden layer. Send δ_o back through W₂ and multiply by each hidden neuron's σ': δ_h = (W₂·δ_o) ⊙ a(1−a) = [−0.004, −0.004]. This is the chain rule doing its one job — the output error became the hidden error.
  7. Input weights. The outer product of the hidden errors with the inputs gives one gradient per input→hidden wire: ∂L/∂W₁ = δ_h ⊗ x = [−0.002, −0.004, −0.002, −0.004] (the 2×2 matrix, shown row-major).
  8. Update. Every weight steps downhill: w ← w − η·∂L/∂w with η = 0.5. Because the gradients are negative, the weights increase, which pushes ŷ up toward the target on the next forward pass.
  9. Did it help? Re-running the forward pass with the updated weights gives ŷ ≈ 0.766 (up from 0.762) and L ≈ 0.027 (down from 0.028). One step makes a small dent; training repeats this whole loop thousands of times.

Watch the colour sweep: amber lights the neurons left→right on the forward beats, then orange lights them right→left on the backward beats, with each node's δ appearing beneath it. The header flips from forward → to ← backward and back to forward → for the closing beat that re-runs the network and shows the loss tick down — that reversal is the whole lesson.

Complexity

PassCost
ForwardO(number of weights)
BackwardO(number of weights) — same as forward
MemoryO(number of activations) — the forward values are cached for reuse

The headline fact: backprop computes all the gradients for the price of roughly one extra forward pass, no matter how many weights there are. That efficiency is what makes training large networks feasible.

Edge Cases

  • Already correct. If ŷ = y, then δ_o = 0, so every gradient is zero and no weight moves — you're at a minimum of the loss for this example.
  • Saturated neurons. Where a sigmoid is very close to 0 or 1, σ'(a) = a(1−a) is nearly zero, so the δ flowing through that neuron is throttled — the vanishing gradient.
  • Deeper networks. Each extra layer multiplies in another σ' factor, so small derivatives compound and gradients can vanish (or, with large weights, explode) — a real obstacle this toy net is too shallow to show.

Common Mistakes

  • Dropping the σ' term. The gradient at a neuron is the incoming error times the activation's local derivative. Forget σ' and you've broken the chain.
  • Using updated weights mid-backward. The backward pass must use the weights from the forward pass. Update them only after every gradient is computed, or later layers get the wrong δ.
  • Sign error. δ_o = (ŷ − y)·σ', and the update subtracts η·∂L/∂w. Flip a sign and you climb the loss instead of descending it.
  • Not caching activations. The backward pass reuses every forward activation (a, ŷ). Recomputing them per weight throws away the whole efficiency win.

A Note on Simplification

This is a deliberately tiny picture meant to make the chain rule visible — not a production training loop. A real network has thousands to billions of weights, not six; it backprops the average gradient over a mini-batch of examples, not one; it almost always uses ReLU (and normalization) instead of sigmoid to avoid the vanishing gradients noted above; and no one writes the backward lines by hand — frameworks build a computation graph and run automatic differentiation. We also hold the two biases fixed to keep the picture to six weights — in a real step they move too, by the identical rule ∂L/∂b = δ (a bias is just a weight whose input is always 1). But the per-weight move underneath all of that machinery is exactly what you see here: send the error backward, multiply in one local derivative per layer, then take a gradient-descent step.