Forward Propagation
Intuition
A neural network is a stack of layers that turns an input into a prediction. You feed numbers in at one end, each layer reshuffles and recombines them, and the last layer reports a score for every possible answer. Forward propagation is simply that left-to-right journey: push the input forward through the layers and read off what comes out.
The network here recognizes handwritten digits. The input is a 28×28 grayscale image — 784 brightness values. The output is ten numbers, one per digit 0–9. The biggest one is the network's guess.
How It Works
The architecture is 784 → 25 → 25 → 10: an input layer of 784 pixels, two
hidden layers of 25 neurons each, and an output layer of 10.
Every neuron does the same small computation in two steps. First the
weighted sum (the pre-activation z = w·a + b): scale each input by a
weight, add them up, add a bias. Weights and biases can be negative, so z can be too. Then the ReLU
nonlinearity, which clamps negatives to zero (max(0, z)). The animation shows
these as separate beats: a neuron's z first (dark for positive, blue for
negative), then ReLU fading the blue ones to empty. A weight is the strength of
one connection: positive weights (drawn green) excite the next neuron, negative
weights (drawn magenta) inhibit it.
The output layer skips ReLU. Its ten raw scores (called logits) go through
softmax, which squashes them into probabilities that add up to 1. The
predicted digit is the argmax — the neuron with the highest probability.
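As a rough sketch of the output stage, the snippet below applies softmax to ten logits and picks the argmax. The logit values are invented for illustration, and subtracting the largest logit before exponentiating is a common numerical-stability habit assumed here, not something the lesson specifies.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that add up to 1."""
    peak = max(logits)  # shifting by the max avoids overflow in exp
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical logits for digits 0-9; index 7 holds the largest score.
logits = [-1.2, 0.3, 0.8, 1.1, -0.5, 0.0, -2.0, 6.5, 0.9, 1.4]
probs = softmax(logits)
prediction = argmax(probs)
print(prediction, round(probs[prediction], 3))
```

Softmax preserves the ordering of the logits, so taking the argmax of the probabilities gives the same digit as taking the argmax of the raw scores.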
A network's capacity grows with its connections. Each line on the canvas is one weight; the more weights there are, the more complex the patterns the network can represent. (The 784 → 25 connections into the first hidden layer, 19,600 of them, are too dense to draw, so only a representative subset between the later layers is shown.)
Step By Step
For the default input — a handwritten 7:
- Input. The 28×28 image is flattened into 784 pixel values. Dark cells (the ink) carry high values; the white background is near zero.
- Layer 1 (weighted sum). Each of the 25 neurons computes z₁ = x·W₁ + b₁ over all 784 pixels. About ten come out negative (blue), the rest positive (dark).
- Layer 1 (ReLU). max(0, z₁) zeros the negative neurons, which fade to empty, while the positive ones keep their value as the activations a₁.
- Layer 2 (weighted sum). The drawn weighted connections from layer 1 appear, and each layer-2 neuron shows z₂ = a₁·W₂ + b₂ (signed again).
- Layer 2 (ReLU). Negatives are zeroed once more, giving a₂.
- Output (logits). Ten raw scores z₃ = a₂·W₃ + b₃ appear, one per digit. The neuron for 7 has the highest score; the others trail behind.
- Predict. Softmax sharpens the logits into probabilities. The 7 neuron wins decisively (≈100%), so the network predicts 7.
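The seven beats above can be strung together into one function. The sketch below is illustrative only: it uses randomly initialized placeholder weights and a random stand-in image rather than the trained network or a real digit, so the digit it prints is meaningless. Only the shapes and the order of operations mirror the lesson. All names (`init_layer`, `affine`, `forward`) are assumptions.

```python
import math
import random

LAYER_SIZES = [784, 25, 25, 10]  # architecture from the lesson

def init_layer(n_in, n_out, rng):
    """Random placeholder weights (n_in x n_out) and zero biases. NOT trained."""
    scale = 1.0 / math.sqrt(n_in)
    W = [[rng.uniform(-scale, scale) for _ in range(n_out)] for _ in range(n_in)]
    b = [0.0] * n_out
    return W, b

def affine(a, W, b):
    """Weighted sum z = a·W + b for a row vector a."""
    return [sum(a[i] * W[i][j] for i in range(len(a))) + b[j]
            for j in range(len(b))]

def relu(z):
    return [max(0.0, v) for v in z]

def softmax(z):
    peak = max(z)  # shift by the max for numerical stability
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, layers):
    """Hidden layers: weighted sum then ReLU. Output layer: weighted sum then softmax."""
    a = x
    for W, b in layers[:-1]:
        a = relu(affine(a, W, b))
    W, b = layers[-1]
    logits = affine(a, W, b)  # the output layer skips ReLU
    return logits, softmax(logits)

rng = random.Random(0)
layers = [init_layer(n_in, n_out, rng)
          for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:])]
x = [rng.random() for _ in range(784)]  # stand-in for a flattened 28x28 image

logits, probs = forward(x, layers)
prediction = max(range(10), key=lambda d: probs[d])
print(prediction, probs[prediction])
```

With trained weights in place of the random ones, the same `forward` call would reproduce the pass the animation steps through.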
Switch to Ambiguous for any digit to see a less certain image: the same seven beats run, but softmax stays less one-sided — instead of collapsing onto one digit it leaves a visible runner-up, a more honest picture of how the network reasons under doubt.
Complexity
Inference is one pass, and its cost is dominated by the weight matrices:
784·25 + 25·25 + 25·10 = 20,475 multiply-adds. In general the work scales as
O(connections) — there is no looping or backtracking, just one
left-to-right sweep.
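The count can be reproduced directly from the layer sizes. This is a small sketch; the bias total is an extra detail the text above does not mention, included only to show how the multiply-add count relates to the number of trainable parameters.

```python
LAYER_SIZES = [784, 25, 25, 10]

# One multiply-add per weight: n_in * n_out for each pair of adjacent layers.
per_layer = [n_in * n_out for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:])]
multiply_adds = sum(per_layer)

# One bias per non-input neuron (an addition, not a multiply-add).
biases = sum(LAYER_SIZES[1:])

print(per_layer)               # [19600, 625, 250]
print(multiply_adds)           # 20475
print(multiply_adds + biases)  # 20535 trainable parameters
```

The first layer accounts for almost all of the work, which is also why its connections are too dense to draw.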
Edge Cases
- Saturated vs. uncertain. On a clean digit, softmax collapses to ~100% for one class. On an ambiguous one, the probability is split — both are correct forward passes, only the confidence differs.
- Inactive neurons. ReLU outputs exactly zero whenever its weighted sum is negative, so many hidden neurons stay dark for any given image. That is normal, not a bug. (A neuron is only called "dead" in the strict sense when it outputs zero for every input.)
- Wrong answers exist. A real network misclassifies some inputs. The images in this lesson are curated so the network gets every one right.
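The first two edge cases can be seen in miniature with made-up numbers. In the sketch below, the logits and pre-activations are hypothetical values chosen to show the effect; they are not taken from the lesson's network.

```python
import math

def softmax(z):
    peak = max(z)  # shift by the max for numerical stability
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: one clear winner vs. a close runner-up.
clean = [0.0] * 10
clean[7] = 12.0
ambiguous = [0.0] * 10
ambiguous[7] = 3.0
ambiguous[1] = 2.5

p_clean = softmax(clean)
p_ambiguous = softmax(ambiguous)
print(round(p_clean[7], 4))                                  # close to 1
print(round(p_ambiguous[7], 2), round(p_ambiguous[1], 2))    # split between 7 and 1

# Hypothetical pre-activations for five hidden neurons.
z = [1.5, -0.3, 2.2, -2.0, 0.7]
a = [max(0.0, v) for v in z]
inactive = sum(1 for v in a if v == 0.0)
print(a, inactive)  # [1.5, 0.0, 2.2, 0.0, 0.7] 2
```

Both softmax results predict 7; only the confidence differs. The two zeroed activations are ordinary for a single input.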
Common Mistakes
- Reading meaning into hidden neurons. Individual hidden units rarely correspond to anything a human would name. Only the output layer has a meaning fixed by design: one neuron per digit.
- Dropping the nonlinearity. Without ReLU, stacking layers collapses into a single linear map — the two hidden layers would buy you nothing.
- Confusing logits with probabilities. Logits are unbounded raw scores; only after softmax do they become a distribution you can read as confidence.
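The claim about dropping the nonlinearity can be checked numerically. The sketch below uses a hypothetical 3 → 2 → 2 network with hand-picked weights (not the lesson's) and shows that two layers without ReLU give the same output as a single layer with W = W₁·W₂ and b = b₁·W₂ + b₂, whereas inserting ReLU changes the result.

```python
def matvec(a, W):
    """Row vector a times matrix W (len(a) rows, n_out columns)."""
    return [sum(a[i] * W[i][j] for i in range(len(a))) for j in range(len(W[0]))]

def matmul(A, B):
    """Matrix product, one row of A at a time."""
    return [matvec(row, B) for row in A]

def add(u, v):
    return [p + q for p, q in zip(u, v)]

# Hypothetical tiny network: 3 inputs -> 2 hidden -> 2 outputs.
W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 3.0]]
b1 = [0.5, -1.0]
W2 = [[2.0, -1.0], [1.0, 0.5]]
b2 = [0.0, 1.0]
x = [1.0, 2.0, 3.0]

# Two stacked layers with the nonlinearity dropped.
two_layers = add(matvec(add(matvec(x, W1), b1), W2), b2)

# The single linear layer they collapse into.
W = matmul(W1, W2)
b = add(matvec(b1, W2), b2)
one_layer = add(matvec(x, W), b)

# The same two layers with ReLU between them.
hidden = [max(0.0, v) for v in add(matvec(x, W1), b1)]
with_relu = add(matvec(hidden, W2), b2)

print(two_layers, one_layer)  # [7.0, 5.5] [7.0, 5.5]
print(with_relu)              # [8.0, 5.0]
```

The collapsed matrix W has the shape of a direct input-to-output layer, so without ReLU the hidden layer adds parameters but no expressive power.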
A Note on Simplification
This lesson visualizes a deliberately small network — a two-hidden-layer MLP (25 + 25 hidden units) trained on MNIST — and draws only a representative subset of its connections. Production digit recognizers are larger and add machinery this view omits for clarity: convolutional layers, regularization (dropout / weight decay), batch normalization, and far more neurons. What is shown — weighted sums, a nonlinearity, and softmax — is real, and it is exactly the mechanism those bigger networks scale up. Treat this as an explainer, not a substitute for a formal machine-learning course.