Training Loop
Every neural network you have ever heard of was produced by the same short loop: run the data forward, measure how wrong the output is, push the blame backward, nudge every weight a little, repeat. This lesson puts that loop in your browser and lets you drive it. A tiny hand-coded network (two inputs, a small hidden layer, one output) trains live on 2-D points while you watch its decision surface morph and its loss curves fall (or refuse to).
The point is not the loop itself — you already met its parts in Gradient Descent and Backpropagation. The point is what the loop does under different conditions. The same code, fed the same data, lands in four distinguishable outcomes depending only on the learning rate, the model's capacity, and the noise in the data: it converges, it thrashes, it underfits, or it overfits. And the pair of loss curves — train and test — is enough to diagnose which one you are in.
Intuition
Think of the colored field behind the points as the network's current opinion: blue where it would answer "positive", magenta where it would answer "negative", white where it genuinely can't say. At epoch 0 the field is noise, because the weights are random. Every epoch, gradient descent nudges the weights to make the training points a little less wrong, and the field shifts accordingly. Watching the field is watching the network think out loud.
The two curves on the right are the honest scoreboard. The solid line is the loss on the 40 points the network trains on. The dashed line is the loss on 20 points it has never been allowed to touch. The solid line almost always falls — the loop is literally minimizing it. The dashed line is the one that tells the truth about learning: it falls only while the network is picking up real structure, and it turns back up the moment the network starts memorizing noise instead.
How It Works
One epoch is one trip through the loop:
- Forward — every training point flows through the net: h = tanh(W1·x + b1), then p = sigmoid(W2·h + b2), a probability that the point is class "+".
- Loss — binary cross-entropy compares every p against the true label: confidently right is cheap, confidently wrong is very expensive.
- Backward — the chain rule (exactly the bookkeeping from the Backpropagation lesson) assigns every weight its share of the blame.
- Update — every weight steps against its gradient, scaled by the learning rate η (the step from the Gradient Descent lesson).
- Eval — the loop also scores the held-out test points, which never influence the gradients. That number is pure observation.
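The five steps above can be sketched in plain Python. This is a minimal illustration, not the playground's actual source: the function names, the uniform initialization range, and the seed are all assumptions made for the example. It uses the standard identity that, for a sigmoid output with binary cross-entropy, the gradient with respect to the pre-sigmoid value is p − y.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def init_params(hidden, seed=0):
    # Assumed initialization: small uniform weights from a seeded generator.
    rng = random.Random(seed)
    return {
        "W1": [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(hidden)],
        "b1": [rng.uniform(-1, 1) for _ in range(hidden)],
        "W2": [rng.uniform(-1, 1) for _ in range(hidden)],
        "b2": 0.0,
    }

def forward(params, x):
    # Forward: h = tanh(W1·x + b1), then p = sigmoid(W2·h + b2).
    h = [math.tanh(w[0] * x[0] + w[1] * x[1] + b)
         for w, b in zip(params["W1"], params["b1"])]
    p = sigmoid(sum(w * hj for w, hj in zip(params["W2"], h)) + params["b2"])
    return h, p

def bce(p, y, eps=1e-12):
    # Loss: binary cross-entropy, clamped so log never sees 0.
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def mean_loss(params, data):
    return sum(bce(forward(params, x)[1], y) for x, y in data) / len(data)

def train_epoch(params, train, lr):
    # One full-batch update: accumulate gradients over all points, then step.
    H, n = len(params["W2"]), len(train)
    gW1 = [[0.0, 0.0] for _ in range(H)]
    gb1, gW2, gb2 = [0.0] * H, [0.0] * H, 0.0
    for x, y in train:
        h, p = forward(params, x)
        dz2 = (p - y) / n                      # Backward: output layer
        gb2 += dz2
        for j in range(H):
            gW2[j] += dz2 * h[j]
            dz1 = dz2 * params["W2"][j] * (1.0 - h[j] ** 2)  # tanh derivative
            gW1[j][0] += dz1 * x[0]
            gW1[j][1] += dz1 * x[1]
            gb1[j] += dz1
    for j in range(H):                         # Update: step against gradient
        params["W2"][j] -= lr * gW2[j]
        params["b1"][j] -= lr * gb1[j]
        params["W1"][j][0] -= lr * gW1[j][0]
        params["W1"][j][1] -= lr * gW1[j][1]
    params["b2"] -= lr * gb2

def fit(params, train, test, lr, epochs):
    history = []
    for _ in range(epochs):
        train_epoch(params, train, lr)
        # Eval: held-out points are scored but never produce gradients.
        history.append((mean_loss(params, train), mean_loss(params, test)))
    return history
```

Calling `fit(init_params(4), train, test, lr=0.3, epochs=...)` with lists of `((x0, x1), label)` pairs returns one `(train_loss, test_loss)` pair per epoch, which is the data behind the two curves.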
The playground runs this loop a few epochs per animation frame, full-batch (all 40 training points per update), and paints the field by running the same forward pass over a grid of locations. Everything is deterministic: the data, the initial weights, and therefore the whole run — Reset replays the identical experiment.
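Painting the field amounts to evaluating the model on a grid. The sketch below assumes a plot window of [-1, 1] in both axes and a hypothetical `predict(x, y)` callable that returns a probability; the playground's real resolution and coordinate range are not stated here.

```python
def paint_field(predict, resolution=50, lo=-1.0, hi=1.0):
    """Evaluate predict(x, y) on a resolution x resolution grid.

    Rows run top to bottom (y from hi down to lo), columns left to right
    (x from lo up to hi), matching the usual screen layout.
    """
    step = (hi - lo) / (resolution - 1)
    return [[predict(lo + col * step, hi - row * step)
             for col in range(resolution)]
            for row in range(resolution)]
```

Each returned value can then be mapped to a color, with probabilities near 0.5 rendered as the undecided white band.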
Step By Step
The default experiment (XOR · η = 0.3 · hidden = 4 · clean) resolves the cliffhanger from the perceptron lesson, which showed that no single line separates XOR, so the perceptron's updates cycle forever. Add one hidden layer of four tanh neurons and watch the wall fall:
- Epoch 0 — random weights: the field is a faint wash, the boundary region meaningless. Train and test loss start around ln 2 ≈ 0.69 — the loss of pure guessing.
- Early epochs — both curves drop fast as the net finds the rough layout. The field first splits the plane crudely in half — for a moment it looks like a linear classifier, because that's the easiest progress.
- The carve — then the field bends: the four tanh neurons combine into a curved frontier that gives the ++ and −− quadrants to one class and the mixed quadrants to the other. This shape is exactly what no single line can draw.
- Settling — both curves flatten near zero, train and test accuracy read 100%. The run auto-pauses at epoch 878, when the train loss has stayed below 0.02 for 200 straight epochs (a visualization window — real training stops on iteration budgets or validation loss, not on a pretty threshold).
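Two numbers in the walkthrough are easy to check in code: the starting loss of ln 2 and the auto-pause rule. The function below is a sketch of that rule under an assumed counting convention (0-based epochs, pausing on the epoch that completes the streak); the playground's own bookkeeping may differ.

```python
import math

def auto_pause_epoch(train_losses, threshold=0.02, window=200):
    """Return the first epoch at which the loss has stayed below
    `threshold` for `window` consecutive epochs, or None if it never does."""
    streak = 0
    for epoch, loss in enumerate(train_losses):
        streak = streak + 1 if loss < threshold else 0
        if streak >= window:
            return epoch
    return None

# A prediction of p = 0.5 costs the same whichever label is true:
# -ln(0.5) = ln 2, the "pure guessing" loss.
guess_loss = -math.log(0.5)
```

A single epoch at or above the threshold resets the streak, which is why a noisy run can take much longer to pause than a smooth one.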
Then run the other three experiments — each changes exactly one knob:
- Too hot (η = 30): the loss spikes and crashes chaotically and the field flickers. Note what it does not do here: explode to infinity. With cross-entropy on a sigmoid output the gradients are bounded, so a too-hot run thrashes around the landscape instead of leaving it — unlike the quadratic bowl in Gradient Descent, where η past the bound provably blows up. Instability wears different costumes on different losses.
- Too simple (no hidden layer): a linear model on XOR. Both curves plateau at ≈ 0.69 and accuracy hovers near guessing, forever. Nothing is wrong with the optimization — capacity is the binding constraint. This is the 1969 wall, now with instrumentation.
- Overfit (circle · noisy · 8 neurons): the fork. Test loss bottoms out within the first few hundred epochs, then climbs and climbs while train loss keeps falling to zero — the net is spending its surplus capacity memorizing noise. Final score: train accuracy 100%, test accuracy ~75%. The best network was the one a few hundred epochs ago, which is why real training watches validation loss and stops early.
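The overfit run motivates early stopping. A minimal sketch, assuming you have recorded the held-out loss once per epoch: `best_epoch` finds the checkpoint worth keeping, and `early_stop_epoch` applies a common patience rule. The `patience` parameter is an illustrative choice, not something the playground exposes.

```python
def best_epoch(heldout_losses):
    """Index of the lowest held-out loss (first occurrence on ties)."""
    return min(range(len(heldout_losses)), key=heldout_losses.__getitem__)

def early_stop_epoch(heldout_losses, patience=50):
    """Epoch at which a patience rule halts training: stop once the
    held-out loss has failed to improve for `patience` epochs in a row.
    Returns None if the rule never triggers."""
    best = float("inf")
    since_best = 0
    for epoch, loss in enumerate(heldout_losses):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return None
```

In practice the loss used for this decision comes from a validation split kept separate from the final test set, so that the stopping choice does not leak into the reported score.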
Complexity
One epoch costs O(n · H), where n is the number of training points and H the number of hidden neurons: 40 points through a handful of neurons, a few thousand multiplications. That's why this playground trains in real time: a few thousand multiplications per epoch over the 4000-epoch budget is on the order of tens of millions of flops, less than a phone spends decoding one video frame. Real training runs the same loop with billions of parameters and minibatches instead of the full batch. The loop survives the scale-up unchanged; only the bookkeeping around it grows.
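The per-epoch estimate can be reproduced with a little arithmetic. This sketch counts one multiplication per weight per point for the forward pass and assumes the common rule of thumb that the backward pass costs about twice the forward pass; both the helper names and that factor are assumptions for illustration.

```python
def param_count(hidden, inputs=2):
    # Hidden weights + hidden biases + output weights + output bias.
    return inputs * hidden + hidden + hidden + 1

def forward_multiplies(n_points, hidden, inputs=2):
    # One multiply per weight per point; biases are added, not multiplied.
    return n_points * (inputs * hidden + hidden)

def approx_epoch_multiplies(n_points, hidden, backward_factor=2):
    # Forward pass plus a backward pass assumed to cost `backward_factor`
    # times as much.
    return forward_multiplies(n_points, hidden) * (1 + backward_factor)
```

For 40 points and 8 hidden neurons this gives 960 forward multiplications and roughly 2,880 per epoch, in line with "a few thousand".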
Edge Cases
- Linear mode draws a crisp line. With no hidden layer the p = 0.5 frontier is exactly w·x + b = 0, so the playground draws the literal line over the field — the same object the perceptron lesson rotates.
- The spiral is a capacity ladder. Linear scores ~50% (chance), 2 neurons strain, 8 neurons largely crack it. Capacity buys curvature, one neuron at a time.
- A slow run is not a stuck run. At η = 0.03 the curves are still falling when the 4000-epoch cap lands. Distinguish it from "too simple" by the slope: slow-but-falling vs flat-forever.
- Same seed, same run. The datasets and initial weights are seeded, the updates are full-batch, so there is no randomness anywhere — every configuration always produces the identical trajectory.
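The determinism claim is easy to demonstrate. The sketch below trains a linear (no hidden layer) model on seeded XOR-style points, with every random draw coming from one seeded generator and the batch visited in a fixed order. The data generator and the `run` helper are hypothetical stand-ins, not the playground's datasets.

```python
import math
import random

def run(seed, epochs=50, lr=0.3, n=40):
    rng = random.Random(seed)              # the only source of randomness
    data = []
    for _ in range(n):
        x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        data.append((x, 1.0 if x[0] * x[1] > 0 else 0.0))  # XOR-style labels
    w = [rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)]
    b = 0.0
    history = []
    for _ in range(epochs):
        g0 = g1 = gb = loss = 0.0
        for (x0, x1), y in data:           # full batch, fixed order
            p = 1.0 / (1.0 + math.exp(-(w[0] * x0 + w[1] * x1 + b)))
            loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
            g0 += (p - y) * x0
            g1 += (p - y) * x1
            gb += p - y
        w[0] -= lr * g0 / n
        w[1] -= lr * g1 / n
        b -= lr * gb / n
        history.append(loss / n)
    return history
```

Two calls with the same seed return identical loss histories, element for element. Minibatch training loses this property unless the shuffling is seeded as well.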
Common Mistakes
- Reading train loss as learning. Train loss falling is the loop doing its job; test loss falling is learning. Only the pair tells you which.
- Treating "more neurons" as monotonically better. On the noisy circle, 8 neurons do worse on test data than a smaller net would: surplus capacity plus noise is precisely the overfitting recipe.
- Expecting too-hot to mean explosion. On this loss it means thrashing. If you want to see a genuine blow-up, that's the diverge tab in Gradient Descent — different loss surface, different failure costume.
- Reading the auto-pause as convergence theory. "Loss < 0.02 for 200 epochs" is a window this visualization chose so runs end; it is not a principled stopping criterion.
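The reason a too-hot run thrashes rather than explodes can be made concrete with a short derivation. Using the lesson's symbols, with z as an added name for the pre-sigmoid output, the chain rule collapses to a difference that can never exceed 1 in magnitude:

```latex
\begin{aligned}
z &= W_2 \cdot h + b_2, \qquad p = \sigma(z), \qquad
L = -\bigl[\, y \ln p + (1 - y)\ln(1 - p) \,\bigr] \\
\frac{\partial L}{\partial z}
  &= \frac{\partial L}{\partial p}\,\frac{\partial p}{\partial z}
   = \frac{p - y}{p(1 - p)} \cdot p(1 - p)
   = p - y \\
\left|\frac{\partial L}{\partial z}\right| &< 1,
\qquad
\left|\frac{\partial L}{\partial W_{2,j}}\right| = |p - y|\,|h_j| < 1
\end{aligned}
```

The first bound holds because p lies strictly between 0 and 1 and y is 0 or 1; the second because each tanh activation h_j lies strictly between −1 and 1. Hidden-layer gradients carry extra factors of W2 and x, so they stay bounded only as long as those do. That is enough to keep a single step finite, but it does not stop a large η from overshooting on every step.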
A Note on Simplification
This is a 2-D toy trained with plain full-batch gradient descent, so every phenomenon is visible in seconds. Real training differs in machinery, not in kind: minibatches (gradients are noisy estimates), adaptive optimizers like Adam (per-weight step sizes), regularization and early stopping (the standard defenses against the overfitting you just triggered on purpose), and millions to billions of parameters in place of a few dozen. The loop you watched (forward, loss, backward, update, evaluate honestly) is the same one, unchanged, all the way up to the frontier models.