Perceptron
The perceptron is the first learning machine: a single neuron that draws a straight line through the plane and moves that line every time it is wrong. Watching it run answers two questions at once — how a machine can learn from mistakes at all, and why one neuron was never going to be enough.
Intuition
A perceptron classifies a point x = (x₁, x₂) by computing a score
w·x + b and reading only its sign: positive side of the line, predict +1;
negative side, predict -1. The weights w are the line's normal vector —
w points at the positive side, and b slides the line away from the origin.
Learning is one move repeated: visit the points in order, and whenever a
point sits on the wrong side, pull the boundary toward being right about
it — add the point to the weights (w ← w + y·x, b ← b + y). On the
canvas this is literally drawn as vector addition: the pull y·x attaches
tip-to-tail to w, and the boundary swings to its new pose.
That's the whole algorithm. No loss function, no gradients — just "wrong? move toward right."
How It Works
One pass visits all n points in a fixed order. For each point:
- Score it:
score = w·x + b. - Check the side: the prediction is the sign of the score. A point is a
mistake when
y · score ≤ 0— note the≤: a score of exactly0(the point sits on the line) counts as wrong, which is why the very first check of every run, starting fromw = (0, 0), is always a mistake. - Update on mistakes only:
w ← w + y·xandb ← b + y. Correct points change nothing.
The run ends when a full pass makes zero mistakes — the line separates the data, and every future pass would do nothing.
The learning rate is fixed at η = 1, and that is not a simplification:
starting from w = 0, every state the run reaches is just "a sum of the
mistakes so far," so scaling η scales (w, b) by a constant — and a constant
scale never changes the sign of any score. η would change the printed
numbers, never a single prediction.
The convergence theorem. If any separating line exists with margin γ
(the distance from the line to the closest point) and the data fits in a ball
of radius R, the perceptron makes at most (R/γ)² updates — ever. The
bound doesn't depend on n, only on the geometry: a wide gap means a few
updates, a thin gap means many. The Narrow margin preset is the same
size as the default and the same rule, but its thin gap costs 13 updates
where the default needs 5.
The failure. If no separating line exists, the rule never stops — and on integer data it does something sharper than wandering: the weights stay bounded, so they live on a finite grid of states, and a deterministic rule that ever revisits a state will repeat forever. The XOR preset shows this exactly: the pass ledger ends pass 2 in precisely the state that ended pass 1, so passes 3, 4, 5… are foregone conclusions. This is the limit Minsky and Papert published in 1969 — one line cannot cut the XOR pattern — and it stalled neural-network research until multi-layer networks trained by backpropagation dissolved it: one hidden layer is enough to carve the plane into regions no single line can.
Step By Step
The default Separable dataset has 8 points: four +1 (indigo) in an
upper arc, four -1 (magenta) below, visited in index order.
Pass 1 — 3 mistakes:
- Point 0
(4, 2), label+1: score= 0— the tie rule makes it a mistake. Update:w = (4, 2),b = 1. The boundary appears. - Point 1
(-2, -3), label-1: score= -13✓. - Point 2
(-4, 2), label+1: score= -11— wrong side. Update:w = (0, 4),b = 2. The boundary swings nearly horizontal. - Point 3
(3, 0), label-1: score= 2— wrong side. Update:w = (-3, 4),b = 1. - Points 4–7 all check out ✓ (scores
11, -17, 19, -12).
Pass 2 — 2 mistakes:
- Point 0
(4, 2)again: score= -3— the pass-1 fixes broke it. Update:w = (1, 6),b = 2. - Point 3
(3, 0)again: score= 5— the near-boundary point forces a second correction. Update:w = (-2, 6),b = 1. - The other six points pass ✓.
Pass 3 — 0 mistakes: all eight scores have the right sign, the ledger
commits ✗ 0, and the run converges at w = (-2, 6), b = 1 after 5
updates total.
Complexity
- Per pass:
O(n)score computations, eachO(d)inddimensions — hered = 2. - Updates until convergence: at most
(R/γ)²on separable data, independent ofn. The margin, not the dataset size, is what you pay for. - Non-separable data: no bound exists — the loop runs forever (the animation proves its preset cycles rather than asserting it).
Edge Cases
score = 0— counted as a mistake by the≤in the rule. This is what bootstraps learning: the all-zero start misclassifies its first point by definition and the first update always fires.w = (0, 0)— no boundary exists (every point scores0). True at init, and briefly true mid-run on the XOR preset, where the canvas honestly shows the boundary vanishing.- A point exactly on the boundary mid-run is a mistake for whichever class it carries — the rule leaves no neutral zone.
- Duplicated or revisited mistakes: the same point may force updates in several passes (point 3 here does, twice) — each pull overshoots or undershoots until the geometry finally accommodates everyone.
Common Mistakes
- Believing convergence means a good line. The theorem guarantees a separator, not the widest one — the final line here hugs the negative class. Maximizing the margin is a different algorithm (the SVM).
- Blaming the learning rate. On XOR no η converges — the failure is the
data's geometry, not the step size. From
w = 0, η doesn't even change predictions. - Over-generalizing the bound.
(R/γ)²bounds updates, not passes, and only exists when the data is separable. - Reading the visit order as part of the algorithm. Any order works for the theorem; order changes which line you converge to and how the XOR cycle looks, not whether either happens.
A note on simplification: the animation's "it will cycle forever" verdict
comes from the visualization's bookkeeping — it watches the pass-end states
and flags the first repeat. The algorithm itself (the code panel) has no such
detector; it just runs out of passes and returns None.
What To Try
- Separable → Narrow margin: same rule, same 8 points' bounding box —
watch the mistake counts per pass (
3, 2, 0vs8, 5, 0) and feel the(R/γ)²bound as geometry. - XOR: watch the pass ledger. Pass 2 ends at the same
(w, b)as pass 1 — once you see the repeated row, you've proved it never converges, no faith required. - Watch point 3 on the default run: the near-boundary point is the one that forces corrections in both working passes. Margins are expensive where points crowd the line.
- Then see what replaced the single neuron: a multi-layer network's forward pass and the backpropagation rule that trains it.
These lessons are still being refined and may contain mistakes.