Gradient Descent

Intuition

Imagine standing on a hillside in thick fog. You cannot see the valley floor, but you can feel which way the ground tilts under your feet. So you take a step straight downhill, feel the slope again, and step again — over and over — until the ground is flat and you have reached the bottom. Gradient descent is exactly this, made precise: the gradient is the direction the ground tilts, and the learning rate is how big a step you take each time. Nearly every modern neural network is trained this way — by rolling a number downhill on a curve of "how wrong am I?"

This lesson shrinks that idea to one parameter and one curve, the convex bowl f(θ) = θ². The height of the curve is the loss (the error); the position along it is the parameter θ we are free to change. The whole game is to find the θ that sits at the bottom of the bowl.

How It Works

At any point on the curve, the gradient f'(θ) is the slope of the tangent line — positive when the curve rises to the right, negative when it rises to the left. Because the slope points uphill, we move in the opposite direction to go down. That is the entire update rule:

θ ← θ − η · f'(θ)

The learning rate η (eta) scales the step. The gradient already tells us the direction and roughly how steep things are, so steps are naturally large when the curve is steep (far from the bottom) and shrink as it flattens out (near the bottom). For our bowl the slope is exactly f'(θ) = 2θ, so each update is θ ← θ − η·2θ = θ(1 − 2η).
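The update rule is short enough to write out directly. The sketch below is a minimal illustration, not the lesson's own implementation: the function names `loss`, `gradient`, and `step` are made up for this example, and the start θ = 4 with η = 0.25 is just one convenient choice.

```python
def loss(theta: float) -> float:
    """Height of the bowl: f(theta) = theta^2."""
    return theta * theta


def gradient(theta: float) -> float:
    """Slope of the bowl: f'(theta) = 2 * theta."""
    return 2.0 * theta


def step(theta: float, eta: float) -> float:
    """One update: move against the slope, scaled by the learning rate."""
    return theta - eta * gradient(theta)


theta, eta = 4.0, 0.25  # illustrative start and rate (assumed for this sketch)

updated = step(theta, eta)            # generic rule: theta - eta * f'(theta)
shortcut = theta * (1.0 - 2.0 * eta)  # closed form, valid for this bowl only

print(updated, shortcut, loss(updated))
```

Both routes give the same number, which is the point: on this bowl the generic rule collapses to multiplying θ by (1 − 2η).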

That single factor (1 − 2η) decides everything, and it is why the lesson lets you switch between three learning rates:

  • η too small (well below 0.5 for this bowl): the factor sits just under 1, so each step barely shrinks θ. The ball crawls and does not reach the bottom in the steps we allow. Safe, but slow.
  • η just right: steady steps that march straight to the minimum without crossing it.
  • η too large (above 0.5 for this bowl): the factor turns negative, so the step jumps past the minimum to the other side. The ball zig-zags back and forth across the valley, wasting effort. Past η = 1 the factor is larger than 1 in size and the ball climbs out of the bowl entirely: it diverges.
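Because everything hinges on the factor (1 − 2η), the three regimes can be read off from its sign and size. The sketch below is one way to do that. The rates 0.25 and 0.85 come from this lesson, while 0.05 and 1.1 are assumed stand-ins for "too small" and "larger still", and the helper names are hypothetical.

```python
def factor(eta: float) -> float:
    """Per-step multiplier on theta for f(theta) = theta^2."""
    return 1.0 - 2.0 * eta


def regime(eta: float) -> str:
    """Classify a positive learning rate by the sign and size of its factor."""
    r = factor(eta)
    if r > 0.0:
        return "monotone"            # slides in from one side, never crosses
    if r == 0.0:
        return "one step"            # eta = 0.5 lands exactly on the minimum
    if r > -1.0:
        return "zig-zag, converges"  # crosses the minimum but shrinks
    if r == -1.0:
        return "bounces forever"     # eta = 1 flips the sign, keeps the size
    return "diverges"                # each step ends farther away


def distance_after(eta: float, theta0: float, steps: int) -> float:
    """Distance from the minimum after a number of steps."""
    return abs(theta0 * factor(eta) ** steps)


for eta in (0.05, 0.25, 0.85, 1.1):  # 0.05 and 1.1 are assumed values
    print(f"eta={eta:<5} factor={factor(eta):+.2f} "
          f"{regime(eta):<20} |theta| after 6 steps = {distance_after(eta, 4.0, 6):.4f}")
```

The classification only holds for this particular bowl. On a different curve the thresholds move with the curvature.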

Step By Step

The default run uses the "just right" rate η = 0.25, starting at θ = 4:

  1. Start. θ = 4, loss f(4) = 16. The ball sits high on the right arm.
  2. Measure. The tangent appears: f'(4) = 2·4 = 8. A steep positive slope — downhill is to the left.
  3. Step. θ ← 4 − 0.25·8 = 2. The ball rolls left and down; loss drops to 4. Notice the step vector under the axis: it is large, because the slope was large.
  4. Measure. f'(2) = 4 — still positive, but gentler.
  5. Step. θ ← 2 − 0.25·4 = 1; loss 1.
  6. Iterations 3–6 repeat the same beat, and because η = 0.25 makes the factor (1 − 2η) = 0.5, each step simply halves θ: 1 → 0.5 → 0.25 → 0.125 → 0.0625. The loss falls toward zero and the ball settles in the bottom of the bowl — converged.
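The walkthrough above can be reproduced in a few lines. This is a sketch that uses the same start (θ = 4) and rate (η = 0.25); the function name `run` and the row layout are choices made for this example only.

```python
def run(theta: float, eta: float, iterations: int) -> list:
    """Return one (iteration, theta_before, slope, theta_after) row per update."""
    rows = []
    for k in range(1, iterations + 1):
        slope = 2.0 * theta               # measure: f'(theta) = 2 * theta
        new_theta = theta - eta * slope   # step: move against the slope
        rows.append((k, theta, slope, new_theta))
        theta = new_theta
    return rows


trace = run(theta=4.0, eta=0.25, iterations=6)
for k, before, slope, after in trace:
    print(f"iter {k}: theta={before:<7g} slope={slope:<6g} "
          f"-> theta={after:<7g} loss={after * after:g}")
```

Each row pairs one "measure" with one "step", so six rows cover the full run from θ = 4 down to θ = 0.0625.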

Watch two visual signals as the run plays: the tangent flattens as the ball nears the minimum (the slope is shrinking), and the step vector under the axis gets shorter with it (smaller slope → smaller step). When the tangent is nearly flat and the step vector has nearly vanished, you are at the answer. Then switch the learning rate to η = 0.85 and watch the ball overshoot: the trail of past positions hops from one side of the valley to the other. The factor is now 1 − 2·0.85 = −0.7, so every step flips the sign of θ while shrinking it by 30%, and the run still converges, only less directly.
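To see the overshoot as numbers rather than as a trail, the sketch below replays the η = 0.85 run from θ = 4. The six-step length and the variable names are assumptions made for illustration.

```python
theta, eta = 4.0, 0.85  # start from the walkthrough, rate from the overshoot demo

trail = [theta]
for _ in range(6):                      # six steps, chosen for illustration
    theta = theta - eta * 2.0 * theta   # same update rule, larger rate
    trail.append(theta)

for k, position in enumerate(trail):
    side = "right" if position > 0 else "left"
    print(f"step {k}: theta = {position:+.4f} ({side} of the minimum)")
```

The sign alternates on every line while the magnitude shrinks, which is the zig-zag the trail draws on screen.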

Complexity

Quantity            Cost
Work per step       O(d) for d parameters (here d = 1)
Steps to converge   depends on η and the curvature, not on d
Memory              O(d), just the current parameters

Gradient descent does not solve for the minimum in one shot; it approaches it iteratively. On a convex bowl like this one a well-chosen η converges quickly and reliably. The number of steps grows as η shrinks, and if η is large enough to diverge the run never converges at all.
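For this particular bowl the step count can be worked out by hand, since every update multiplies θ by the same factor. The derivation below is a sketch for f(θ) = θ² only; the tolerance ε (how close to zero counts as "converged") is an assumption introduced here, not something the lesson defines.

```latex
\theta_k = (1 - 2\eta)^k \, \theta_0
\quad\Longrightarrow\quad
|\theta_k| \le \varepsilon
\iff
k \ge \frac{\ln\!\left(\varepsilon / |\theta_0|\right)}{\ln\lvert 1 - 2\eta \rvert}
\qquad \left(0 < \eta < 1,\ \eta \ne \tfrac{1}{2},\ \varepsilon < |\theta_0|\right)
```

With η = 0.25, θ₀ = 4, and ε = 0.0625 this gives k ≥ ln(1/64) / ln(1/2) = 6, matching the six iterations in the walkthrough. For small η, ln(1 − 2η) ≈ −2η, so k ≈ ln(|θ₀|/ε) / (2η): halving the learning rate roughly doubles the number of steps.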

Edge Cases

  • Already at the minimum. If θ = 0, then f'(θ) = 0, the step is zero, and the ball does not move — there is no downhill left.
  • Very small gradient (flat region). Near the bottom the slope is tiny, so steps shrink and progress slows. This is convergence, but it is also why deep flat regions ("plateaus") can stall real training.
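  • Learning rate at the stability edge. With f(θ) = θ², η = 1 makes the factor exactly −1, so the ball bounces between θ and −θ forever and the loss never changes. Any η > 1 makes |1 − 2η| > 1, so the steps grow instead of shrink and the loss climbs: the run diverges from any starting point other than θ = 0.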
  • Non-convex surfaces. A real loss curve has bumps and several valleys. Gradient descent only feels the local slope, so it can settle into a local minimum that is not the global one. This bowl has a single minimum on purpose.
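Several of these cases are easy to check numerically. The sketch below covers the stationary start, the stability edge, and a non-convex curve. The quartic `bumpy` is a hypothetical example picked only because it has two valleys; it does not appear in the lesson, and the rate 0.05 used on it is likewise an assumption.

```python
from typing import Callable


def descend(grad: Callable[[float], float], theta: float, eta: float, steps: int) -> float:
    """Apply theta <- theta - eta * grad(theta) a fixed number of times."""
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta


def bowl_grad(theta: float) -> float:
    return 2.0 * theta  # slope of f(theta) = theta^2


# Already at the minimum: zero slope, so the ball never moves.
at_minimum = descend(bowl_grad, theta=0.0, eta=0.25, steps=10)

# Stability edge: eta = 1 flips the sign each step without shrinking theta.
bounce = [descend(bowl_grad, 4.0, 1.0, k) for k in range(4)]

# Past the edge: eta = 1.1 (an assumed value) grows the distance every step.
diverged = descend(bowl_grad, 4.0, 1.1, 20)


# Non-convex: a hypothetical two-valley curve, not part of the lesson.
def bumpy(theta: float) -> float:
    return theta**4 - 3.0 * theta**2 + theta


def bumpy_grad(theta: float) -> float:
    return 4.0 * theta**3 - 6.0 * theta + 1.0


left = descend(bumpy_grad, theta=-2.0, eta=0.05, steps=200)
right = descend(bumpy_grad, theta=2.0, eta=0.05, steps=200)

print("start at minimum:", at_minimum)
print("eta = 1 bounce:", bounce)
print("eta = 1.1 after 20 steps:", diverged)
print(f"left valley:  theta={left:.4f} loss={bumpy(left):.4f}")
print(f"right valley: theta={right:.4f} loss={bumpy(right):.4f}")
```

On the two-valley curve the same rule with the same rate ends in different places depending only on the starting side, and the right-hand valley is the shallower of the two.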

Common Mistakes

  • Dropping the minus sign. θ ← θ + η·f'(θ) walks uphill — gradient ascent. The minus is what makes it descend.
  • Treating one learning rate as universal. A rate that converges on one problem overshoots on another. In practice η is tuned, and often decreased over time with a schedule.
  • Reading "large gradient" as "almost done". A large gradient means the curve is steep, which on a bowl like this one means you are far from the minimum: the opposite of done.
  • Confusing the gradient with the loss. The loss is the height of the curve; the gradient is its slope. You minimize the loss by following the gradient, but they are different quantities.
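The first and last mistakes are easy to see side by side. The sketch below flips the sign in the update and prints the loss and the gradient as separate numbers. The function names are hypothetical, and the start θ = 4 with η = 0.25 is reused from the walkthrough.

```python
def descent_step(theta: float, eta: float) -> float:
    return theta - eta * 2.0 * theta  # correct: step against the slope


def ascent_step(theta: float, eta: float) -> float:
    return theta + eta * 2.0 * theta  # the bug: plus sign walks uphill


down = up = 4.0
for _ in range(6):
    down = descent_step(down, 0.25)
    up = ascent_step(up, 0.25)

print("with minus sign:", down, "loss", down * down)
print("with plus sign: ", up, "loss", up * up)

# Loss and gradient are different quantities at the same point.
theta = 4.0
print("loss (height):", theta * theta, " gradient (slope):", 2.0 * theta)
```

With the plus sign the factor becomes (1 + 2η) = 1.5, so θ grows by half on every step instead of halving.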

A Note on Simplification

This is a deliberately simplified picture, meant to build intuition — not a substitute for a course on optimization. Real training optimizes millions of parameters at once (a surface in millions of dimensions, not a 1-D curve), estimates the gradient from noisy mini-batches of data rather than computing it exactly, navigates non-convex landscapes full of saddle points and local minima, and almost always uses momentum or adaptive methods (Adam) and a learning-rate schedule instead of the single fixed η shown here. The core move — measure the slope, step the other way — is genuinely the same; only the scale and the machinery around it grow.