Convolutional Neural Networks

Intuition

A regular, fully connected neural network throws away the most useful fact about an image: that nearby pixels belong together. A convolutional network keeps it. Instead of learning a separate weight for every pixel, a CNN learns a tiny filter (a small grid of weights) and slides that same filter across the whole image, asking the same question at every location: "does the patch right here look like the pattern I detect?" A filter tuned to vertical edges lights up wherever a vertical edge appears, no matter where in the image it sits. That single idea, a small, reused detector swept across space, is what makes CNNs both efficient (one small filter, not millions of per-pixel weights) and translation-aware (the same feature is found anywhere). This lesson runs one such filter through the three operations that make up a typical convolution block: convolve → ReLU → pool.

How It Works

A convolution places the filter over a small patch of the image, multiplies each pixel by the weight sitting on top of it, and adds the results into a single number. That number is one cell of the feature map: a measure of how strongly the patch matches the filter. Slide the filter one step right and repeat; when the row ends, drop down one row and continue. Because we never let the filter hang off the edge ("valid" convolution), an n×n image with a k×k filter has n − k + 1 positions per dimension, so a 6×6 image with a 3×3 filter produces a 4×4 feature map (6 − 3 + 1 = 4). The key is that the same nine weights are reused at every position; that weight sharing is the whole efficiency win.
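
The sliding dot product described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `conv2d_valid`, the square-kernel assumption, and the brightness-ramp test image are all hypothetical choices made for this example.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding and return the feature map."""
    k = len(kernel)                      # assumes a square k x k kernel
    out_h = len(image) - k + 1           # valid convolution: n - k + 1 positions
    out_w = len(image[0]) - k + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Dot product of the kernel with the patch whose top-left corner is (r, c).
            total = sum(
                kernel[i][j] * image[r + i][c + j]
                for i in range(k)
                for j in range(k)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map


# Hypothetical 6x6 input: brightness rises by 1 per column, left to right.
ramp = [[c for c in range(6)] for _ in range(6)]
vertical_edge = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]

fmap = conv2d_valid(ramp, vertical_edge)
print(len(fmap), len(fmap[0]))  # 4 4
print(fmap[0])                  # [-6, -6, -6, -6]
```

Every output cell is computed from the same nine numbers in `vertical_edge`; only the patch position changes. On the ramp, each patch row contributes left minus right = −2, and three rows give −6 everywhere.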

Raw convolution outputs can be negative (a vertical-edge filter gives a strongly negative response at an edge of the opposite contrast). A ReLU, max(0, x), clamps every negative response to zero, keeping only the locations that actually match the filter and giving the network its nonlinearity.
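
As a small sketch, ReLU applied to a feature map stored as a list of lists looks like this (the name `relu` and the sample values are illustrative assumptions):

```python
def relu(feature_map):
    """Apply max(0, x) to every cell, leaving positive responses unchanged."""
    return [[max(0, value) for value in row] for row in feature_map]


sample = [[-18, 18], [-27, 27]]
print(relu(sample))  # [[0, 18], [0, 27]]
```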

Finally, max-pooling shrinks the feature map. Slide a small window (here 2×2) across it taking only the largest value in each window, and move by the window's width (stride 2) so the windows don't overlap. This halves each dimension: the 4×4 map becomes 2×2. Pooling makes the representation smaller and a little position-invariant — the exact pixel of the strongest match stops mattering, only that it was somewhere in the window. Stack convolve → ReLU → pool a few times, each layer with many filters, and early layers learn edges while later layers combine them into shapes and objects.
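
A minimal max-pooling sketch follows. The function name `max_pool`, its `size` and `stride` parameters, and the sample map are assumptions made for illustration; windows that would run past the edge are simply dropped.

```python
def max_pool(feature_map, size=2, stride=2):
    """Keep the largest value in each size x size window, stepping by `stride`."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(
                feature_map[r * stride + i][c * stride + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


sample = [
    [1, 2, 5, 6],
    [3, 4, 7, 8],
    [9, 8, 1, 0],
    [7, 6, 3, 2],
]
print(max_pool(sample))  # [[4, 8], [9, 3]]
```

With `size` equal to `stride` the windows tile the map without overlap, which is what halves each dimension here.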

Step By Step

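The default run uses a 6×6 image with a bright vertical bar in the middle (columns 2–3 are 9 in rows 1–4, the rest 0) and the vertical-edge filter [[1, 0, −1], [1, 0, −1], [1, 0, −1]]. Each row of that filter multiplies the left pixel of a patch row by +1, the middle pixel by 0, and the right pixel by −1, so the filter measures left-column brightness minus right-column brightness. The bar stops one row short of the top and bottom edges, which is why windows there see only two bright rows (±18) while middle windows see three (±27).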

  1. Convolve, position (0, 0). The top-left 3×3 patch covers the dark left margin and the left edge of the bar. Left column is dark, right column is bright, so left − right is negative: the products sum to −18. The filter reports "a dark→bright edge here," the opposite of what it's tuned for.
  2. Slide across. Move the window right one step at a time, filling one feature-map cell per position. Over the bar's right edge (bright → dark) the sum is +18 in the top and bottom rows and +27 in the middle rows. A completely flat patch would give exactly zero, but in this small image every window overlaps the bar, so no zeros appear. After all 16 positions the 4×4 feature map reads, row by row, [−18, −18, 18, 18], [−27, −27, 27, 27], [−27, −27, 27, 27], [−18, −18, 18, 18]. The same nine weights produced every one of those numbers: that is weight sharing in action.
  3. ReLU. Apply max(0, x). The negative left-edge responses become 0; the positive right-edge responses survive. The map becomes [0, 0, 18, 18], [0, 0, 27, 27], [0, 0, 27, 27], [0, 0, 18, 18] — only the edge whose contrast matches the filter is kept.
  4. Max-pool, 2×2 / stride 2. Take the largest value in each non-overlapping 2×2 window. The two left windows are all zeros → 0; the two right windows contain the 27s → 27. The pooled output is [[0, 27], [0, 27]]: a clean, half-size summary that says "a strong vertical edge runs down the right side."
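
The four steps above can be reproduced end to end with a short script. One assumption is worth stating loudly: the sketch places the bar in columns 2–3 of rows 1–4 only, leaving the top and bottom rows dark, because that layout is the one that yields the 18s and 27s listed above (a full-height bar would give ±27 in every row). The helper names are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Valid (no padding) sliding dot product of a square kernel over an image."""
    k = len(kernel)
    return [
        [
            sum(kernel[i][j] * image[r + i][c + j] for i in range(k) for j in range(k))
            for c in range(len(image[0]) - k + 1)
        ]
        for r in range(len(image) - k + 1)
    ]


def relu(fmap):
    """Clamp negative responses to zero."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool_2x2(fmap):
    """Non-overlapping 2x2 max-pooling (stride 2); assumes even dimensions."""
    return [
        [
            max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
            for c in range(0, len(fmap[0]), 2)
        ]
        for r in range(0, len(fmap), 2)
    ]


# ASSUMPTION: the bar fills columns 2-3 of rows 1-4; rows 0 and 5 are dark.
image = [
    [9 if 1 <= r <= 4 and 2 <= c <= 3 else 0 for c in range(6)]
    for r in range(6)
]
vertical_edge = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]

feature_map = conv2d_valid(image, vertical_edge)
activated = relu(feature_map)
pooled = max_pool_2x2(activated)

print(feature_map[0])  # [-18, -18, 18, 18]
print(feature_map[1])  # [-27, -27, 27, 27]
print(activated[1])    # [0, 0, 27, 27]
print(pooled)          # [[0, 27], [0, 27]]
```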

Watch the amber window slide across the image while the feature map fills one cell at a time; watch the blue (negative) cells flip to empty on the ReLU beat; watch the 2×2 window halve the map on the pooling beats. Switch the filter to Horizontal edge in the panel to see the exact same machinery detect a horizontal bar instead — the feature map and pooled output simply transpose.

Complexity

Operation      Cost
Convolution    O(out_h · out_w · k²): one k×k dot product per output cell
ReLU           O(feature-map size): one comparison per cell
Max-pool       O(feature-map size): one pass, each cell read once

The headline fact: the filter has only k² weights (here 9) no matter how large the image is, because those weights are reused at every position. That is why convolution scales to real images where a fully-connected layer would not.
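
To make the scaling argument concrete, the sketch below counts weights and multiplications for one filter. The function `conv_cost` is hypothetical, and the "dense" figure is an illustrative comparison that assumes a fully-connected layer mapping every input pixel to the same number of outputs, with biases ignored.

```python
def conv_cost(in_h, in_w, k):
    """Count weights and multiplies for one k x k filter under valid convolution."""
    out_h, out_w = in_h - k + 1, in_w - k + 1
    return {
        "conv_weights": k * k,                              # independent of image size
        "conv_multiplies": out_h * out_w * k * k,           # one k x k dot product per cell
        "dense_weights": (in_h * in_w) * (out_h * out_w),   # one weight per input-output pair
    }


print(conv_cost(6, 6, 3))
# {'conv_weights': 9, 'conv_multiplies': 144, 'dense_weights': 576}

# A hypothetical 224x224 image still needs only 9 filter weights.
print(conv_cost(224, 224, 3)["conv_weights"])  # 9
```

The multiply count grows with the image, but the number of learned weights stays fixed at k².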

Edge Cases

  • Valid vs padded. This lesson uses valid convolution (the filter never hangs off the image), so the output shrinks 6×6 → 4×4. Adding one pixel of zero-padding on every side (enough for a 3×3 filter at stride 1) would keep the output the same size as the input.
  • All-flat region. Where every pixel in a patch is equal, left minus right is zero — the filter correctly reports "no edge here."
  • Negative responses. An edge of the opposite contrast gives a large negative convolution value; ReLU is what discards it. Without ReLU those would survive as spurious activations.
  • Pooling that doesn't divide evenly. With stride 2 over an odd-sized map the last window can fall off the edge; real implementations either pad or drop the remainder. The 4×4 → 2×2 case here divides cleanly.
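
Two of these edge cases come down to simple size arithmetic, sketched below. The helpers `zero_pad` and `output_size` are hypothetical names, and `output_size` assumes the "drop the remainder" convention; implementations that pad instead would report one more window.

```python
def zero_pad(image, pad=1):
    """Surround the image with `pad` rows and columns of zeros."""
    width = len(image[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded.extend([0] * width for _ in range(pad))
    return padded


def output_size(n, k, stride=1):
    """Window positions along one dimension, dropping any partial window."""
    return (n - k) // stride + 1


image = [[1] * 6 for _ in range(6)]
padded = zero_pad(image, pad=1)

print(len(padded), len(padded[0]))   # 8 8
print(output_size(6, 3))             # 4  (valid: 6x6 -> 4x4)
print(output_size(8, 3))             # 6  (padded input: output matches the original 6x6)
print(output_size(4, 2, stride=2))   # 2  (pooling divides cleanly)
print(output_size(5, 2, stride=2))   # 2  (last row and column are dropped)
```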

Common Mistakes

  • Forgetting weight sharing. The filter is one small grid reused everywhere — not a different set of weights per position. Mistaking it for per-pixel weights loses the entire point of a CNN.
  • Skipping the nonlinearity. Without ReLU, stacking convolutions collapses into a single linear operation — the network could not learn anything a plain linear filter couldn't.
  • Confusing stride and window size. The window size sets how many cells each pooling step reads; the stride sets how far it jumps. Equal stride and size give non-overlapping windows (the usual downsampling case).
  • Reading the feature map as the image. The feature map is not a smaller picture — it is a map of filter responses. A bright cell means "the filter matched strongly here," not "this pixel was bright."
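
The point about skipping the nonlinearity can be checked numerically. The sketch below works in one dimension to stay short: two filters applied back to back give exactly the same output as one combined filter, and inserting a ReLU between them breaks that equivalence. All names, kernels, and the signal are illustrative assumptions.

```python
def correlate1d(signal, kernel):
    """Valid sliding dot product in one dimension."""
    k = len(kernel)
    return [
        sum(kernel[i] * signal[n + i] for i in range(k))
        for n in range(len(signal) - k + 1)
    ]


def combine_kernels(a, b):
    """Single kernel equivalent to applying `a` and then `b` with no nonlinearity."""
    combined = [0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            combined[i + j] += a_i * b_j
    return combined


def relu1d(values):
    return [max(0, v) for v in values]


signal = [0, 0, 9, 9, 0, 0, 9, 0]   # hypothetical 1-D "image"
edge = [1, 0, -1]
smooth = [1, 1]

two_layers = correlate1d(correlate1d(signal, edge), smooth)
one_layer = correlate1d(signal, combine_kernels(edge, smooth))
print(two_layers == one_layer)  # True: the stack collapsed into one linear filter

with_relu = correlate1d(relu1d(correlate1d(signal, edge)), smooth)
print(with_relu == one_layer)   # False: the ReLU makes the stack non-linear
```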

A Note on Simplification

This is a deliberately tiny picture meant to make the convolution operation visible — not a real CNN. A production network applies many filters per layer (not one), learns those filter weights by training (here they are hand-picked edge detectors), stacks many convolve→activate→pool layers, and adds machinery this view omits: padding, multiple input/output channels, batch normalization, and strided or dilated convolutions. It also runs on real images of hundreds of pixels per side, not a 6×6 toy. What is shown — slide a shared filter, take a dot product, apply a nonlinearity, then pool — is exactly the per-layer operation those larger networks scale up. Treat this as an explainer, not a substitute for a formal computer-vision course.