Convolutional Neural Networks

Intuition

A regular neural network throws away the most useful fact about an image: that nearby pixels belong together. A convolutional network keeps it. Instead of one giant weight for every pixel, a CNN learns a tiny filter — a small grid of weights — and slides that same filter across the whole image, asking the same question at every location: "does the patch right here look like the pattern I detect?" A filter tuned to vertical edges lights up wherever a vertical edge appears, no matter where in the image it sits. That single idea — a small, reused detector swept across space — is what makes CNNs both efficient (one small filter, not millions of per-pixel weights) and translation-aware (the same feature is found anywhere). This lesson runs one such filter through the three operations that make up a convolution layer: convolve → ReLU → pool.

How It Works

A convolution places the filter over a small patch of the image, multiplies each pixel by the weight sitting on top of it, and adds the results into a single number. That number is one cell of the feature map — a measure of how strongly the patch matches the filter. Slide the filter one step right and repeat; when the row ends, drop down and continue. Because we never let the filter hang off the edge ("valid" convolution), a 6×6 image with a 3×3 filter produces a 4×4 feature map. The key is that the same nine weights are reused at every position — that weight sharing is the whole efficiency win.

Raw convolution outputs can be negative (a vertical-edge filter answers very negative at an edge of the opposite contrast). A ReLU — max(0, x) — clamps every negative response to zero, keeping only the locations that actually match the filter and giving the network its nonlinearity.

Finally, max-pooling shrinks the feature map. Slide a small window (here 2×2) across it taking only the largest value in each window, and move by the window's width (stride 2) so the windows don't overlap. This halves each dimension: the 4×4 map becomes 2×2. Pooling makes the representation smaller and a little position-invariant — the exact pixel of the strongest match stops mattering, only that it was somewhere in the window. Stack convolve → ReLU → pool a few times, each layer with many filters, and early layers learn edges while later layers combine them into shapes and objects.

Step By Step

The default run uses a 6×6 image with a bright vertical bar down the middle (columns 2–3 are 9, the rest 0) and the vertical-edge filter [[1, 0, −1], [1, 0, −1], [1, 0, −1]]. Each cell of that filter multiplies the left column of a patch by +1 and the right column by −1, so it measures left brightness minus right brightness.

Convolve, position (0, 0). The top-left 3×3 patch covers the dark left margin and the left edge of the bar. Left column is dark, right column is bright, so left − right is negative: the products sum to −18. The filter reports "a dark→bright edge here," the opposite of what it's tuned for.
Slide across. Move the window right one step at a time, filling one feature-map cell per position. Over the bar's right edge (bright → dark) the sum is +18 at the top and +27 in the middle rows. Over flat regions it is near zero. After all 16 positions the 4×4 feature map reads, row by row, [−18, −18, 18, 18], [−27, −27, 27, 27], [−27, −27, 27, 27], [−18, −18, 18, 18]. The same nine weights produced every one of those numbers — that is weight sharing in action.
ReLU. Apply max(0, x). The negative left-edge responses become 0; the positive right-edge responses survive. The map becomes [0, 0, 18, 18], [0, 0, 27, 27], [0, 0, 27, 27], [0, 0, 18, 18] — only the edge whose contrast matches the filter is kept.
Max-pool, 2×2 / stride 2. Take the largest value in each non-overlapping 2×2 window. The two left windows are all zeros → 0; the two right windows contain the 27s → 27. The pooled output is [[0, 27], [0, 27]]: a clean, half- size summary that says "a strong vertical edge runs down the right side."

Watch the amber window slide across the image while the feature map fills one cell at a time; watch the blue (negative) cells flip to empty on the ReLU beat; watch the 2×2 window halve the map on the pooling beats. Switch the filter to Horizontal edge in the panel to see the exact same machinery detect a horizontal bar instead — the feature map and pooled output simply transpose.

Complexity

Operation	Cost
Convolution	O(out_h · out_w · k²) — one k×k dot product per output cell
ReLU	O(feature-map size) — one comparison per cell
Max-pool	O(feature-map size) — one pass, each cell read once

The headline fact: the filter has only k² weights (here 9) no matter how large the image is, because those weights are reused at every position. That is why convolution scales to real images where a fully-connected layer would not.

Edge Cases

Valid vs padded. This lesson uses valid convolution (the filter never hangs off the image), so the output shrinks 6×6 → 4×4. Adding zero-padding around the border would keep the output the same size as the input.
All-flat region. Where every pixel in a patch is equal, left minus right is zero — the filter correctly reports "no edge here."
Negative responses. An edge of the opposite contrast gives a large negative convolution value; ReLU is what discards it. Without ReLU those would survive as spurious activations.
Pooling that doesn't divide evenly. With stride 2 over an odd-sized map the last window can fall off the edge; real implementations either pad or drop the remainder. The 4×4 → 2×2 case here divides cleanly.

Common Mistakes

Forgetting weight sharing. The filter is one small grid reused everywhere — not a different set of weights per position. Mistaking it for per-pixel weights loses the entire point of a CNN.
Skipping the nonlinearity. Without ReLU, stacking convolutions collapses into a single linear operation — the network could not learn anything a plain linear filter couldn't.
Confusing stride and window size. The window size sets how many cells each pooling step reads; the stride sets how far it jumps. Equal stride and size give non-overlapping windows (the usual downsampling case).
Reading the feature map as the image. The feature map is not a smaller picture — it is a map of filter responses. A bright cell means "the filter matched strongly here," not "this pixel was bright."

A Note on Simplification

This is a deliberately tiny picture meant to make the convolution operation visible — not a real CNN. A production network applies many filters per layer (not one), learns those filter weights by training (here they are hand-picked edge detectors), stacks many convolve→activate→pool layers, and adds machinery this view omits: padding, multiple input/output channels, batch normalization, and strided or dilated convolutions. It also runs on real images of hundreds of pixels per side, not a 6×6 toy. What is shown — slide a shared filter, take a dot product, apply a nonlinearity, then pool — is exactly the per-layer operation those larger networks scale up. Treat this as an explainer, not a substitute for a formal computer-vision course.