Self-Attention

Intuition

A word's meaning depends on the words around it. "Bank" next to "river" is a slope of earth; "bank" next to "savings" is a place for money. Self-attention is how a Transformer lets every token look at every other token and pull in the parts that matter, so each token's vector becomes context-aware.

The mental model: each token asks a question (its Query), every token wears a label (its Key), and carries some content (its Value). A token compares its question against all the labels, decides how much to listen to each token, and then mixes their content together in those proportions. "Pay attention to" is literal — the weights are a distribution that says where to look.

How It Works

Start with the token vectors stacked as the rows of a matrix X, one row per token. Three learned weight matrices turn each token into three different vectors:

Query q — what this token is looking for.
Key k — what this token offers to be matched against.
Value v — what this token contributes if it's attended to.
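As a rough illustration of the projection step, here is a minimal sketch in plain Python. The helper name `project`, the weight names `W_q`, `W_k`, `W_v`, and all the numbers are assumptions made up for this example; they are not the lesson's vectors, and in a real model the weights are learned.

```python
def project(X, W):
    """Return X @ W for row-vector tokens, using plain Python lists."""
    return [
        [sum(x[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]
        for x in X
    ]

# Hypothetical toy input: 2 tokens, dimension 2 (not the lesson's vectors).
X = [[1.0, 0.0],
     [0.0, 2.0]]

# Hypothetical projection weights; in a real model these are learned.
W_q = [[1.0, 1.0],
       [0.0, 1.0]]
W_k = [[2.0, 0.0],
       [0.0, 1.0]]
W_v = [[0.0, 1.0],
       [1.0, 0.0]]

Q = project(X, W_q)  # one Query per token
K = project(X, W_k)  # one Key per token
V = project(X, W_v)  # one Value per token

print(Q)  # [[1.0, 1.0], [0.0, 2.0]]
print(K)  # [[2.0, 0.0], [0.0, 2.0]]
print(V)  # [[0.0, 1.0], [2.0, 0.0]]
```

The same token row produces three different vectors because it is multiplied by three different matrices.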

For one token, attention runs in three moves:

Score. Compare its Query to every Key with a dot product, q · k. A bigger dot product means "this Key is more relevant to this Query." Divide every score by √dₖ, the square root of the Key dimension. Without that scaling, dot products tend to grow with the dimension and push the softmax in the next step toward saturation.
Softmax. Turn the raw scores into attention weights — positive numbers that sum to 1. Now they read as "spend 69% of my attention here, 14% there."
Blend. Add up every token's Value, each multiplied by its weight. The result is the token's new vector: a weighted summary of what it attended to.
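The three moves for a single token can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `softmax` and `attend` and the toy query, keys, and values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(q)
    # Score: dot the query with every key, then divide by sqrt(d_k).
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
    # Softmax: scores become attention weights.
    weights = softmax(scores)
    # Blend: weighted sum of the value vectors.
    output = [sum(w * v[j] for w, v in zip(weights, values))
              for j in range(len(values[0]))]
    return weights, output

# Hypothetical toy data: the query lines up with the first key.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]

weights, output = attend(q, keys, values)
print([round(w, 2) for w in weights])  # [0.67, 0.33]
print([round(o, 2) for o in output])   # [6.7, 3.3]
```

The output leans toward the first Value because the first Key matched the Query better.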

Run those three moves for every token at once and you get the attention matrix — row i is token i's distribution over all the tokens.
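Written compactly, with Q, K, and V holding one row per token, the whole computation is usually stated like this (standard notation, where A is the attention matrix and the softmax runs over each row):

```latex
\[
A = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right),
\qquad
A_{ij} = \frac{\exp\!\left(q_i \cdot k_j / \sqrt{d_k}\right)}
              {\sum_{l} \exp\!\left(q_i \cdot k_l / \sqrt{d_k}\right)},
\qquad
\mathrm{Attention}(Q, K, V) = A\,V
\]
```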

Step By Step

The default sentence is "by the river bank", and we follow the query for "bank" (the same walk the animation shows).

Tokens. Four tokens, each a small vector.
Project. Every token sprouts a Query, a Key, and a Value. Bank's query comes out as [2.3, 2.3, 0, 0] — it's looking for a content word.
Score each key. Dot bank's Query with every Key, then divide by √dₖ = 2:
- by → 0.46
- the → 0.0
- river → 2.3 ← much larger: bank's query lines up with river's key
- bank (itself) → 0.69
Softmax. Those scores become weights that sum to 1: [0.11, 0.07, 0.69, 0.14] (rounded to two decimals, which is why the printed values add up to 1.01). Bank attends 69% to river, and only a little to the rest.
Blend. Multiply each token's Value by its weight and add them up. River's Value dominates, so bank's new vector lands at about [1.43, 0.06, 0.14, 0] — the "geographic" direction. Bank now carries river.
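The softmax step of this walkthrough can be checked with a few lines of Python. The sketch below uses only the four scaled scores quoted above; the Key and Value vectors are not listed in the text, so the blend step is not reproduced here. The function name `softmax` is an assumption of this example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["by", "the", "river", "bank"]
# Scores from the walkthrough, already divided by sqrt(d_k) = 2.
scaled_scores = [0.46, 0.0, 2.3, 0.69]

weights = softmax(scaled_scores)
for token, w in zip(tokens, weights):
    print(f"{token:>5}: {w:.2f}")
# Prints:
#    by: 0.11
#   the: 0.07
# river: 0.69
#  bank: 0.14
```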

Switch the context tab to "at the savings bank" and the exact same "bank" embedding attends to "savings" instead, and its output blends in the "financial" Value. Nothing about the word changed; only what it could attend to did.

Watch the matrix row fill in: numbers during scoring, then heat after softmax. The thick violet ray on the attend beat points straight at the token that bank decided to listen to.

Try the Diagnostics

The Normalization selector lets you break the formula on purpose and watch what happens — the fastest way to feel why it's built this way. Only Scaled is real attention; the other two are labelled as experiments on the canvas.

Scaled — the real formula above: softmax(scores / √dₖ).
No √dₖ — drop the scaling before softmax. The dot products are large, so softmax saturates: nearly all the weight collapses onto one token (a near one-hot spike). That is exactly the failure the √dₖ prevents.
Uniform — ignore the scores and force every weight to 1/n. Bank's output becomes a featureless average of all four Values: no disambiguation — the baseline of a head that learned nothing.
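The three modes can be compared numerically on bank's scores from the walkthrough. This is a sketch, not the page's own code: the raw scores below are simply the quoted scaled scores multiplied back by √dₖ = 2, and the function name `softmax` is an assumption of this example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 4
# Raw dot products for by, the, river, bank (scaled scores times 2).
raw_scores = [0.92, 0.0, 4.6, 1.38]

scaled = softmax([s / math.sqrt(d_k) for s in raw_scores])  # real attention
unscaled = softmax(raw_scores)                              # "No sqrt(d_k)" experiment
uniform = [1 / len(raw_scores)] * len(raw_scores)           # "Uniform" experiment

print([round(w, 2) for w in scaled])    # [0.11, 0.07, 0.69, 0.14]
print([round(w, 2) for w in unscaled])  # [0.02, 0.01, 0.93, 0.04]
print(uniform)                          # [0.25, 0.25, 0.25, 0.25]
```

Even with only four dimensions, dropping the scaling moves river from about 69% to about 93% of the weight; with larger dimensions the raw dot products tend to be larger and the spike sharper.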

The full matrix also shows that attention is not symmetric: on the parallel beat the highlighted pair makes it concrete — bank → river and river → bank are different numbers. Each row is its own distribution, so "i attends to j" never implies the reverse.
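A small sketch makes the asymmetry easy to see. The two-token Queries and Keys below are hypothetical values chosen for this illustration (they are not the lesson's vectors), and the helper names `softmax` and `attention_matrix` are assumptions of this example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_matrix(Q, K):
    """Row i holds token i's attention distribution over all tokens."""
    d_k = len(Q[0])
    return [
        softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K])
        for q in Q
    ]

# Hypothetical Queries and Keys for two tokens.
Q = [[1.0, 0.0],
     [0.0, 1.0]]
K = [[0.0, 2.0],
     [1.0, 0.0]]

A = attention_matrix(Q, K)
print(round(A[0][1], 2))  # 0.67  (token 0 attending to token 1)
print(round(A[1][0], 2))  # 0.8   (token 1 attending to token 0)
```

Each row sums to 1 on its own, so nothing forces entry (0, 1) to equal entry (1, 0).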

Complexity

| Step | Cost |
| --- | --- |
| Project Q, K, V | O(n · d²) |
| Scores `Q Kᵀ` | O(n² · d) |
| Softmax | O(n²) |
| Blend `weights · V` | O(n² · d) |

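With n tokens of dimension d, the score and blend steps each cost O(n² · d), because every token is compared with every other token. Adding the projections gives O(n² · d + n · d²) overall, and the n² term dominates once the sequence is longer than the dimension. That quadratic growth in n is exactly why long context windows are expensive, and why much Transformer research is about making attention cheaper.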
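To get a feel for the quadratic growth, the costs in the table can be turned into rough operation counts. This is a back-of-the-envelope sketch for a single head that ignores constant factors; the function name `attention_cost` and the example sizes are assumptions of this example.

```python
def attention_cost(n, d):
    """Rough multiply-add counts per step for one head (constants ignored)."""
    return {
        "project_qkv": 3 * n * d * d,  # three n x d by d x d products
        "scores": n * n * d,           # Q K^T
        "softmax": n * n,              # one exp and one divide per entry
        "blend": n * n * d,            # weights times V
    }

short = attention_cost(n=1000, d=64)
long = attention_cost(n=2000, d=64)

print(short["scores"], long["scores"])            # 64000000 256000000
print(short["project_qkv"], long["project_qkv"])  # 12288000 24576000
```

Doubling the sequence length doubles the projection cost but quadruples the score and blend costs.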

Edge Cases

A token attends to itself. Self-attention includes the diagonal — here bank keeps 14% on itself. That's normal, not a bug.
Uniform attention. If every score were equal, softmax would return 1/n everywhere — the output becomes the plain average of all Values.
The √dₖ scaling matters. Without it, large dot products push softmax into a near one-hot spike, and gradients during training nearly vanish.
Weights always sum to 1. Softmax is taken across the keys (one row), so each token spends exactly 100% of its attention — never more, never less.
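Two of these cases can be checked directly. The sketch below uses made-up scores, and it measures saturation with the standard softmax derivative p · (1 − p), the slope of a weight with respect to its own score. The function name `softmax` and all the numbers are assumptions of this example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Equal scores: softmax returns 1/n everywhere.
equal = softmax([1.7, 1.7, 1.7, 1.7])
print(equal)  # [0.25, 0.25, 0.25, 0.25]

# Moderate vs. very large top score (hypothetical values).
moderate = softmax([0.0, 0.0, 2.0, 0.0])
saturated = softmax([0.0, 0.0, 20.0, 0.0])

# Slope of the top weight with respect to its own score: p * (1 - p).
slope_moderate = moderate[2] * (1 - moderate[2])
slope_saturated = saturated[2] * (1 - saturated[2])

print(round(slope_moderate, 2))  # 0.21
print(slope_saturated < 1e-6)    # True
```

Once one weight is almost 1, nudging the scores barely changes the output, which is why a saturated softmax passes back almost no gradient.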

Common Mistakes

Confusing the three roles. Query, Key, and Value are three different projections of the same token, not one vector reused. A token's query need not match its own key.
Softmax over the wrong axis. Normalize across the keys (each query's row sums to 1), not across queries. Getting the axis wrong silently breaks the whole layer.
Forgetting to scale. Dropping the / √dₖ is a real bug, not a rounding detail — it changes how sharp the attention becomes.
Reading the matrix backwards. Row i, column j is "how much token i (the query) attends to token j (the key)." The matrix is not symmetric: bank attending to river doesn't mean river attends equally to bank.
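The axis mistake is easy to reproduce. In this sketch the score matrix is a hypothetical 2 × 2 example with rows as queries and columns as keys, and the helper names `softmax_rows` and `softmax_cols` are assumptions of this example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_rows(S):
    """Correct: normalize across the keys, one query row at a time."""
    return [softmax(row) for row in S]

def softmax_cols(S):
    """Wrong axis: normalize across the queries, one key column at a time."""
    cols = [softmax(list(col)) for col in zip(*S)]
    return [list(row) for row in zip(*cols)]

# Hypothetical scores: rows are queries, columns are keys.
S = [[0.0, 1.0],
     [2.0, 0.0]]

right = softmax_rows(S)
wrong = softmax_cols(S)

print([round(sum(row), 2) for row in right])  # [1.0, 1.0]
print([round(sum(row), 2) for row in wrong])  # [0.85, 1.15]
```

With the wrong axis no error is raised; the rows simply stop being distributions, so one token spends less than 100% of its attention and another spends more.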

A Note on Simplification

This is a deliberately tiny, hand-built example, not a trained model. To keep the vectors readable as numbers, the dimension is d = 4 (so √dₖ = 2), there is a single attention head, and the embeddings and projection weights were chosen by hand so the bank → river / bank → savings pattern is clear. The mechanism — scaled dot-product attention — is exactly the real one. What's scaled up in practice is everything around it: real Transformers use dimensions of 512 or more, many heads in parallel (each learning a different pattern), and dozens of stacked layers, all with weights learned from data. Treat this as an explainer for how one head works, not a substitute for the full architecture.
