
Transformer Series (4/9) — Scaled Dot-Product Attention

The full attention formula fits in a single line: softmax(Q @ K^T / sqrt(d_k)) @ V. Each part has a specific motivation. This post traces every step — from the input matrix X to the context-enriched output — with concrete numerical examples.


Q: What is the full attention formula?

Attention(Q, K, V) = softmax(Q @ K^T / sqrt(d_k)) @ V
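The formula translates almost one-to-one into NumPy. A minimal sketch (the function name and the hand-rolled softmax are mine, not from any library):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q @ K^T / sqrt(d_k)) @ V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (N, N) relevance scores
    # Row-wise softmax: each token's attention weights sum to 1.
    # Subtracting the row max first is a standard numerical-stability trick.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                # (N, d_model) context vectors
```

Each output row is a weighted mix of the value vectors, with the weights given by that token's softmaxed relevance scores.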

Q: Where do Q, K, V come from?

Each is the input X multiplied by a separate learned weight matrix:

Q = X @ W_Q    # (N, d_model) @ (d_model, d_model) = (N, d_model)
K = X @ W_K    # same shape
V = X @ W_V    # same shape

X is the output of token_embedding + positional_encoding. W_Q, W_K, W_V are learned parameters — updated by gradient descent during training.
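A quick shape check of the three projections (random matrices stand in for the learned weights here):

```python
import numpy as np

N, d_model = 3, 4                        # 3 tokens, embedding width 4
X = np.random.randn(N, d_model)          # token embedding + positional encoding
W_Q = np.random.randn(d_model, d_model)  # learned during training; random here
W_K = np.random.randn(d_model, d_model)
W_V = np.random.randn(d_model, d_model)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)         # (3, 4) (3, 4) (3, 4)
```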


Q: What do Q, K, V intuitively represent?

They represent three different “roles” for each token:

Q (queries): "what am I looking for?"
K (keys):    "what do I advertise to others?"
V (values):  "what information do I transmit when selected?"

Think of a conference: Q is your question, K is someone’s name badge, V is the actual knowledge they share once you find them.


Q: Why not just use X directly? Isn’t X already the value carried by each token?

You could do softmax(X @ X^T) @ X and it would technically work. But X contains everything about a token packed into one vector. When “dog” attends to “bites,” maybe it only needs one specific aspect — like the fact that it’s an action verb — not everything about “bites.”

The separate projections let the model decouple three roles:

W_Q: learns what each token should search for
W_K: learns what each token should advertise
W_V: learns what each token should transmit when selected

Using raw X for all three forces one representation to serve three different purposes. Separate weight matrices give the model more flexibility.
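For contrast, the no-projection variant mentioned above, softmax(X @ X^T) @ X, takes only a few lines (a sketch; scaling is omitted to match the formula as written):

```python
import numpy as np

def self_attention_no_projections(X):
    """softmax(X @ X^T) @ X: one vector per token plays all three roles."""
    scores = X @ X.T
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X
```

It runs and produces the right shapes — but every token must search, advertise, and transmit with the same vector, which is exactly the constraint the separate W_Q, W_K, W_V matrices remove.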


Q: Why was Q @ K^T designed this way? Is it just an arbitrary algorithm?

No — it follows from a clear motivation. The fundamental problem: each token needs to gather information from other tokens. “It” needs to figure out that “it” refers to “cat.” So you need a mechanism where each token can ask “which other tokens are relevant to me?”

The simplest way to measure relevance between two vectors is a dot product — similar directions = large value, unrelated = small value.

But what a token “looks for” differs from what it “advertises.” So instead of comparing raw embeddings, separate projections were introduced. The Q/K/V dot-product approach turned out to be both expressive and fast (matrix multiplication is extremely efficient on GPUs).

Could the authors have designed it differently? Yes — earlier attention mechanisms (such as Bahdanau-style additive attention) used learned additive scores instead of dot products. But the dot-product approach won out because of its efficiency on GPUs.


Q: What does Q @ K^T produce?

An (N, N) matrix where entry (i, j) is the dot product between token i’s query and token j’s key. Higher values mean “token i finds token j more relevant.”

Q is (N, d_model), K^T is (d_model, N), result is (N, N).

Example for "dog bites man" (with positional encoding):

            dog    bites   man
   dog  [  1.25,   0.80,  -0.20]
   bites[  0.80,   1.09,  -0.40]
   man  [ -0.20,  -0.40,   1.16]

Row 0 says: “dog” finds itself most relevant (1.25), “bites” somewhat relevant (0.80), and “man” not at all (a slightly negative -0.20).


Q: Can you show every matrix step-by-step for “dog bites man”?

Using d_model = 4, identity W_Q and W_K, with positional encoding added:

X = [[1.0,  0.5, 0.0, 0.0],    ← dog + pos0
     [0.3,  1.0, 0.0, 0.0],    ← bites + pos1
     [0.0, -0.4, 1.0, 0.0]]    ← man + pos2

Q = X (identity W_Q)
K = X (identity W_K)

K^T = [[1.0,  0.3,  0.0],
       [0.5,  1.0, -0.4],
       [0.0,  0.0,  1.0],
       [0.0,  0.0,  0.0]]

Q @ K^T computed via dot products:

Row 0 = [1.0, 0.5, 0.0, 0.0] (dog)
  dot col 0: (1.0×1.0)+(0.5×0.5)+0+0 = 1.25
  dot col 1: (1.0×0.3)+(0.5×1.0)+0+0 = 0.80
  dot col 2: (1.0×0.0)+(0.5×-0.4)+0+0 = -0.20

Row 1 = [0.3, 1.0, 0.0, 0.0] (bites)
  dot col 0: (0.3×1.0)+(1.0×0.5)+0+0 = 0.80
  dot col 1: (0.3×0.3)+(1.0×1.0)+0+0 = 1.09
  dot col 2: (0.3×0.0)+(1.0×-0.4)+0+0 = -0.40

Row 2 = [0.0, -0.4, 1.0, 0.0] (man)
  dot col 0: (0.0×1.0)+(-0.4×0.5)+0+0 = -0.20
  dot col 1: (0.0×0.3)+(-0.4×1.0)+0+0 = -0.40
  dot col 2: (0.0×0.0)+(-0.4×-0.4)+(1.0×1.0)+0 = 1.16

Result:
            dog    bites   man
   dog  [  1.25,   0.80,  -0.20]
   bites[  0.80,   1.09,  -0.40]
   man  [ -0.20,  -0.40,   1.16]
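The whole worked example can be verified in a few lines of NumPy reproducing the numbers above:

```python
import numpy as np

X = np.array([[1.0,  0.5, 0.0, 0.0],   # dog   + pos0
              [0.3,  1.0, 0.0, 0.0],   # bites + pos1
              [0.0, -0.4, 1.0, 0.0]])  # man   + pos2

Q = K = X                               # identity W_Q and W_K, as above
scores = Q @ K.T                        # (3, 3) relevance matrix

expected = np.array([[ 1.25,  0.80, -0.20],
                     [ 0.80,  1.09, -0.40],
                     [-0.20, -0.40,  1.16]])
assert np.allclose(scores, expected)
```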

Q: What is d_k and is it the same as d_model?

In single-head attention, yes — d_k = d_model. But d_k exists as a separate term because in multi-head attention, d_k = d_model / num_heads. With d_model = 512 and 8 heads, d_k = 64. The formula uses d_k to stay general.
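A shape-only sketch of how d_k falls out of the head split (multi-head attention is covered later in this series; the reshape below just illustrates the arithmetic):

```python
import numpy as np

d_model, num_heads = 512, 8
d_k = d_model // num_heads               # 64 dimensions per head
N = 10                                   # sequence length (arbitrary)

Q = np.random.randn(N, d_model)
# Split the model dimension into num_heads slices of width d_k each:
Q_heads = Q.reshape(N, num_heads, d_k).transpose(1, 0, 2)
print(Q_heads.shape)                     # (8, 10, 64)
```

Each head then runs the same scaled dot-product attention, just on its own d_k-wide slice — which is why the formula is written with sqrt(d_k) rather than sqrt(d_model).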