These notes are inspired by and generated from my conversation with Claude Sonnet 4.6


Two phases of inference

Every LLM inference call splits into two distinct phases:

  • Prefill — the full prompt (n tokens) is processed in one parallel forward pass, identical to training. Produces the KV cache and the first generated token.
  • Generation — one new token at a time, using the cached K and V from all previous tokens. Sequential and memory-bandwidth-bound.
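The two phases can be sketched with a toy single-head attention layer. This is a minimal illustration under assumptions of my own (the weight names W_q, W_k, W_v and the "feed the attention output back in" decode step are made up for the sketch, not any real model's API):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy single-head "model": fixed random projection weights.
d_model = 16
W_q = rng.standard_normal((d_model, d_model))
W_k = rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model))

def attend(Q, K, V):
    # Scaled dot-product attention; each row of Q attends over the rows of K/V.
    scores = Q @ K.T / np.sqrt(d_model)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

# Prefill: the whole prompt (n = 5 tokens) in one parallel pass
# produces K and V for every position at once.
prompt = rng.standard_normal((5, d_model))
K_cache = prompt @ W_k          # (5, d_model)
V_cache = prompt @ W_v          # (5, d_model)

# Generation: one token at a time; each step appends its own k, v to the cache.
x = rng.standard_normal((1, d_model))        # embedding of the newest token
for _ in range(3):
    k, v = x @ W_k, x @ W_v
    K_cache = np.vstack([K_cache, k])        # cache grows by one slot per step
    V_cache = np.vstack([V_cache, v])
    x = attend(x @ W_q, K_cache, V_cache)    # toy stand-in for the next decode input

print(K_cache.shape)                         # → (8, 16): 5 prefill + 3 generated
```

Note the asymmetry: prefill builds the cache with one matrix-matrix product, while each generation step contributes a single row.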

Key insight

Prefill is compute-bound (matrix × matrix). Generation is memory-bandwidth-bound (reading the KV cache dominates).

Analogy: encoder-decoder

Prefill resembles an encoder — processes the full context in parallel, produces K, V for every position. Generation resembles a decoder — one step at a time, attending to that context via Q. The differences: in a decoder-only model the weights are shared across both phases, and the “encoder output” (KV cache) grows by one slot each step as generated tokens append their own K, V. You don’t need a separate encoder module because the same weights serve both purposes depending on which phase you’re in.


Why only K and V are cached

See also: Building self attention. In attention, each token plays three roles:

Vector      Question it answers                Depends on
Q (query)   “What am I looking for?”           Only this token’s representation
K (key)     “What do I have to offer?”         Only this token’s representation
V (value)   “What information do I carry?”     Only this token’s representation

During generation, the only token asking a question is the new one — old tokens already asked theirs in prefill, and we discarded those intermediate results (we only kept the last logit). Old Q vectors are useless going forward.

K and V are properties of each token that any future query might attend to. Token “cat” will always produce the same K and V regardless of what comes after it, because they only depend on that token’s own input representation. They are safe to cache forever.

The generation attention computation becomes:

  attn(q, K, V) = softmax(q Kᵀ / √d_k) V

where q has shape (1, d_k) and K, V have shape (t, d_k) from the cache.
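One generation step of this computation, written out in numpy (shapes as above; the concrete sizes d_k = 64, t = 10 are arbitrary for the sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
d_k, t = 64, 10                      # head dim, tokens already in the cache

q = rng.standard_normal((1, d_k))    # fresh query: only the new token asks
K = rng.standard_normal((t, d_k))    # cached keys, one row per past token
V = rng.standard_normal((t, d_k))    # cached values

scores = q @ K.T / np.sqrt(d_k)      # (1, t): one dot product per cached key
w = np.exp(scores - scores.max())
w /= w.sum()                         # softmax over the t cached positions
out = w @ V                          # (1, d_k): weighted mix of cached values

print(out.shape)                     # → (1, 64)
```

This is the vector-matrix shape that makes generation bandwidth-bound: the work per step is dominated by streaming K and V, not by arithmetic.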

Caching Q would be useless

q must be computed fresh each step (the token it belongs to doesn’t exist yet), and old Q vectors serve no future purpose.


KV cache structure

The cache holds one K and one V tensor per layer, per token, per head:

Total memory:

  KV cache bytes = 2 × n_layers × n_kv_heads × d_head × n_tokens × bytes_per_param

For a typical model (32 layers, 8 KV heads, d_head = 128, bf16): 2 × 32 × 8 × 128 × 2 bytes = 128 KiB per token.

At 1024 tokens that is ~128 MB per sequence — this is why long-context serving is expensive.
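A quick sanity check of the formula. The configuration numbers here are assumptions matching the example above (a Llama-3-8B-like KV layout: 32 layers, 8 grouped-query KV heads, head dim 128):

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, n_tokens, bytes_per_param=2):
    # 2x for K and V; bf16 means 2 bytes per parameter.
    return 2 * n_layers * n_kv_heads * d_head * n_tokens * bytes_per_param

# Assumed Llama-3-8B-like layout, 1024 tokens of context:
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, d_head=128, n_tokens=1024)
print(size / 2**20, "MiB")   # → 128.0 MiB
```

Per-sequence cost scales linearly in context length, so serving many long-context requests concurrently multiplies this quickly.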



Why generation is equivalent to a full forward pass

Every op except attention treats the sequence dimension identically to batch:

  • RMSNorm — pointwise per token, no cross-token interaction
  • QKV projections — independent linear maps per token
  • FFN — same MLP applied to each token independently; the number of tokens is literally the batch size here

Attention is the only op that mixes information across positions. The KV cache supplies exactly what a full forward pass would have computed — K and V for every past token are identical to what you’d get re-running them through the transformer. So generation with KV cache is mathematically equivalent to naively re-running all tokens every step. The cache is memoization of the parts that don’t change.
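The equivalence can be checked numerically for one attention layer. A minimal sketch: compute the last token’s output once by re-running all tokens, and once from a cache built for the earlier tokens (the last token attends to all positions either way, so no extra masking is needed here):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 8, 6
X = rng.standard_normal((n, d))               # embeddings for n tokens
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

def attend(Q, K, V):
    s = Q @ K.T / np.sqrt(d)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

# Naive: re-run all n tokens, keep only the last row.
full = attend(X @ W_q, X @ W_k, X @ W_v)[-1]

# Cached: K/V for the first n-1 tokens were memoized earlier;
# only the newest token's q, k, v are computed this step.
K_cache, V_cache = X[:-1] @ W_k, X[:-1] @ W_v
q, k, v = X[-1:] @ W_q, X[-1:] @ W_k, X[-1:] @ W_v
cached = attend(q, np.vstack([K_cache, k]), np.vstack([V_cache, v]))[0]

assert np.allclose(full, cached)              # identical up to float rounding
```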


Compute profile comparison

                       Prefill                            Generation
Attention shape        matrix × matrix, (n×d_k)(d_k×n)    vector × matrix, (1×d_k)(d_k×t)
Attention complexity   O(n²·d) total                      O(t·d) per step
Bottleneck             Compute (ALU)                      Memory bandwidth
Parallelism            Full across tokens                 Sequential
RMSNorm / FFN shape    (n, d_model)                       (1, d_model)
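The bandwidth-bound claim can be made concrete with arithmetic intensity (FLOPs per byte moved). A generation-step attention read of the cache does roughly one FLOP per byte, while modern accelerators need on the order of 100+ FLOPs per byte to be compute-bound (e.g. an A100 at ~312 bf16 TFLOPS over ~2 TB/s of HBM bandwidth works out to roughly 150). A back-of-envelope sketch for one single-head step:

```python
def arithmetic_intensity(t, d_k, bytes_per_param=2):
    # One generation step: q·Kᵀ then weights·V, each ~2·t·d_k FLOPs.
    flops = 2 * t * d_k + 2 * t * d_k
    # K and V (t·d_k entries each) must be streamed from memory every step.
    bytes_read = 2 * t * d_k * bytes_per_param
    return flops / bytes_read                  # FLOPs per byte

print(arithmetic_intensity(t=4096, d_k=128))   # → 1.0 FLOP per byte
```

At ~1 FLOP per byte the ALUs sit idle waiting on memory, which is exactly the generation-side profile in the table above; prefill’s matrix-matrix shape reuses each loaded weight across n tokens and lands on the compute-bound side instead.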