These notes are inspired by and generated from my conversation with Claude Sonnet 4.6


Two phases of inference

Every LLM inference call splits into two distinct phases:

  • Prefill — the full prompt (n tokens) is processed in one parallel forward pass, identical to training. Produces the KV cache and the first generated token.
  • Generation — one new token at a time, using the cached K and V from all previous tokens. Sequential and memory-bandwidth-bound.
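The two phases can be sketched with a toy single-head attention layer. This is a minimal illustration under assumptions of my own (the weight names W_q, W_k, W_v and the "feed the attention output back in" decode step are made up for the sketch, not any real model's API):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy single-head "model": fixed random projection weights.
d_model = 16
W_q = rng.standard_normal((d_model, d_model))
W_k = rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model))

def attend(Q, K, V):
    # Scaled dot-product attention; each row of Q attends over the rows of K/V.
    scores = Q @ K.T / np.sqrt(d_model)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

# Prefill: the whole prompt (n = 5 tokens) in one parallel pass
# produces K and V for every position at once.
prompt = rng.standard_normal((5, d_model))
K_cache = prompt @ W_k          # (5, d_model)
V_cache = prompt @ W_v          # (5, d_model)

# Generation: one token at a time; each step appends its own k, v to the cache.
x = rng.standard_normal((1, d_model))        # embedding of the newest token
for _ in range(3):
    k, v = x @ W_k, x @ W_v
    K_cache = np.vstack([K_cache, k])        # cache grows by one slot per step
    V_cache = np.vstack([V_cache, v])
    x = attend(x @ W_q, K_cache, V_cache)    # toy stand-in for the next decode input

print(K_cache.shape)                         # → (8, 16): 5 prefill + 3 generated
```

Note the asymmetry: prefill builds the cache with one matrix-matrix product, while each generation step contributes a single row.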

Key insight

Prefill is compute-bound (matrix × matrix). Generation is memory-bandwidth-bound (reading the KV cache dominates).

Analogy: encoder-decoder

Prefill resembles an encoder — processes the full context in parallel, produces K, V for every position. Generation resembles a decoder — one step at a time, attending to that context via Q. The differences: in a decoder-only model the weights are shared across both phases, and the “encoder output” (KV cache) grows by one slot each step as generated tokens append their own K, V. You don’t need a separate encoder module because the same weights serve both purposes depending on which phase you’re in.


Why only K and V are cached

See also: Building self attention. In attention, each token plays three roles:

Vector      Question it answers                Depends on
Q (query)   “What am I looking for?”           Only this token’s representation
K (key)     “What do I have to offer?”         Only this token’s representation
V (value)   “What information do I carry?”     Only this token’s representation

During generation, the only token asking a question is the new one — old tokens already asked theirs in prefill, and we discarded those intermediate results (we only kept the last logit). Old Q vectors are useless going forward.

K and V are properties of each token that any future query might attend to. Token “cat” will always produce the same K and V regardless of what comes after it, because they only depend on that token’s own input representation. They are safe to cache forever.

The generation attention computation becomes:

  attn(q, K, V) = softmax(q Kᵀ / √d_k) V

where q has shape (1, d_k) and K, V have shape (t, d_k) from the cache.
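One generation step of this computation, written out in numpy (shapes as above; the concrete sizes d_k = 64, t = 10 are arbitrary for the sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
d_k, t = 64, 10                      # head dim, tokens already in the cache

q = rng.standard_normal((1, d_k))    # fresh query: only the new token asks
K = rng.standard_normal((t, d_k))    # cached keys, one row per past token
V = rng.standard_normal((t, d_k))    # cached values

scores = q @ K.T / np.sqrt(d_k)      # (1, t): one dot product per cached key
w = np.exp(scores - scores.max())
w /= w.sum()                         # softmax over the t cached positions
out = w @ V                          # (1, d_k): weighted mix of cached values

print(out.shape)                     # → (1, 64)
```

This is the vector-matrix shape that makes generation bandwidth-bound: the work per step is dominated by streaming K and V, not by arithmetic.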

Caching Q would be useless

q must be computed fresh each step (the token it belongs to doesn’t exist yet), and old Q vectors serve no future purpose.


KV cache structure

The cache holds one K and one V tensor per layer, per token, per head:

Total memory:

  KV cache bytes = 2 × n_layers × n_kv_heads × d_head × n_tokens × bytes_per_param

For a typical model (32 layers, 8 KV heads, d_head = 128, bf16): 2 × 32 × 8 × 128 × 2 bytes = 128 KiB per token.

At 1024 tokens that is ~128 MB per sequence — this is why long-context serving is expensive.
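A quick sanity check of the formula. The configuration numbers here are assumptions matching the example above (a Llama-3-8B-like KV layout: 32 layers, 8 grouped-query KV heads, head dim 128):

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, n_tokens, bytes_per_param=2):
    # 2x for K and V; bf16 means 2 bytes per parameter.
    return 2 * n_layers * n_kv_heads * d_head * n_tokens * bytes_per_param

# Assumed Llama-3-8B-like layout, 1024 tokens of context:
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, d_head=128, n_tokens=1024)
print(size / 2**20, "MiB")   # → 128.0 MiB
```

Per-sequence cost scales linearly in context length, so serving many long-context requests concurrently multiplies this quickly.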



Why generation is equivalent to a full forward pass

Every op except attention treats the sequence dimension identically to batch:

  • RMSNorm — pointwise per token, no cross-token interaction
  • QKV projections — independent linear maps per token
  • FFN — same MLP applied to each token independently; the number of tokens is literally the batch size here

Attention is the only op that mixes information across positions. The KV cache supplies exactly what a full forward pass would have computed — K and V for every past token are identical to what you’d get re-running them through the transformer. So generation with KV cache is mathematically equivalent to naively re-running all tokens every step. The cache is memoization of the parts that don’t change.
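The equivalence can be checked numerically for one attention layer. A minimal sketch: compute the last token’s output once by re-running all tokens, and once from a cache built for the earlier tokens (the last token attends to all positions either way, so no extra masking is needed here):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 8, 6
X = rng.standard_normal((n, d))               # embeddings for n tokens
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

def attend(Q, K, V):
    s = Q @ K.T / np.sqrt(d)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

# Naive: re-run all n tokens, keep only the last row.
full = attend(X @ W_q, X @ W_k, X @ W_v)[-1]

# Cached: K/V for the first n-1 tokens were memoized earlier;
# only the newest token's q, k, v are computed this step.
K_cache, V_cache = X[:-1] @ W_k, X[:-1] @ W_v
q, k, v = X[-1:] @ W_q, X[-1:] @ W_k, X[-1:] @ W_v
cached = attend(q, np.vstack([K_cache, k]), np.vstack([V_cache, v]))[0]

assert np.allclose(full, cached)              # identical up to float rounding
```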


Compute profile comparison

                       Prefill                            Generation
Attention shape        matrix × matrix, (n×d_k)(d_k×n)    vector × matrix, (1×d_k)(d_k×t)
Attention complexity   O(n²·d) total                      O(t·d) per step
Bottleneck             Compute (ALU)                      Memory bandwidth
Parallelism            Full across tokens                 Sequential
RMSNorm / FFN shape    (n, d_model)                       (1, d_model)
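The bandwidth-bound claim can be made concrete with arithmetic intensity (FLOPs per byte moved). A generation-step attention read of the cache does roughly one FLOP per byte, while modern accelerators need on the order of 100+ FLOPs per byte to be compute-bound (e.g. an A100 at ~312 bf16 TFLOPS over ~2 TB/s of HBM bandwidth works out to roughly 150). A back-of-envelope sketch for one single-head step:

```python
def arithmetic_intensity(t, d_k, bytes_per_param=2):
    # One generation step: q·Kᵀ then weights·V, each ~2·t·d_k FLOPs.
    flops = 2 * t * d_k + 2 * t * d_k
    # K and V (t·d_k entries each) must be streamed from memory every step.
    bytes_read = 2 * t * d_k * bytes_per_param
    return flops / bytes_read                  # FLOPs per byte

print(arithmetic_intensity(t=4096, d_k=128))   # → 1.0 FLOP per byte
```

At ~1 FLOP per byte the ALUs sit idle waiting on memory, which is exactly the generation-side profile in the table above; prefill’s matrix-matrix shape reuses each loaded weight across n tokens and lands on the compute-bound side instead.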