MLA — Multi-head Latent Attention

Disclaimer: I haven’t read the full paper.

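  • Key idea: project each key and value vector down from n_h · d_h dimensions (number of heads × per-head dimension) to d_c dimensions
  • DeepSeek v2: reduce n_h · d_h = 128 × 128 = 16384 to d_c = 512
  • Wrinkle: MLA is not compatible with RoPE, so an additional 64 dimensions are needed for RoPE, giving 512 + 64 = 576 total dimensions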
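
The arithmetic behind the cache saving can be checked in a few lines. This is a rough sketch: the head count and per-head dimension are the DeepSeek v2 values as I understand them, the variable names are my own, and sizes are counted in elements per token per layer rather than bytes.

```python
# Per-token, per-layer KV-cache size in elements (not bytes).
n_heads = 128     # assumed: DeepSeek v2 attention heads
d_head = 128      # assumed: per-head dimension
d_latent = 512    # compressed KV latent width
d_rope = 64       # extra dimensions carried for RoPE

mha_k_or_v = n_heads * d_head      # one full key (or value) vector
mha_cache = 2 * mha_k_or_v         # standard MHA stores both K and V
mla_cache = d_latent + d_rope      # MLA stores one latent plus the RoPE part

print(mha_k_or_v, mha_cache, mla_cache)            # 16384 32768 576
print(f"reduction: {mha_cache / mla_cache:.1f}x")  # reduction: 56.9x
```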

The rest of the note is based on my conversation with Claude Sonnet 4.6:


Core Idea: Bottleneck + Cache Shift

Standard MHA caches K and V at full dimension. MLA instead:

  1. Compress x → latent c_KV (small, cache this)
  2. Decompress c_KV → K, V at inference time

Key insight

We shift where we cache. Instead of caching K, V ∈ ℝ^d, we cache c_KV ∈ ℝ^{d_c} where d_c ≪ d. The decompression back to full dimension happens on-the-fly and is not stored.

Dimensions

Let d = d_model, n_h = number of heads, d_h = per-head dimension, and d_c = latent (compressed) dimension.

                    KV cache / token               Expressiveness
Standard MHA        2 · n_h · d_h                  Full rank
Shrink K/V heads    2 · n_h · d_h' (d_h' < d_h)    Reduced (small heads)
MLA                 d_c                            Full rank (decompressed)

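Compression:

    c_KV = W_DKV · x,    W_DKV ∈ ℝ^{d_c × d}

Decompression:

    K = W_UK · c_KV,    V = W_UV · c_KV,    W_UK, W_UV ∈ ℝ^{n_h·d_h × d_c}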
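
A minimal NumPy sketch of the compress-then-decompress path may help fix the shapes. The weight names (W_DKV, W_UK, W_UV) and the toy sizes are assumptions chosen for illustration, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, d_head, d_latent = 64, 4, 16, 8   # toy sizes, not real model sizes
seq_len = 5

W_DKV = rng.standard_normal((d_latent, d_model))          # down-projection
W_UK = rng.standard_normal((n_heads * d_head, d_latent))  # key up-projection
W_UV = rng.standard_normal((n_heads * d_head, d_latent))  # value up-projection

x = rng.standard_normal((seq_len, d_model))   # one hidden state per token

c_kv = x @ W_DKV.T    # (seq_len, d_latent): the only thing that gets cached
k = c_kv @ W_UK.T     # (seq_len, n_heads * d_head): rebuilt on the fly
v = c_kv @ W_UV.T     # (seq_len, n_heads * d_head): rebuilt on the fly

k_heads = k.reshape(seq_len, n_heads, d_head)   # split into per-head keys
print(c_kv.shape, k.shape, v.shape, k_heads.shape)
# (5, 8) (5, 64) (5, 64) (5, 4, 16)
```

Only `c_kv` would persist between decoding steps; `k` and `v` are temporaries here.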

Is it equivalent to just shrinking K/V?

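No. Shrinking K/V heads means each head genuinely has fewer dimensions, so it is less expressive. MLA caches a small latent but reconstructs K/V at the full per-head dimension via learned up-projections. (Stacked across all heads, the reconstructed K and V still have rank at most d_c, since every head reads the same latent; the point is that no individual head is narrowed.) The bottleneck is in the cache, not in the per-head attention computation.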
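
One way to see the difference numerically is to compare the rank of the keys stacked across all heads with the rank of a single head's keys. This is a toy sketch with random weights and sizes I picked for illustration (with d_head smaller than the latent width, as in DeepSeek v2); the names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, d_head, d_latent, n_tokens = 4, 16, 24, 200   # toy sizes

W_UK = rng.standard_normal((n_heads * d_head, d_latent))
c_kv = rng.standard_normal((n_tokens, d_latent))   # stand-in for cached latents
k = c_kv @ W_UK.T                                  # (n_tokens, n_heads * d_head)

rank_all_heads = np.linalg.matrix_rank(k)             # bounded by d_latent
rank_one_head = np.linalg.matrix_rank(k[:, :d_head])  # columns of head 0 only

print(rank_all_heads, rank_one_head)
```

With these sizes the stacked keys should come out at rank 24 (the latent width) while head 0 alone reaches rank 16, its full per-head dimension.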

The Absorption Trick (avoid materializing K/V)

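Naive MLA needs two extra batched matmuls (BMMs) to rebuild K and V from the cached latents before attention, which is more work than standard MHA. The trick: absorb the up-projection weights into the query projection and the output projection. Written for a single head, the score between a query q and a cached latent c_KV is

    qᵀ K = qᵀ (W_UK · c_KV) = (W_UKᵀ · q)ᵀ c_KV

So redefine q̃ = W_UKᵀ · q (W_UKᵀ is merged into W_Q, done once), then attention runs directly against the cached latents c_KV:

    o = Σ_t softmax_t(q̃ᵀ c_KV,t / √d_h) · c_KV,t,    output = W_O · W_UV · o

where W_UV is absorbed into the output projection W_O.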
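
The equivalence can be checked numerically. The sketch below compares a naive path (materialize K and V, then attend) with an absorbed path (attend directly over the latents) for one decoding step. All names, the per-head weight layout, and the toy sizes are assumptions for illustration, and RoPE is left out.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_heads, d_head, d_latent, d_model, n_ctx = 4, 16, 8, 64, 10

W_UK = rng.standard_normal((n_heads, d_head, d_latent))   # per-head key up-projection
W_UV = rng.standard_normal((n_heads, d_head, d_latent))   # per-head value up-projection
W_O = rng.standard_normal((d_model, n_heads * d_head))    # output projection

c_kv = rng.standard_normal((n_ctx, d_latent))   # cached latents, one per past token
q = rng.standard_normal((n_heads, d_head))      # current token's per-head queries
scale = 1.0 / np.sqrt(d_head)

# Naive path: materialize per-head K and V for every cached token.
k = np.einsum("hdc,tc->htd", W_UK, c_kv)
v = np.einsum("hdc,tc->htd", W_UV, c_kv)
attn = softmax(np.einsum("hd,htd->ht", q, k) * scale)
out_naive = W_O @ np.einsum("ht,htd->hd", attn, v).reshape(-1)

# Absorbed path: fold W_UK into the query and W_UV into the output projection.
# (In a real model both folds would be precomputed into the weights once.)
q_abs = np.einsum("hd,hdc->hc", q, W_UK)                  # (n_heads, d_latent)
W_O_abs = np.einsum("mhd,hdc->mhc",
                    W_O.reshape(d_model, n_heads, d_head), W_UV)
attn_abs = softmax(np.einsum("hc,tc->ht", q_abs, c_kv) * scale)
ctx = np.einsum("ht,tc->hc", attn_abs, c_kv)              # weighted sum of latents
out_abs = np.einsum("mhc,hc->m", W_O_abs, ctx)

print(np.allclose(out_naive, out_abs))   # True
```

Note that in the absorbed path every head attends over the same `c_kv` array, so nothing of size n_heads × d_head per token is ever built.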

Result

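At inference, attention is still 1 BMM against the cache, same as standard MHA, but the cache is d_c wide instead of 2 · n_h · d_h. This is a large win on memory. Compute is not strictly free, since each head's scores become dot products over d_c dims rather than d_h, but decoding is usually memory-bandwidth bound, so the smaller cache dominates.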

RoPE Caveat

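RoPE is position-dependent, so it cannot be absorbed into a static weight matrix: the rotation would sit between W_Q and W_UK and change with the relative position of each query-key pair. MLA therefore keeps a small separate RoPE key k_R (64 dims in DeepSeek v2, shared across heads) that is materialized explicitly and cached alongside the latent. This bypasses the compression for positional info only.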
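
A small sketch of why the rotation blocks absorption, and of the decoupled score that works around it. The `rope_matrix` helper, the toy sizes, and the unscaled score are my own simplifications, not taken from the paper.

```python
import numpy as np

def rope_matrix(pos, dim, base=10000.0):
    """Block-diagonal RoPE rotation for one position (dim must be even)."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    R = np.zeros((dim, dim))
    for i, theta in enumerate(pos * freqs):
        cos_t, sin_t = np.cos(theta), np.sin(theta)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[cos_t, -sin_t], [sin_t, cos_t]]
    return R

rng = np.random.default_rng(0)
d_head, d_latent, d_rope = 8, 4, 4   # toy sizes
W_UK = rng.standard_normal((d_head, d_latent))
t, s = 7, 3                          # query position, key position

# With RoPE applied to decompressed keys, the score contains
# R(t)^T R(s) W_UK, and R(t)^T R(s) = R(s - t) depends on the offset.
assert np.allclose(rope_matrix(t, d_head).T @ rope_matrix(s, d_head),
                   rope_matrix(s - t, d_head))

M_offset_1 = rope_matrix(1, d_head) @ W_UK
M_offset_2 = rope_matrix(2, d_head) @ W_UK
print(np.allclose(M_offset_1, M_offset_2))   # False: no single static matrix

# Decoupled form: an absorbable content term plus a small rotated term.
c_s = rng.standard_normal(d_latent)          # cached latent for position s
q_content = rng.standard_normal(d_head)
q_rope = rng.standard_normal(d_rope)
k_rope = rng.standard_normal(d_rope)         # cached separately, rotated once
score = (W_UK.T @ q_content) @ c_s \
    + (rope_matrix(t, d_rope) @ q_rope) @ (rope_matrix(s, d_rope) @ k_rope)
```

The content term never touches a rotation, so it can still use the absorbed weights; only the small `d_rope`-wide term carries position.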