MLA — Multi-head Latent Attention

Disclaimer: I haven’t read the full paper.

Key idea: jointly project each token's key and value vectors down from $N * H$ dimensions ($N$ heads, $H$ dimensions per head) to a single $C$-dimensional latent
DeepSeek v2: reduce $N * H = 16384$ to $C = 512$
Wrinkle: the compressed keys are not compatible with RoPE, so an additional 64-dimensional RoPE key is cached alongside the latent, for $512 + 64 = 576$ total dimensions per token
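
To make the numbers above concrete, here is a small back-of-the-envelope calculation of cache size per token per layer, counted in elements rather than bytes. The split of 16384 into 128 heads of 128 dimensions is my assumption about the DeepSeek v2 configuration (the note only records the product), and the variable names are purely illustrative.

```python
# Per-token, per-layer KV-cache size, counted in elements (not bytes).
n_heads = 128      # N: assumed number of attention heads
head_dim = 128     # H: assumed per-head dimension, so N * H = 16384
latent_dim = 512   # C: compressed KV latent
rope_dim = 64      # extra dimensions for the separate RoPE key

mha_cache = 2 * n_heads * head_dim   # standard MHA stores K and V separately
mla_cache = latent_dim + rope_dim    # MLA stores one shared latent plus the RoPE key

print(mha_cache, mla_cache, round(mha_cache / mla_cache, 1))
# prints: 32768 576 56.9
```

Under these assumptions the cache shrinks by a factor of roughly 57 per token and layer.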

The rest of the note is based on my conversation with Claude Sonnet 4.6:

Core Idea: Bottleneck + Cache Shift

Standard MHA caches K and V at full dimension. MLA instead:

Compress x → latent c_KV (small, cache this)
Decompress c_KV → K, V at inference time

Key insight

We shift what we cache. Instead of caching $K, V \in \mathbb{R}^{d}$, we cache $c_{KV} \in \mathbb{R}^{d_{c}}$ where $d_{c} \ll d$. The decompression back to full dimension happens on the fly and is not stored.

Dimensions

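Let $d$ = d_model, $h$ = number of heads, and $d_{h} = d / h$ the per-head dimension. This assumes the total K/V width $h \cdot d_{h}$ equals d_model, as in standard MHA.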

| Approach | KV cache / token | Expressiveness |
| --- | --- | --- |
| Standard MHA | $2 d$ | Full rank |
| Shrink K/V heads | $2 d_{c}$ | Reduced (small heads) |
| MLA | $d_{c}$ | Full-size heads (decompressed) |

Compression:

$$c_{KV} = x W_{DKV}, \quad W_{DKV} \in \mathbb{R}^{d \times d_{c}}$$

Decompression:

$$K = c_{KV} W_{UK}, \quad V = c_{KV} W_{UV}, \quad W_{UK}, W_{UV} \in \mathbb{R}^{d_{c} \times d}$$
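
A minimal NumPy sketch of the compression and decompression steps, mainly to make the shapes visible. The sizes are toy values I picked and the weights are random rather than learned; only the names `W_DKV`, `W_UK`, `W_UV`, and `c_KV` come from the formulas above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d, d_c = 6, 64, 8   # toy sizes with d_c << d

x = rng.standard_normal((n_tokens, d))    # token hidden states
W_DKV = rng.standard_normal((d, d_c))     # down-projection
W_UK = rng.standard_normal((d_c, d))      # up-projection for keys
W_UV = rng.standard_normal((d_c, d))      # up-projection for values

c_KV = x @ W_DKV   # (n_tokens, d_c): the only tensor that gets cached
K = c_KV @ W_UK    # (n_tokens, d): rebuilt on the fly, never stored
V = c_KV @ W_UV    # (n_tokens, d): rebuilt on the fly, never stored

cache_mla = c_KV.size          # elements cached by MLA
cache_mha = K.size + V.size    # elements a standard MHA cache would hold
print(c_KV.shape, K.shape, V.shape, cache_mha // cache_mla)
# prints: (6, 8) (6, 64) (6, 64) 16
```

The ratio is $2d / d_{c}$, which is 16 for these toy sizes.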

Is it equivalent to just shrinking K/V?

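No. Shrinking K/V heads means each head genuinely has fewer dimensions, so it is less expressive. MLA caches a small latent but reconstructs full-size K/V via learned up-projections, so every head still attends with $d_{h}$ dimensions. Taken across all heads, the reconstructed K and V have rank at most $d_{c}$, but a single head's slice can still be full rank as long as $d_{c} \ge d_{h}$. The bottleneck is in the cache, not in the attention computation.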
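
A quick numerical check of what the up-projection does and does not recover. This is a sketch with random weights and toy sizes of my choosing, assuming heads are contiguous column slices of K. It compares the rank of the whole reconstructed K with the rank of one head's slice.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, h, d_h, d_c = 64, 4, 8, 16   # toy sizes with d_h < d_c < d
d = h * d_h                             # total K width: 32

c_KV = rng.standard_normal((n_tokens, d_c))   # cached latents
W_UK = rng.standard_normal((d_c, d))          # up-projection for keys
K = c_KV @ W_UK                                # (n_tokens, d)

rank_all = np.linalg.matrix_rank(K)             # bounded above by d_c
rank_head = np.linalg.matrix_rank(K[:, :d_h])   # head 0 only
print(rank_all, rank_head)
```

With generic weights the first number comes out as $d_{c}$ and the second as $d_{h}$: all heads together share a $d_{c}$-dimensional subspace, while each individual head still uses all of its $d_{h}$ dimensions.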

The Absorption Trick (avoid materializing K/V)

Naive MLA needs 2 extra BMMs (batched matmuls) to rebuild K and V from the cache before attention, which is more work than standard MHA. The trick: absorb the up-projection weights into the query projection and the output projection.

$$\text{scores} = Q K^{\top} = Q (C_{KV} W_{UK})^{\top} = (Q W_{UK}^{\top}) C_{KV}^{\top} = Q' C_{KV}^{\top}$$

So redefine $Q' = Q W_{UK}^{\top}$ (with $W_{UK}^{\top}$ merged into $W_{Q}$ once, ahead of time), then attention runs directly against the cached latents $C_{KV}$:

$$\text{out} = \operatorname{softmax}(Q' C_{KV}^{\top}) \, C_{KV} W_{UV}$$

where $W_{UV}$ is absorbed into the output projection $W_{O}$. The usual $1/\sqrt{d_{h}}$ scaling is omitted here for brevity.
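
To convince myself the absorption is exact, here is a small NumPy check that the naive path (materialize K and V) and the absorbed path (attend directly over the latents) agree. It is a single-head sketch with random weights and toy sizes, using $K = C_{KV} W_{UK}$ and $V = C_{KV} W_{UV}$ from the decompression step. The names `W_Q_abs` and `W_O_abs` are mine, and the $1/\sqrt{d_{h}}$ scaling is left out.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_q, n_kv, d, d_c = 3, 5, 16, 4   # toy sizes, single head

x_q = rng.standard_normal((n_q, d))               # query-side hidden states
C_KV = rng.standard_normal((n_kv, d_c))           # cached latents
W_Q = rng.standard_normal((d, d)) / np.sqrt(d)
W_UK = rng.standard_normal((d_c, d)) / np.sqrt(d_c)
W_UV = rng.standard_normal((d_c, d)) / np.sqrt(d_c)
W_O = rng.standard_normal((d, d)) / np.sqrt(d)

# Naive path: rebuild K and V, then do ordinary attention.
Q = x_q @ W_Q
K = C_KV @ W_UK
V = C_KV @ W_UV
out_naive = softmax(Q @ K.T) @ V @ W_O

# Absorbed path: fold W_UK into the query side and W_UV into the output side.
W_Q_abs = W_Q @ W_UK.T    # (d, d_c), computed once
W_O_abs = W_UV @ W_O      # (d_c, d), computed once
Q_abs = x_q @ W_Q_abs     # queries now live in the latent space
out_absorbed = softmax(Q_abs @ C_KV.T) @ C_KV @ W_O_abs

print(np.allclose(out_naive, out_absorbed))
```

The two outputs match up to floating-point error, and the absorbed path never forms a tensor of shape `(n_kv, d)`.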

Result

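At inference, still 1 BMM against the cache, same as standard MHA, but the cache is $d_{c}$ wide instead of $2 d$ and K/V are never materialized. The absorbed queries are $d_{c}$ wide per head rather than $d_{h}$, so each score costs more multiply-adds, but decoding is usually limited by reading the cache from memory, which is where the smaller cache pays off.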

RoPE Caveat

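RoPE is position-dependent: the rotation that would sit between $W_{Q}$ and $W_{UK}^{\top}$ changes with the relative position of each query-key pair, so it cannot be absorbed into a static weight matrix. MLA keeps a small separate $K_{rope}$ (64 dimensions in DeepSeek v2, shared across heads) that is materialized explicitly and cached alongside the latent. This bypasses the compression for positional info only.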
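
The sketch below illustrates why the absorption breaks once RoPE is applied to the reconstructed keys, and how a separate RoPE key sidesteps it. `rope_matrix` and `merged` are hypothetical helpers I wrote for this note (row-vector convention, standard base of 10000), the weights are random, and the sizes are toy values.

```python
import numpy as np

def rope_matrix(pos, dim, base=10000.0):
    """Block-diagonal RoPE rotation for position `pos` (dim must be even)."""
    R = np.zeros((dim, dim))
    for i in range(dim // 2):
        theta = pos * base ** (-2 * i / dim)
        c, s = np.cos(theta), np.sin(theta)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R

rng = np.random.default_rng(0)
d, d_c, d_r = 8, 4, 2
W_Q = rng.standard_normal((d, d))
W_UK = rng.standard_normal((d_c, d))

def merged(m, n):
    # Matrix we would need to precompute to absorb W_UK when the query at
    # position m and the key at position n are both rotated.
    return W_Q @ rope_matrix(m, d) @ rope_matrix(n, d).T @ W_UK.T

same_offset = np.allclose(merged(3, 1), merged(7, 5))   # same m - n
static = np.allclose(merged(3, 1), merged(3, 2))        # different m - n
print(same_offset, static)   # prints: True False

# Decoupled alternative: a position-free part scored against the latent,
# plus a small RoPE part scored against a separately cached rotated key.
c_n = rng.standard_normal(d_c)                              # cached latent
k_rope_n = rng.standard_normal(d_r) @ rope_matrix(5, d_r)   # cached RoPE key
q_abs = rng.standard_normal(d_c)                            # absorbed query part
q_rope_m = rng.standard_normal(d_r) @ rope_matrix(7, d_r)   # rotated query part
score = q_abs @ c_n + q_rope_m @ k_rope_n
cached_width = c_n.size + k_rope_n.size                     # d_c + d_r per token
```

The merged matrix depends on the offset `m - n`, so no single precomputed matrix works for every query-key pair. In the decoupled version only the `d_r`-wide part carries position, and the cache grows from `d_c` to `d_c + d_r` per token.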
