Flash Attention speeds up attention computation by:

  • Incrementally computing the softmax without ever materializing the full attention score matrix (the matmul result) in HBM — this is the tiling; see the NumPy sketch after this list.
  • Recomputing intermediate results in the backward pass instead of storing them, at the cost of keeping a small amount of extra information (the softmax statistics). What the paper doesn't cover is how you know, in the first place, that HBM access is what makes attention slow.
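
A minimal NumPy sketch of the tiling idea, to make the first bullet concrete. This is illustrative only: the block size, variable names, and the 1/√d scaling are my choices, not the paper's fused CUDA kernel, but the accumulator updates follow the same online-softmax scheme.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Reference implementation: materializes the full N x N score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=128):
    """Online-softmax attention: K/V are processed in blocks, so only an
    N x block slab of scores exists at any time, never the full N x N matrix."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))          # running (unnormalized) output accumulator
    m = np.full(N, -np.inf)       # running row-wise max of scores seen so far
    l = np.zeros(N)               # running row-wise sum of exp(score - m)

    for j in range(0, N, block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = (Q @ Kj.T) * scale                # scores against this block only
        m_new = np.maximum(m, S.max(axis=1))  # updated running max
        alpha = np.exp(m - m_new)             # rescale the old accumulators
        P = np.exp(S - m_new[:, None])        # unnormalized block probabilities
        l = alpha * l + P.sum(axis=1)
        O = alpha[:, None] * O + P @ Vj
        m = m_new

    return O / l[:, None]                     # normalize once at the end

# Sanity check: both paths agree up to floating-point error.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((512, 64)) for _ in range(3))
assert np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V), atol=1e-6)
```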

Standard attention

$N$ is the sequence length and $d$ is the head dimension.
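
For reference, the standard (materializing) formulation, following the paper's notation (the softmax scaling is omitted for brevity, as in the paper's exposition):

$$
S = QK^\top \in \mathbb{R}^{N \times N}, \qquad
P = \mathrm{softmax}(S) \in \mathbb{R}^{N \times N}, \qquad
O = PV \in \mathbb{R}^{N \times d},
$$

with the softmax applied row-wise. A standard implementation writes $S$ and $P$ to HBM and reads them back, which is exactly the memory traffic Flash Attention avoids.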

Flash Attention

TODO: Add the Latex formula here.
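
In the meantime, a sketch of the online-softmax recurrence that the tiling relies on (my notation, with the normalization deferred to the very end; the paper's Algorithm 1 instead keeps $O$ normalized after every block). For the $j$-th key/value block $K_j, V_j$, starting from $m^{(0)} = -\infty$, $\ell^{(0)} = 0$, $O^{(0)} = 0$:

$$
\begin{aligned}
S_j &= Q K_j^\top, \\
m^{(j)} &= \max\!\left(m^{(j-1)},\ \operatorname{rowmax}(S_j)\right), \\
\tilde{P}_j &= \exp\!\left(S_j - m^{(j)}\right), \\
\ell^{(j)} &= e^{\,m^{(j-1)} - m^{(j)}}\,\ell^{(j-1)} + \operatorname{rowsum}(\tilde{P}_j), \\
O^{(j)} &= \operatorname{diag}\!\left(e^{\,m^{(j-1)} - m^{(j)}}\right) O^{(j-1)} + \tilde{P}_j V_j,
\end{aligned}
$$

and after the last block $T$, $O = \operatorname{diag}(\ell^{(T)})^{-1}\, O^{(T)}$. Only $m$, $\ell$, and $O$ live across blocks, so nothing of size $N \times N$ is ever written to HBM. This is the same recurrence as the NumPy sketch above.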

The backward pass typically requires the matrices S, P ∈ ℝ^{N×N} to compute the gradients with respect to Q, K, V. However, by storing the output O and the softmax normalization statistics (𝑚, ℓ), we can recompute the attention matrices S and P easily in the backward pass from blocks of Q, K, V in SRAM.

flash_attention, page 5
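
Concretely (again my notation, not a quote from the paper): with the per-row statistics $(m, \ell)$ kept from the forward pass, any block of the attention matrix can be rebuilt in SRAM as

$$
S = Q K^\top, \qquad P = \operatorname{diag}(\ell)^{-1} \exp\!\left(S - m\right),
$$

where $m$ is subtracted row-wise. The backward pass thus trades a modest amount of recomputation (extra FLOPs on data already in SRAM) for never having to read the $N \times N$ matrices back from HBM.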