Note from my discussion with Claude Opus 4.7, drafted by Sonnet 4.6
Paper: Deformable DETR (Zhu et al., 2020), §4.1.
Core insight
Instead of attending to every spatial location (cost $O(HW)$ per query), each query learns $K$ spatial offsets from a reference point and attends only to those $K$ locations. Offsets and weights both come from the query alone; there is no query-key dot product.
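The formula (single-scale)
$$\mathrm{DeformAttn}(z_q, p_q, x) = \sum_{m=1}^{M} W_m \Big[ \sum_{k=1}^{K} A_{mqk} \cdot W'_m \, x(p_q + \Delta p_{mqk}) \Big]$$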
| Symbol | Meaning |
|---|---|
| $z_q$ | Query feature vector ($C$-dim) |
| $p_q$ | Reference point (2D coordinate) |
| $x$ | Feature map ($C \times H \times W$) |
| $m$ | Head index ($M$ heads, default 8) |
| $k$ | Sample index ($K$ per head, default 4) |
| $\Delta p_{mqk}$ | Sampling offset (2D, unconstrained range) |
| $A_{mqk}$ | Attention weight; $\sum_{k} A_{mqk} = 1$ per head |
| $W'_m$ | Value projection matrix (per head) |
| $W_m$ | Output projection matrix (per head) |
How offsets and weights are computed
Both $\Delta p_{mqk}$ and $A_{mqk}$ come from a single linear projection on $z_q$ with $3MK$ output channels:
- First $2MK$ channels → reshape to $(M, K, 2)$ → the offsets $\Delta p_{mqk}$. Unconstrained; can point anywhere in the image.
- Last $MK$ channels → softmax over $K$ per head → the weights $A_{mqk}$.
No feature map keys are consulted.
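The split described above can be sketched in a few lines of NumPy. This is a minimal illustration for a single query, not the reference implementation: the names `offsets_and_weights`, `W_proj`, and `b_proj` are hypothetical, and the projection weights are random stand-ins for learned parameters.

```python
import numpy as np

def offsets_and_weights(z_q, W_proj, b_proj, M=8, K=4):
    """Split one 3*M*K-channel projection of the query into offsets and weights."""
    out = W_proj @ z_q + b_proj                    # shape (3*M*K,)
    offsets = out[: 2 * M * K].reshape(M, K, 2)    # first 2MK channels: (dx, dy) per sample
    logits = out[2 * M * K:].reshape(M, K)         # last MK channels: one logit per sample
    logits = logits - logits.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over K within each head
    return offsets, weights

# Hypothetical sizes; random values stand in for learned parameters.
rng = np.random.default_rng(0)
C, M, K = 256, 8, 4
z_q = rng.standard_normal(C)
W_proj = 0.01 * rng.standard_normal((3 * M * K, C))
b_proj = np.zeros(3 * M * K)
offsets, weights = offsets_and_weights(z_q, W_proj, b_proj, M, K)
```

Note that nothing from the feature map enters this function: both outputs depend only on `z_q`.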
No keys involved
Unlike Multi-Head Attention, the weights are not computed from a query-key dot product. The query alone decides where to look and how much to weight each location. This is the defining departure from standard attention.
Sparse in count, not in reach
The constraint is on how many locations are sampled ($K = 4$ per head), not on how far the offsets can reach: $\Delta p_{mqk}$ is unconstrained in range. This is the key distinction from local-window attention (e.g. Swin Transformer) or CNNs, which restrict the radius. DeformAttn restricts the budget.
The sample locations are generally fractional coordinates, so the feature map is read via bilinear interpolation — same mechanism as Deformable Convolution.
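As a rough sketch of how the single-scale pieces fit together, the NumPy code below reads a feature map at fractional coordinates with bilinear interpolation and then forms the weighted sum over $K$ samples per head. Several things here are assumptions made for illustration: the function names, the `(x, y)` pixel-coordinate convention, zero padding outside the map, and the per-head matrix shapes `W_val: (M, C_v, C)` and `W_out: (M, C, C_v)`.

```python
import numpy as np

def bilinear_sample(x, px, py):
    """Read feature map x of shape (C, H, W) at fractional pixel coords (px, py).
    Locations outside the map contribute zero (an assumed padding choice)."""
    C, H, W = x.shape
    x0, y0 = int(np.floor(px)), int(np.floor(py))
    out = np.zeros(C)
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= xi < W and 0 <= yi < H:
                w = (1 - abs(px - xi)) * (1 - abs(py - yi))  # bilinear weight of this corner
                out += w * x[:, yi, xi]
    return out

def deform_attn(p_q, x, offsets, weights, W_val, W_out):
    """Single-scale deformable attention for one query.
    p_q: (2,) reference point; offsets: (M, K, 2); weights: (M, K)."""
    M, K, _ = offsets.shape
    out = np.zeros(W_out.shape[1])
    for m in range(M):
        head = np.zeros(W_val.shape[1])
        for k in range(K):
            px, py = p_q + offsets[m, k]                 # sampling location
            head += weights[m, k] * (W_val[m] @ bilinear_sample(x, px, py))
        out += W_out[m] @ head                           # per-head output projection
    return out

# Hypothetical sizes and random stand-ins for learned quantities.
H, W, C, M, K = 8, 8, 4, 2, 4
Cv = C // M
rng = np.random.default_rng(0)
x = rng.standard_normal((C, H, W))
offsets = rng.standard_normal((M, K, 2))
weights = np.full((M, K), 1.0 / K)
W_val = rng.standard_normal((M, Cv, C))
W_out = rng.standard_normal((M, C, Cv))
y = deform_attn(np.array([3.0, 4.0]), x, offsets, weights, W_val, W_out)
```

Each query touches only $M \cdot K$ locations of the feature map, regardless of $H$ and $W$.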

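Multi-scale extension
$$\mathrm{MSDeformAttn}(z_q, \hat{p}_q, \{x^l\}_{l=1}^{L}) = \sum_{m=1}^{M} W_m \Big[ \sum_{l=1}^{L} \sum_{k=1}^{K} A_{mlqk} \cdot W'_m \, x^l\big(\phi_l(\hat{p}_q) + \Delta p_{mlqk}\big) \Big]$$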
Changes from single-scale:
- $L$ feature levels $\{x^l\}_{l=1}^{L}$ (e.g. from ResNet C3–C5 + one extra strided conv).
- Reference point $\hat{p}_q \in [0, 1]^2$ is normalized; $\phi_l(\hat{p}_q)$ rescales it to pixel coords at level $l$.
- Attention weights $A_{mlqk}$ are now softmaxed over all $(l, k)$ combinations per head, so $\sum_{l} \sum_{k} A_{mlqk} = 1$.
- Each head samples $LK$ total locations, routing freely across scales.
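The two changes that matter in code are the per-level rescaling of the reference point and the joint softmax over all $L \cdot K$ samples. The sketch below illustrates both in NumPy. The helper names and the level shapes are hypothetical, and the half-pixel shift in `rescale_to_level` is one possible coordinate convention chosen for this example, not something these notes specify.

```python
import numpy as np

def rescale_to_level(p_hat, H_l, W_l):
    """phi_l: map a normalized (x, y) in [0, 1]^2 to pixel coords of an H_l x W_l map.
    Assumed convention: pixel i covers [i, i + 1) and its center sits at i + 0.5."""
    return p_hat * np.array([W_l, H_l]) - 0.5

def joint_softmax(logits):
    """logits: (M, L, K) -> weights summing to 1 over all L*K samples of each head."""
    M, L, K = logits.shape
    flat = logits.reshape(M, L * K)
    flat = flat - flat.max(axis=1, keepdims=True)   # numerically stable softmax
    w = np.exp(flat)
    w /= w.sum(axis=1, keepdims=True)
    return w.reshape(M, L, K)

# Hypothetical (H_l, W_l) for L = 4 levels, each roughly half the previous one.
level_shapes = [(100, 150), (50, 75), (25, 38), (13, 19)]
p_hat = np.array([0.5, 0.25])                       # normalized (x, y)
pixel_refs = [rescale_to_level(p_hat, H, W) for H, W in level_shapes]

rng = np.random.default_rng(0)
weights = joint_softmax(rng.standard_normal((8, 4, 4)))  # M = 8, L = 4, K = 4
```

Because the softmax runs over the flattened $(l, k)$ axis, a head is free to put most of its weight on a single level.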
Encoder vs decoder usage
Encoder: each pixel is its own query and reference point. A learned scale-level embedding distinguishes which feature level each pixel comes from. FPN is not used, since multi-scale attention already exchanges cross-level information.
Decoder cross-attention: $N$ object queries; the reference point $\hat{p}_q$ is predicted from the query embedding. Box predictions are relative offsets w.r.t. the reference point, not absolute coordinates (see A.3).
Decoder self-attention: standard attention, unchanged (only $N$ queries, so the cost is small).
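To make "relative offsets w.r.t. the reference point" concrete, here is a small NumPy sketch of the parameterization as I understand it from the paper: the head's raw center outputs are added to the reference point in logit space, and width and height pass through a plain sigmoid. The function names and the clipping constant `eps` are assumptions for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def inverse_sigmoid(p, eps=1e-5):
    """Logit function; clipping avoids infinities at 0 and 1 (eps is an assumed value)."""
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

def box_from_reference(raw, p_hat):
    """raw: head outputs (b_x, b_y, b_w, b_h); p_hat: normalized reference point (x, y).
    The center is predicted as an offset from the reference point in logit space."""
    cx = sigmoid(raw[0] + inverse_sigmoid(p_hat[0]))
    cy = sigmoid(raw[1] + inverse_sigmoid(p_hat[1]))
    w, h = sigmoid(raw[2]), sigmoid(raw[3])
    return np.array([cx, cy, w, h])

# With zero raw outputs the predicted center is exactly the reference point.
box = box_from_reference(np.zeros(4), np.array([0.3, 0.7]))
```

A head that outputs zeros therefore places the box on the reference point, which is the sense in which the reference point acts as an initial guess.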
Decoder object queries and the reference point
In DETR, the object queries carry no spatial information — specialization emerges implicitly through bipartite matching over hundreds of epochs. Deformable DETR makes spatial grounding explicit: the reference point gives each query a 2D anchor from the start.
The two-stage variant pushes further — spatial grounding comes entirely from the encoder:
| Variant | Reference point source | Query source |
|---|---|---|
| DETR | none — fully implicit | learned embedding |
| Deformable DETR (1-stage) | predicted from the query embedding | learned embedding |
| Deformable DETR (2-stage) | encoder proposal center | encoder feature at proposal |
The progression
DETR trusts queries to figure out space implicitly. One-stage gives them an explicit anchor. Two-stage doesn’t trust the queries at all — the “elegant slot” idea is quietly retired.
Iterative bounding box refinement (§4.2)
Each of the $D$ decoder layers refines the box from the previous layer rather than predicting independently. The reference point for layer $d$ is the box center predicted by layer $d-1$, and sampling offsets are modulated by that box’s predicted width and height, so the attention window shrinks as the box tightens. Detection heads are not shared across layers. See A.4 for the full formula and stop-gradient details.
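A hedged NumPy sketch of the refinement loop follows. It assumes the update takes the form "new box = sigmoid(delta + inverse sigmoid of the previous box)" and that offsets are scaled by the previous box's width and height; the exact scaling constant, the initial box, and all function names are assumptions here, and the random `delta` is only a stand-in for a per-layer detection head. NumPy has no autograd, so the stop-gradient on the previous box appears only as a comment.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def inverse_sigmoid(p, eps=1e-5):
    p = np.clip(p, eps, 1 - eps)   # eps is an assumed clipping constant
    return np.log(p / (1 - p))

def refine_box(prev_box, delta):
    """One refinement step on a normalized (cx, cy, w, h) box.
    In a real training setup prev_box would be detached (stop-gradient)."""
    return sigmoid(delta + inverse_sigmoid(prev_box))

def modulated_sample_point(prev_box, offset):
    """Sampling location = box center + offset scaled by the box width/height."""
    cx, cy, w, h = prev_box
    return np.array([cx + offset[0] * w, cy + offset[1] * h])

# Hypothetical initial box and D = 6 layers.
box = np.array([0.5, 0.5, 0.1, 0.1])
rng = np.random.default_rng(0)
for d in range(6):
    delta = 0.1 * rng.standard_normal(4)   # stand-in for layer d's detection head output
    box = refine_box(box, delta)
```

With this scaling, the same raw offset lands closer to the center when the box is small, which is how the attention window tightens along with the box.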
Relation to deformable convolution
Setting $L = 1$, $K = 1$, and $W'_m$ fixed to the identity matrix recovers deformable convolution exactly. The lineage is Deformable Convolution → DeformAttn, not MHA with deformable sampling bolted on.
When does DeformAttn apply?
DeformAttn is a spatial-domain trick. It has three implicit assumptions that all depend on the feature map being a spatially coherent conv grid:
- Reference points are meaningful 2D coordinates in a continuous space.
- Offsets can point anywhere in that space with real-valued precision.
- Bilinear interpolation can retrieve features at fractional coordinates.
This is why it works when you combine attention with CNNs (conv feature maps are dense, spatially coherent, and continuously interpolable) and why it doesn’t directly translate to a ViT: a ViT token at position $i$ has no neighbor at $i + 0.5$; there is nothing to interpolate between.
The paper’s 2020 publication date matters here: this is still the CNN+attention hybrid era. The implicit assumption that you have a real spatial feature map is never stated because it was universal.
Rule of thumb
Use DeformAttn when (a) your keys live in a spatially structured feature map, and (b) you know the relevant information is spatially local to the query but don’t know exactly where. If the feature map has no spatial meaning (flat ViT tokens, language sequences), the offset mechanism loses its grounding.