The following notes come from my discussion with Claude Sonnet 4.6

Stable Diffusion 3 — Scaling Rectified Flow Transformers

Abstract

SD3 introduces the Multimodal Diffusion Transformer (MMDiT) — a rectified-flow model that treats text and image tokens as two parallel streams with separate weights but joint attention. Key contributions: (1) a careful empirical study of timestep sampling schedules, (2) the MMDiT architecture, (3) improved text conditioning via multiple encoders, and (4) resolution-aware positional encoding and timestep shifting.

1. Timestep Sampling: Why Logit-Normal?

1.1 The Core Problem

Rectified flow trains with a uniform distribution over $t \in [0, 1]$ , but the prediction difficulty is not uniform across timesteps. At the extremes:

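$t \approx 0$ : image is nearly clean → the data is already known, so the optimal velocity prediction only needs the mean of the noise distribution $p_{1}$ (trivial)
$t \approx 1$ : image is nearly pure noise → the noise is already known, so the optimal prediction only needs the mean of the data distribution $p_{0}$ (also trivial)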

The hard, information-rich region is the middle, where signal and noise are genuinely mixed. We want to oversample it.

1.2 Log-SNR: The Natural Coordinate

For rectified flow with $z_{t} = (1 - t) x_{0} + t ϵ$ , the signal coefficient is $α_{t} = 1 - t$ and the noise coefficient is $β_{t} = t$ , so the log signal-to-noise ratio is:

$$λ(t) = \log \frac{α_{t}^{2}}{β_{t}^{2}} = \log \frac{(1 - t)^{2}}{t^{2}} = 2 \log \frac{1 - t}{t} = - 2 \operatorname{logit}(t)$$

$λ$ is the natural axis for measuring difficulty: each unit of $λ$ multiplies or divides the signal-to-noise ratio by a constant factor ( $e$ , for the natural log). At $t = 0.5$ , $λ = 0$ : equal signal and noise. As $t \to 0$ , $λ \to + \infty$ (all signal); as $t \to 1$ , $λ \to - \infty$ (all noise).

1.3 Why Logit-Normal Specifically

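If prediction difficulty were uniform per unit of $λ$ (each interval of log-SNR deserving equal training), we would want $λ$ uniformly distributed. A uniform distribution over the whole real line is improper, though, and its extremes are exactly the trivial regions from 1.1. The practical compromise is a Gaussian on $λ$ , centered on the hard mid-SNR range. Since $λ$ is an affine function of $\operatorname{logit}(t)$ , a Gaussian on $λ$ is a Gaussian on $\operatorname{logit}(t)$ , which makes $t$ logit-normal: sample $u \sim N(m, s^{2})$ , then set $t = \operatorname{sigmoid}(u)$ .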

Concretely, the density is:

$$π_{ln}(t; m, s) = \frac{1}{s \sqrt{2 π}} \cdot \frac{1}{t (1 - t)} \exp\left(- \frac{(\operatorname{logit}(t) - m)^{2}}{2 s^{2}}\right)$$

Location $m$ : shifts weight toward data ( $m < 0$ ) or noise ( $m > 0$ ). Default $m = 0$ peaks at $t = 0.5$ .
Scale $s$ : controls width. Larger $s$ → flatter, closer to uniform.
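Tails vanish at 0 and 1: the $1 / (t (1 - t))$ factor is the Jacobian of the logit transform, and the Gaussian factor decays faster than this Jacobian grows, so the density goes to zero at both ends and no samples are wasted on the trivial endpoints.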
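A minimal NumPy sketch of the sampler and density described above, to make the "Gaussian on the logit, then invert" recipe concrete. The function names and the sample count are illustrative choices, not anything from the SD3 code base.

```python
import numpy as np

def sample_logit_normal(n, m=0.0, s=1.0, rng=None):
    """Draw n timesteps t in (0, 1) with logit(t) ~ Normal(m, s^2)."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.normal(loc=m, scale=s, size=n)   # Gaussian in logit space
    return 1.0 / (1.0 + np.exp(-u))          # sigmoid maps back to (0, 1)

def logit_normal_pdf(t, m=0.0, s=1.0):
    """Density of the logit-normal distribution at t in (0, 1)."""
    logit = np.log(t) - np.log1p(-t)
    gauss = np.exp(-((logit - m) ** 2) / (2.0 * s**2)) / (s * np.sqrt(2.0 * np.pi))
    return gauss / (t * (1.0 - t))           # 1/(t(1-t)) is d logit(t) / dt

rng = np.random.default_rng(0)
t = sample_logit_normal(200_000, rng=rng)

# Fraction of samples in the "hard" middle half of [0, 1];
# a uniform sampler would put exactly half of its mass here.
mid_fraction = np.mean((t > 0.25) & (t < 0.75))
print(f"fraction of samples in (0.25, 0.75): {mid_fraction:.3f}")
```

With $m = 0$ and $s = 1$ , noticeably more than half of the samples land in the middle half of the interval, which is the oversampling that 1.1 asks for.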

1.4 Comparison: π Series Prioritizes Opposite End

SD3’s logit-normal peaks in the middle because images need both coarse structure and fine detail, with no strong asymmetry between the two.

Pi 0 and the RECAP series (robot learning) do the opposite: up-weight high-noise timesteps (large $t$ , low SNR). For robot action generation, coarse trajectory correctness dominates — a wrong global motion plan fails the task regardless of fine detail. Fine denoising at small $t$ is cheap to recover from; coarse denoising at large $t$ is not. Opposite asymmetry from image generation.

Key takeaway

The right timestep distribution reflects the loss asymmetry of your task. SD3: symmetric → logit-normal. Pi 0: coarse matters more → up-weight high noise.

2. Architecture: MMDiT

2.1 Overview

The architecture has three components: text conditioning, the MMDiT backbone, and the VAE. See DiT and Flow Matching for the underlying building blocks.

2.2 Two Conditioning Signals: `c_vec` → `y` and `c_ctxt`

The text is encoded by three frozen models and split into two representations with fundamentally different roles:

c_vec (pooled, global) → becomes y

CLIP-L and OpenCLIP-G pooled outputs are concatenated → $c_{vec} \in R^{2048}$ . Combined with the timestep embedding, fed through an MLP to produce scale/shift/gate parameters of adaLN-zero at every MMDiT block.

Carries: what kind of image is this? — holistic semantic gist
The timestep $t$ naturally lives here too: also a scalar global signal
Mechanism: modulates the gain and bias of every activation uniformly
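The global path can be sketched in a few lines of NumPy. This is an illustration under assumptions, not SD3's implementation: the hidden width `D`, the single linear projection `W_vec`, the random stand-in for the timestep embedding, and the `block_fn` sub-layer are all hypothetical. Only the 768 + 1280 = 2048 pooled widths and the scale/shift/gate structure come from the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64            # hypothetical hidden width, kept small for illustration
N_TOKENS = 10     # hypothetical number of tokens in one stream

def silu(x):
    return x / (1.0 + np.exp(-x))

def layer_norm(x, eps=1e-6):
    """LayerNorm without learned affine parameters, over the last axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# Pooled text vector: CLIP-L (768) and OpenCLIP-G (1280) concatenated -> 2048.
c_vec = np.concatenate([rng.normal(size=768), rng.normal(size=1280)])
t_emb = rng.normal(size=D)                  # stand-in for the timestep embedding

# Assumption: project c_vec to the hidden width and add the timestep embedding.
W_vec = rng.normal(size=(2048, D)) / np.sqrt(2048)
y = c_vec @ W_vec + t_emb

# adaLN-zero: a zero-initialised linear layer maps y to (shift, scale, gate).
W_mod = np.zeros((D, 3 * D))
b_mod = np.zeros(3 * D)
shift, scale, gate = np.split(silu(y) @ W_mod + b_mod, 3)

def block_fn(h):
    """Hypothetical stand-in for the attention or MLP sub-layer."""
    return np.tanh(h)

x = rng.normal(size=(N_TOKENS, D))
h = layer_norm(x) * (1.0 + scale) + shift   # same modulation for every token
out = x + gate * block_fn(h)                # gate starts at zero -> identity
```

Because the modulation layer starts at zero, every block is the identity at initialization, and `y` only gains influence as `W_mod` is trained.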

c_ctxt (full sequence, local) → joint attention

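The penultimate hidden states of the two CLIP encoders are concatenated channel-wise (width 2048) and zero-padded to 4096, then concatenated along the sequence axis with the T5-XXL hidden states → $c_{ctxt} \in R^{154 \times 4096}$ (77 CLIP + 77 T5 tokens). This sequence is concatenated with the image patch tokens for bidirectional joint self-attention.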

Carries: which words say what, and where? — token-level spatial grounding
Mechanism: cross-token attention lets patches route to relevant words

The 77-token limit on CLIP comes from its fixed positional embedding table (77 positions, including the start and end tokens, which leaves 75 for content). T5 is also truncated to 77 for uniform concatenation.
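The shape bookkeeping for the context sequence is easy to get wrong, so here is a small sketch with random arrays standing in for the encoder outputs. The per-encoder widths 768 and 1280 are my assumption for the two CLIP models; they are consistent with the 2048 total quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
SEQ = 77  # tokens per encoder

# Random stand-ins for the hidden states of each frozen encoder.
clip_l = rng.normal(size=(SEQ, 768))
clip_g = rng.normal(size=(SEQ, 1280))
t5 = rng.normal(size=(SEQ, 4096))

clip = np.concatenate([clip_l, clip_g], axis=-1)            # (77, 2048), channel-wise
clip = np.pad(clip, ((0, 0), (0, 4096 - clip.shape[-1])))   # (77, 4096), zero-padded
c_ctxt = np.concatenate([clip, t5], axis=0)                 # (154, 4096), sequence-wise

print(c_ctxt.shape)
```

Note the two different concatenation axes: the CLIP encoders are stacked along channels, while CLIP and T5 are stacked along the token axis.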

Do we actually need y / c_vec in addition to c_ctxt?

Honestly unclear. The pooled vector has a different representational character — CLIP’s [EOS] token is contrastively trained for global image-text similarity, not token-level semantics. adaLN is also a cheaper and more direct broadcast path than attention.

But there is no ablation isolating c_vec’s contribution while keeping c_ctxt. It was inherited from SDXL (found empirically to help) and never seriously questioned. The timestep needs to live somewhere global — adaLN is the natural home — and c_vec is just concatenated to it cheaply.

2.3 MMDiT as Co-Attending Independent Streams

The dual-stream design is often loosely described as a hard-routed MoE, but the analogy is imprecise and worth unpacking. There is a spectrum of parameter sharing:

| | QKV | Attention context | FFN |
|---|---|---|---|
| Standard MoE | Shared | Shared | Separate (routed) |
| MMDiT / Pi0 | Separate per modality | Shared (K, V concatenated) | Separate per modality |
| Fully separate | Separate | Separate | Separate |

MMDiT is actually more separated than standard MoE, not a variant of it. Standard MoE’s logic is: tokens live in one shared representation space (shared QKV), but per-position computation specializes (separate FFN). MMDiT says: modalities have such different statistical characters that even the projection into Q/K/V space should be separate — but they still need to cross-attend.

The only joint operation is the attention context: each stream’s K and V are concatenated before softmax, so every token (text or image) attends over the full combined key-value set. A better name than “MoE” is co-attending independent streams — two fully independent weight sets that read each other’s working memory through attention.
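A single-head NumPy sketch of this joint attention step, assuming a shared width `D` for both streams and omitting multi-head splitting, QK-norm, adaLN, and the output projections. All sizes and weight names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32                       # hypothetical model width (single head)
N_TXT, N_IMG = 5, 12         # hypothetical token counts

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def make_qkv_weights():
    return [rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3)]

W_txt = make_qkv_weights()   # text-stream projections
W_img = make_qkv_weights()   # image-stream projections (separate weights)

def joint_attention(x_txt, x_img):
    """Each stream projects with its own weights; attention runs over both."""
    n_txt = x_txt.shape[0]
    q_t, k_t, v_t = (x_txt @ W for W in W_txt)
    q_i, k_i, v_i = (x_img @ W for W in W_img)
    q = np.concatenate([q_t, q_i], axis=0)
    k = np.concatenate([k_t, k_i], axis=0)
    v = np.concatenate([v_t, v_i], axis=0)
    attn = softmax(q @ k.T / np.sqrt(D))     # full (text + image) square, no mask
    out = attn @ v
    return out[:n_txt], out[n_txt:], attn    # split back into the two streams

x_txt = rng.normal(size=(N_TXT, D))
x_img = rng.normal(size=(N_IMG, D))
out_txt, out_img, attn = joint_attention(x_txt, x_img)
```

After the split, each stream would continue through its own output projection and FFN, so the attention matrix is the only place where the two weight sets interact.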

Pi 0 uses the exact same mechanism — separate QKV weights per expert, concatenated for a single joint attention pass. See Pi 0 for implementation details and the distinction between the paper’s “blockwise causal mask” framing and what the code actually does.

2.4 Improved Text Encoders and Synthetic Captions

SD3 uses a larger text-encoder stack than earlier SD versions (CLIP-L + OpenCLIP bigG + T5-XXL, vs CLIP-L alone in the original SD). Training images are re-captioned by a separate VLM that generates dense, descriptive captions; the model sees the original human caption and the synthetic VLM caption at a 50/50 ratio. The synthetic captions are more compositionally detailed and help the model learn fine-grained attribute binding. See Hi Robot for a similar synthetic captioning approach applied to robot data.

3. QK Normalization

SD3 applies RMSNorm with learnable scale to Q and K vectors in both streams before computing attention logits.

The problem: when fine-tuning at higher resolutions, the patch token count grows quadratically with image side length, attention logits grow unboundedly → entropy explodes → training diverges. First documented for large ViTs (Dehghani et al. 2023, ViT-22B).

Why it works: after normalization, $\|q\|$ and $\|k\|$ are fixed by the learned scale, so every attention logit satisfies $|q^{\top} k| \leq \|q\| \|k\| \leq C$ , preventing softmax saturation and “winner-take-all” collapse.
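The bound can be checked numerically. The sketch below uses a plain RMSNorm with unit gain (the learnable scale is left as an optional argument) and deliberately oversized activations; the sizes and the factor of 100 are arbitrary illustrative choices.

```python
import numpy as np

def rms_norm(x, gain=None, eps=1e-6):
    """RMSNorm over the last axis; gain is the learnable per-channel scale."""
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    y = x / rms
    return y if gain is None else y * gain

rng = np.random.default_rng(0)
d = 64
# Deliberately large activations, standing in for an unstable training run.
q = 100.0 * rng.normal(size=(128, d))
k = 100.0 * rng.normal(size=(128, d))

raw_logits = q @ k.T / np.sqrt(d)
normed_logits = rms_norm(q) @ rms_norm(k).T / np.sqrt(d)

# With unit gain each normalised vector has norm at most sqrt(d), so by
# Cauchy-Schwarz |logit| <= sqrt(d) * sqrt(d) / sqrt(d) = sqrt(d).
bound = np.sqrt(d)
print(np.abs(raw_logits).max(), np.abs(normed_logits).max(), bound)
```

With a learned gain the constant changes, but the logits stay bounded by a quantity that depends only on the gain and the head dimension, not on the activations.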

Is it specific to diffusion models? No, but it is more critical for them. Variable-resolution training creates extreme sequence length variation, triggering logit explosion more acutely than fixed-length LLM training. LLMs that use it: Gemma 3, OLMo 2, OpenELM. One notable incompatibility: QK-norm requires materializing full Q/K vectors, making it incompatible with MLA (DeepSeek’s Multi-head Latent Attention), where Q/K are reconstructed from low-rank factors at inference time.

Adopted as a headline change in SD3.5 (enabling stable 8B training), and present in Flux.1 (both double- and single-stream blocks).

Tip

QK-norm constrains Q and K to a hypersphere (up to the learned scale), so all comparisons become scaled cosine similarities: the normalized dot product lies in $[- 1, 1]$ regardless of depth or sequence length.

4. Positional Encoding for Varying Aspect Ratios

SD3 uses 2D sinusoidal frequency embeddings over a canonical coordinate grid. The challenge: embeddings must be physically consistent across aspect ratios — “far right” should mean the same thing in square, wide, or tall images.

The approach: build a canonical grid spanning the maximum extent across all aspect ratio buckets:

$$\mathrm{grid}_{h} = \left( \frac{p - \frac{h_{max} - s}{2}}{S / 256} \right)_{p = 0}^{h_{max} - 1}$$

and similarly for the width,

where $s = S / 16$ is the side length in patch tokens of a square image (after VAE downsampling and patching), $h_{max}$ is the height in tokens of the tallest latent across all buckets, and $S$ is the target resolution in pixels. For any specific image, take a center crop of this canonical grid.

Why center-crop rather than interpolate? ViT-style interpolation distorts physical meaning — “position 50” means a different spatial fraction at different aspect ratios. Center-cropping from a fixed coordinate system preserves it: every position value corresponds to a fixed spatial distance regardless of image shape.
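A one-axis sketch of the canonical grid and the center crop, following the formula above. The value of `h_max` and the 96-token bucket are made-up examples; the real values depend on the aspect-ratio buckets used in training.

```python
import numpy as np

def canonical_axis(h_max, S):
    """Positions along one axis of the canonical grid."""
    s = S // 16                                  # square side length in tokens
    p = np.arange(h_max)
    return (p - (h_max - s) / 2) * 256 / S

def center_crop(axis, length):
    start = (len(axis) - length) // 2
    return axis[start:start + length]

S = 1024
h_max = 128   # hypothetical tallest bucket, in tokens (not from the source)
axis = canonical_axis(h_max, S)

square = center_crop(axis, S // 16)   # 64 tokens for a 1024x1024 image
tall = center_crop(axis, 96)          # hypothetical 96-token-tall bucket

print(square[0], square[-1])          # 0.0 15.75
print(tall[0], tall[-1])              # -4.0 19.75
```

Under these assumptions the square image spans positions from 0 to just under 16, the same range a 16-token axis at $256^{2}$ would cover, and the taller bucket extends that range symmetrically instead of rescaling it.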

RoPE supersedes this

Flux.1 replaces sinusoidal absolute embeddings with RoPE — positions encoded as rotations of Q/K vectors, so only relative positions enter the attention score. The canonical-grid + center-crop mechanism becomes unnecessary. RoPE generalizes to unseen resolutions without any coordinate bookkeeping.
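The relative-position property can be seen in a simplified 1D sketch. This is a generic RoPE illustration, not Flux.1's implementation (which applies rotations per spatial axis); the `base` value and vector size are conventional defaults chosen for the example.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive channel pairs of x by angles proportional to pos."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)    # (d/2,) rotation frequencies
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)

score_a = rope(q, 3) @ rope(k, 7)      # offset 4, near the origin
score_b = rope(q, 103) @ rope(k, 107)  # same offset, shifted by 100
print(score_a, score_b)                # equal up to floating-point error
```

Shifting both positions by the same amount leaves the attention score unchanged, which is why no canonical coordinate grid is needed.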

5. Resolution-Dependent Timestep Shifting

Core observation: the same $t$ destroys different amounts of signal at different resolutions. A model trained at $256^{2}$ is miscalibrated at $1024^{2}$ .

Derivation via uncertainty matching:

Consider a constant image (every pixel = $c$ ) with $n = H \times W$ pixels. Forward process: $z_{t} = (1 - t) c \mathbb{1} + t ϵ$ . To recover $c$ , average the pixels:

$$\hat{c} = \frac{1}{1 - t} \cdot \frac{1}{n} \sum_{i} z_{t, i}, \qquad σ(\hat{c}) = \frac{t}{(1 - t) \sqrt{n}}$$

Higher resolution → more pixels → lower uncertainty at the same $t$ , because averaging $n$ independent noise samples shrinks the standard deviation by $\sqrt{n}$ . Matching uncertainty $σ (t_{n}, n) = σ (t_{m}, m)$ :

$$t_{m} = \frac{α \, t_{n}}{1 + (α - 1) \, t_{n}}, \qquad α = \sqrt{\frac{m}{n}}$$
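The uncertainty of the averaged estimate can be checked with a quick Monte Carlo run. For independent unit-variance noise the standard deviation of a mean over $n$ pixels shrinks as $1 / \sqrt{n}$ , so the estimate should have standard deviation $t / ((1 - t) \sqrt{n})$ . The constant `c`, the timestep, and the trial count below are arbitrary choices for the sketch.

```python
import numpy as np

def estimate_c_std(t, n, c=0.3, trials=4000, rng=None):
    """Empirical std of c_hat for a constant image with n pixels at time t."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.normal(size=(trials, n))
    z_t = (1.0 - t) * c + t * eps                 # forward process per pixel
    c_hat = z_t.mean(axis=1) / (1.0 - t)          # average, then undo (1 - t)
    return c_hat.std()

rng = np.random.default_rng(0)
t = 0.6
results = {}
for n in (64, 256, 1024):
    empirical = estimate_c_std(t, n, rng=rng)
    predicted = t / ((1.0 - t) * np.sqrt(n))
    results[n] = (empirical, predicted)
    print(f"n={n:5d}  empirical={empirical:.4f}  predicted={predicted:.4f}")
```

Each fourfold increase in pixel count should roughly halve the uncertainty at a fixed $t$ .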

For $m > n$ : $α > 1$ , so $t_{m} > t_{n}$ , i.e. a shift toward later timesteps (more noise) at higher resolution. In log-SNR coordinates, this is simply a constant translation: $λ_{t_{m}} = λ_{t_{n}} - 2 \log α$ .

In practice, $α = 3.0$ for $1024^{2}$ training (found via human preference study), applied at both training and sampling time.
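A short numerical check of the shift and of the constant translation in log-SNR. It assumes the forward process from the derivation, $z_{t} = (1 - t) x_{0} + t ϵ$ , so the signal coefficient is $1 - t$ and the noise coefficient is $t$ ; the function names are illustrative.

```python
import numpy as np

def shift_timestep(t, alpha):
    """Map t_n to t_m = alpha * t / (1 + (alpha - 1) * t)."""
    return alpha * t / (1.0 + (alpha - 1.0) * t)

def log_snr(t):
    """Log-SNR for z_t = (1 - t) * x0 + t * eps: signal (1 - t), noise t."""
    return 2.0 * np.log((1.0 - t) / t)

alpha = 3.0
t_n = np.linspace(0.05, 0.95, 19)
t_m = shift_timestep(t_n, alpha)
delta = log_snr(t_m) - log_snr(t_n)      # constant, equal to -2 * log(alpha)

print(shift_timestep(0.5, alpha))        # 0.75
print(delta.min(), delta.max(), -2.0 * np.log(alpha))
```

With $α = 3.0$ the midpoint $t = 0.5$ moves to $0.75$ , and every timestep loses the same amount of log-SNR.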

Note

The derivation assumes a constant image, which is unrealistic. It gives the right functional form; the exact $α$ is tuned empirically.

6. Subsequent Work (Brief)

SD3.5: QK-norm added (enabling stable 8B training), dual attention layers per MMDiT block (or image-only self-attention in Medium variant, MMDiT-X). Same text encoders and VAE.
Flux.1: hybrid of 19 dual-stream (MMDiT) + 38 single-stream blocks with shared weights. Replaces sinusoidal positions with RoPE. Drops CLIP-G. Adds guidance distillation. Dual→single stream progression: early layers need modality-specialized experts; later layers share capacity once representations have aligned.

Yanda's Random Notes
