Source

All of this comes from the course material for Introduction to Flow Matching and Diffusion Models (MIT Computer Science Class 6.S184: Generative AI with Stochastic Differential Equations).

Treat the course notes as the ground truth; here I only list the main ideas for a quick overview.


Some parts of the notes below come from my conversation with Claude Sonnet 4.6 while reading the Stable Diffusion 3 paper. Most of the notes are handwritten.


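The goal is to go from $p_{\text{init}}$ (noise) to $p_{\text{data}}$ (data). There are several ways to do that. If we try to predict a data sample $z$ directly from noise in one shot, that is essentially a GAN. The problem is that GAN training is unstable because the training signal is sparse.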

Imagine, for example, that we want to turn a pile of dirt into a mountain: someone just telling you whether you hit or missed is not enough. Flow matching basically says, “Why don’t we provide guidance along the way?”, so the whole thing is easier to fit.

We could say that flow matching converts a generation problem into a supervised learning problem: the model is supervised on the velocity along a path that we know works.

There is another critical reason why flow matching works so well: it only needs some trajectories that work. It doesn’t require a single, definitive trajectory; it can be trained on multiple hypotheses at once because of a very neat mathematical property. We’ll see that later.


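A flow model is then described by the ODE

\[
X_0 \sim p_{\text{init}}, \qquad \frac{\mathrm{d}}{\mathrm{d}t} X_t = u_t^\theta(X_t).
\]

Our goal is to make the endpoint of the trajectory have distribution $p_{\text{data}}$, i.e.

\[
X_1 \sim p_{\text{data}}.
\]

The problem: to train $u_t^\theta$, we’d need to supervise it on the marginal velocity field:

\[
u_t^{\text{target}}(x) = \int u_t^{\text{target}}(x|z) \, \frac{p_t(x|z)\, p_{\text{data}}(z)}{p_t(x)} \, \mathrm{d}z.
\]

This is intractable: it requires averaging over all $z$ consistent with $x$, which we can’t compute.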


The Key Theorem: Conditional = Marginal

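The neat property that makes flow matching work: the conditional and marginal losses have the same minimizer (they differ only by a constant w.r.t. $\theta$):

\[
\underbrace{\mathbb{E}_{t,\, x \sim p_t}\left[ \|u_t^\theta(x) - u_t^{\text{target}}(x)\|^2 \right]}_{\mathcal{L}_{\text{FM}}(\theta)}
= \underbrace{\mathbb{E}_{t,\, z \sim p_{\text{data}},\, x \sim p_t(\cdot|z)}\left[ \|u_t^\theta(x) - u_t^{\text{target}}(x|z)\|^2 \right]}_{\mathcal{L}_{\text{CFM}}(\theta)} + C.
\]

So instead of supervising on the intractable marginal $u_t^{\text{target}}(x)$, we supervise on the conditional velocity $u_t^{\text{target}}(x|z)$, which we can compute because we sampled $z$ from the dataset. See notes p.20 for the proof.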
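For orientation, the core of that proof fits in a few lines. This is only a sketch, not the course notes' full argument: $\mathcal{L}_{\text{FM}}$ and $\mathcal{L}_{\text{CFM}}$ are shorthand assumed here for the marginal and conditional losses, $C_1$ and $C_2$ collect the terms that do not depend on $\theta$, and integrability is taken for granted.

```latex
\begin{aligned}
\mathcal{L}_{\text{FM}}(\theta)
  &= \mathbb{E}_{t,\,x \sim p_t}\big[\|u_t^\theta(x)\|^2
     - 2\langle u_t^\theta(x),\, u_t^{\text{target}}(x)\rangle\big] + C_1, \\
\mathbb{E}_{x \sim p_t}\big[\langle u_t^\theta(x),\, u_t^{\text{target}}(x)\rangle\big]
  &= \int \Big\langle u_t^\theta(x),\,
     \int u_t^{\text{target}}(x|z)\,\frac{p_t(x|z)\,p_{\text{data}}(z)}{p_t(x)}\,\mathrm{d}z
     \Big\rangle\, p_t(x)\,\mathrm{d}x \\
  &= \mathbb{E}_{z \sim p_{\text{data}},\,x \sim p_t(\cdot|z)}
     \big[\langle u_t^\theta(x),\, u_t^{\text{target}}(x|z)\rangle\big], \\
\mathcal{L}_{\text{FM}}(\theta)
  &= \mathbb{E}_{t,\,z \sim p_{\text{data}},\,x \sim p_t(\cdot|z)}\big[\|u_t^\theta(x)\|^2
     - 2\langle u_t^\theta(x),\, u_t^{\text{target}}(x|z)\rangle\big] + C_1 \\
  &= \mathcal{L}_{\text{CFM}}(\theta) + C_1 - C_2 .
\end{aligned}
```

The only step that uses anything specific is the middle one: the $p_t(x)$ in the definition of the marginal field cancels against the sampling density, which turns the expectation over $x \sim p_t$ into one over $(z, x)$ pairs.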

This also means we don’t need a single correct trajectory. Any set of conditional paths that reach $z$ at $t = 1$ will do; the model learns to average over them.
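A small numerical check can make the "learns to average" claim concrete. The sketch below is a toy under stated assumptions, not part of the course material: 1-D data $z \sim \mathcal{N}(0, s^2)$, noise $\epsilon \sim \mathcal{N}(0, 1)$, the linear path $x_t = tz + (1-t)\epsilon$, and a single fixed time $t$. In that setting the marginal velocity is linear in $x$ with a known slope, and an ordinary least-squares fit on the conditional targets $z - \epsilon$ should land close to it. The function names are made up for illustration.

```python
import random

def fit_slope_on_conditional_targets(t=0.5, s=2.0, n=20000, seed=0):
    """Least-squares slope (no intercept) of conditional targets z - eps against x_t.

    Toy assumptions: z ~ N(0, s^2), eps ~ N(0, 1), x_t = t*z + (1-t)*eps.
    """
    rng = random.Random(seed)
    sxx = sxy = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, s)
        eps = rng.gauss(0.0, 1.0)
        x = t * z + (1.0 - t) * eps   # point on the conditional path
        y = z - eps                   # conditional velocity target
        sxx += x * x
        sxy += x * y
    return sxy / sxx

def marginal_slope(t=0.5, s=2.0):
    """Closed-form slope c(t) of the marginal field u_t(x) = c(t) * x in this toy."""
    return (t * s * s - (1.0 - t)) / (t * t * s * s + (1.0 - t) ** 2)

fitted = fit_slope_on_conditional_targets()
exact = marginal_slope()
print(f"fitted slope {fitted:.3f} vs. marginal slope {exact:.3f}")
```

The regression never sees the marginal field, only individual $(x_t, z - \epsilon)$ pairs, yet its minimizer is the conditional mean, which is the marginal velocity.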


Constructing the Conditional Path

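To compute $u_t^{\text{target}}(x|z)$, we need to define a path from noise to data. We build an interpolant:

\[
x_t = \alpha_t z + \beta_t \epsilon, \qquad \epsilon \sim p_{\text{init}}.
\]

Boundary conditions: $\alpha_0 = 0,\ \beta_0 = 1$ and $\alpha_1 = 1,\ \beta_1 = 0$, i.e. pure noise at $t = 0$, pure data at $t = 1$.

When $\epsilon \sim \mathcal{N}(0, I_d)$ and the interpolant is affine like this, the conditional distribution is Gaussian:

\[
p_t(x|z) = \mathcal{N}(x;\, \alpha_t z,\, \beta_t^2 I_d).
\]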

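This is a Gaussian conditional path; with the linear schedule $\alpha_t = t$, $\beta_t = 1 - t$ it is the Gaussian CondOT path. “Gaussian” means the conditional path $p_t(x|z)$ is Gaussian. “CondOT” (conditional optimal transport) refers to the schedule: conditioned on $z$, the straight line from $\epsilon$ to $z$ is the optimal-transport displacement. How $z$ and $\epsilon$ are paired is a separate choice, the coupling; more below.

Why Gaussian? The closed-form chain

Because $x_t = \alpha_t z + \beta_t \epsilon$ is affine in $(z, \epsilon)$, you can differentiate it directly w.r.t. $t$ to get $u_t^{\text{target}}(x|z)$ in closed form. That’s the whole point: Gaussian path → affine $x_t$ → closed-form velocity by calculus. Non-Gaussian paths exist (stochastic interpolants, discrete CTMC) but lose this.

The coupling is how to pair $z$’s with $\epsilon$’s. Simplest: sample them independently. Conditional trajectories then cross, so the marginal velocity at a crossing point is a blurry average of the conditional ones.

Alternative: minibatch OT coupling. Within each training batch, instead of using $(z_i, \epsilon_i)$ pairs in random sampling order, solve a small assignment problem to find the permutation $\pi$ that minimises $\sum_i \|z_i - \epsilon_{\pi(i)}\|^2$. This pairs data points with nearby noise samples, giving shorter and less curved trajectories. Fewer crossings → less ambiguity in $u_t^{\text{target}}(x)$ → easier to learn. It’s the same minibatch gradient descent loop, just with a smarter pairing step inside each batch. In practice the benefit is modest; SD3 and Flux use independent sampling and it works fine.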
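To make the pairing step concrete, here is a minimal sketch of the assignment problem on a tiny batch. It brute-forces every permutation, which is a stand-in chosen to keep the example dependency-free; realistic batch sizes would use a Hungarian-algorithm solver such as SciPy's `scipy.optimize.linear_sum_assignment`. The function name `minibatch_ot_pairing` and the sample points are hypothetical.

```python
from itertools import permutations

def minibatch_ot_pairing(zs, epss):
    """Return (perm, cost) where zs[i] is paired with epss[perm[i]].

    Brute force over all permutations: only sensible for tiny batches.
    zs and epss are equal-length lists of equal-length float vectors.
    """
    n = len(zs)

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # Pairwise squared distances, computed once.
    cost = [[sq_dist(z, e) for e in epss] for z in zs]

    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return list(best_perm), best_cost

# Hypothetical batch of three 2-D data points and three noise samples.
zs = [[0.0, 0.0], [10.0, 10.0], [5.0, 0.0]]
epss = [[9.0, 9.0], [4.0, 1.0], [1.0, -1.0]]
perm, total = minibatch_ot_pairing(zs, epss)
print(perm, total)
```

The rest of the training step is unchanged: build $x_t$ from $(z_i, \epsilon_{\pi(i)})$ instead of $(z_i, \epsilon_i)$.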

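Different methods just pick different $(\alpha_t, \beta_t)$ in $x_t = \alpha_t z + \beta_t \epsilon$:

\begin{tabular}{llll}
Method & $\alpha_t$ & $\beta_t$ & Notes \\
\hline
Rectified Flow (SD3, Flux) & $t$ & $1 - t$ & Linear: straight-line paths \\
DDPM (VP) & $\sqrt{\bar\alpha_t}$ & $\sqrt{1 - \bar\alpha_t}$ & Time runs backward ($0$ = data, $T$ = noise); $\bar\alpha_t$ is DDPM's own cumulative schedule \\
EDM & $1$ & $\sigma$ & Signal never scaled; $\sigma$ = noise level \\
\end{tabular}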

DDPM convention

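DDPM flips the time direction: $t = 0$ is clean data, $t = T$ is noise. Boundary conditions are reversed. Underneath, all methods sweep essentially the same signal-to-noise ratio (SNR) range, from almost pure noise to almost pure signal; they are different parameterizations of the same transport.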
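One way to see the "same transport" point is to match two parameterizations by SNR. The sketch below assumes the rectified-flow interpolant $x_t = tz + (1-t)\epsilon$ and an EDM-style interpolant $x = z + \sigma\epsilon$ (both scalar for brevity); the helper names are made up. Setting $\sigma = (1-t)/t$ equalizes the SNR, and the two noisy samples then differ only by the scale factor $t$.

```python
import math

def rf_snr(t):
    """SNR of the rectified-flow interpolant x_t = t*z + (1-t)*eps."""
    return (t / (1.0 - t)) ** 2

def edm_snr(sigma):
    """SNR of the EDM-style interpolant x = z + sigma*eps."""
    return 1.0 / sigma ** 2

def rf_time_to_edm_sigma(t):
    """Noise level sigma whose EDM SNR equals the RF SNR at time t (0 < t < 1)."""
    return (1.0 - t) / t

# Hypothetical scalar sample.
z, eps, t = 2.0, -1.0, 0.25
sigma = rf_time_to_edm_sigma(t)
x_rf = t * z + (1.0 - t) * eps
x_edm = z + sigma * eps
print(rf_snr(t), edm_snr(sigma), x_rf / t, x_edm)
```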


Deriving the Conditional Velocity

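Differentiate the interpolant w.r.t. $t$:

\[
\dot x_t = \dot\alpha_t z + \dot\beta_t \epsilon.
\]

This has two unknowns ($z$ and $\epsilon$), linked by $x_t = \alpha_t z + \beta_t \epsilon$. Eliminating $\epsilon = (x_t - \alpha_t z)/\beta_t$:

\[
u_t^{\text{target}}(x|z) = \left( \dot\alpha_t - \frac{\dot\beta_t}{\beta_t} \alpha_t \right) z + \frac{\dot\beta_t}{\beta_t}\, x.
\]

Rectified Flow specifically

$\alpha_t = t$, $\beta_t = 1 - t$, $\dot\alpha_t = 1$, $\dot\beta_t = -1$. Substituting $x = tz + (1 - t)\epsilon$:

\[
u_t^{\text{target}}(x|z) = \frac{z - x}{1 - t} = \frac{(1 - t)(z - \epsilon)}{1 - t} = z - \epsilon.
\]

The $(1 - t)$ cancels: the velocity is constant. Straight-line path = constant direction. This is why RF can be fast: an Euler step is exact along a straight line, so the straighter the learned trajectories, the fewer solver steps are needed.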
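The elimination step can be sanity-checked numerically. The sketch below assumes the affine interpolant $x_t = \alpha_t z + \beta_t \epsilon$ and the eliminated form $u_t(x|z) = (\dot\alpha_t - \tfrac{\dot\beta_t}{\beta_t}\alpha_t) z + \tfrac{\dot\beta_t}{\beta_t} x$; it confirms that this reduces to $z - \epsilon$ for rectified flow and to $\dot\alpha_t z + \dot\beta_t \epsilon$ for a curved schedule. The trigonometric schedule and all names are illustrative choices, not something the notes prescribe.

```python
import math

def conditional_velocity(x, z, alpha, beta, dalpha, dbeta):
    """u_t(x|z) for x_t = alpha*z + beta*eps, with eps eliminated.

    alpha, beta are the schedule values at time t; dalpha, dbeta are their
    time derivatives. Requires beta != 0.
    """
    return (dalpha - dbeta / beta * alpha) * z + dbeta / beta * x

# Rectified flow: alpha = t, beta = 1 - t.
t, z, eps = 0.3, 2.0, -1.0
x = t * z + (1.0 - t) * eps
u_rf = conditional_velocity(x, z, alpha=t, beta=1.0 - t, dalpha=1.0, dbeta=-1.0)

# Curved schedule for contrast (assumed): alpha = sin(pi t / 2), beta = cos(pi t / 2).
a, b = math.sin(math.pi * t / 2), math.cos(math.pi * t / 2)
da, db = math.pi / 2 * b, -math.pi / 2 * a
x_trig = a * z + b * eps
u_trig = conditional_velocity(x_trig, z, a, b, da, db)
print(u_rf, u_trig)
```

For the curved schedule the velocity depends on $t$, which is exactly what the linear schedule avoids.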


Reparameterization: What Should the Model Predict?

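Differentiating $x_t = \alpha_t z + \beta_t \epsilon$ gives $u = \dot\alpha_t z + \dot\beta_t \epsilon$, with two unknowns constrained by $x_t = \alpha_t z + \beta_t \epsilon$. Knowing $x_t$ and $t$, specifying either $z$ or $\epsilon$ determines the other, so there is one degree of freedom, and the model can predict any one of:

\begin{tabular}{lll}
Parameterization & Predicts & Recover $u$ (RF) \\
\hline
Velocity & $\hat u \approx z - \epsilon$ & use directly \\
Data ($z$-pred) & $\hat z$ & $\hat u = (\hat z - x_t)/(1 - t)$ \\
Noise ($\epsilon$-pred) & $\hat\epsilon$ & $\hat u = (x_t - \hat\epsilon)/t$ \\
\end{tabular}

All three are equivalent at the optimum, but imply different loss weighting over $t$. For RF, $\hat u - u = -(\hat\epsilon - \epsilon)/t$, so a plain noise-prediction MSE equals the velocity MSE scaled by $t^2$ at each timestep. Relative to velocity prediction it down-weights the high-noise end ($t \to 0$); conversely, the velocity loss up-weights high noise by $1/t^2$ relative to noise prediction. DDPM favored $\epsilon$-prediction because its implied weighting worked well in practice; RF naturally uses velocity since the target $z - \epsilon$ is constant along each conditional path.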
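The conversions between the three parameterizations are plain algebra on $x_t = tz + (1-t)\epsilon$ and $u = z - \epsilon$. A minimal sketch for the rectified-flow case follows; the function names are hypothetical and everything is scalar for brevity.

```python
import math

def velocity_from_data_pred(x_t, t, z_hat):
    """RF velocity implied by a data prediction z_hat (requires t < 1)."""
    return (z_hat - x_t) / (1.0 - t)

def velocity_from_noise_pred(x_t, t, eps_hat):
    """RF velocity implied by a noise prediction eps_hat (requires t > 0)."""
    return (x_t - eps_hat) / t

def data_and_noise_from_velocity(x_t, t, u_hat):
    """Invert x_t = t*z + (1-t)*eps and u = z - eps for (z, eps)."""
    z_hat = x_t + (1.0 - t) * u_hat
    eps_hat = x_t - t * u_hat
    return z_hat, eps_hat

# Hypothetical ground-truth pair and time.
z, eps, t = 2.0, -1.0, 0.2
x_t = t * z + (1.0 - t) * eps
u = z - eps

# A noise-prediction error of 0.1 becomes a velocity error of magnitude 0.1 / t.
u_from_noisy_eps = velocity_from_noise_pred(x_t, t, eps + 0.1)
print(u_from_noisy_eps - u)
```

The last two lines show where the reweighting comes from: the same error in $\hat\epsilon$ is amplified by $1/t$ when expressed as a velocity error.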

Why doesn't the model take $\epsilon$ as input?

At training you know $(z, \epsilon, x_t)$, all three. If the model saw $\epsilon$, it could recover $z = (x_t - \beta_t \epsilon)/\alpha_t$ algebraically, and nothing about $p_{\text{data}}$ would be learned.

At inference there is no “true” $(z, \epsilon)$ pair: you sample $\epsilon$ and run the ODE. After the first step, $x_t$ is slightly off any straight-line path due to finite step size, so feeding $\epsilon$ would give a spurious $z$ estimate. The model instead learns $u_t^{\text{target}}(x) = \mathbb{E}[u_t^{\text{target}}(x|z) \mid x_t = x]$, the average over all consistent pairs, which self-corrects under numerical drift. Not seeing $\epsilon$ is what forces the model to learn the data manifold.


What’s really going on in training and inference

In practice, we generate the training data ourselves. It is generated in a supervised-learning fashion, which here just means sampling:

  1. Your data sample $z$
  2. Your time step $t$
  3. Your noise $\epsilon$ (Gaussian noise or something similar)
  4. Calculate $x_t$ and the target velocity based on these, e.g. $x_t = tz + (1 - t)\epsilon$ with target $z - \epsilon$.

We just need to make sure the probability path we choose converges to $p_{\text{data}}$ at the end ($t = 1$) and equals $p_{\text{init}}$ at the beginning ($t = 0$).

\begin{algorithm}
\caption{Flow Matching Training Procedure (for Gaussian CondOT path $p_t(x|z) = \mathcal{N}(tz, (1 - t)^2 I_d)$)}
\begin{algorithmic}
\REQUIRE A dataset of samples $z \sim p_{\text{data}}$, neural network $u_t^\theta$
\FOR{each mini-batch of data}
    \STATE Sample a data example $z$ from the dataset.
    \STATE Sample a random time $t \sim \text{Unif}_{[0,1]}$.
    \STATE Sample noise $\epsilon \sim \mathcal{N}(0, I_d)$
    \STATE Set
    \[
    x = tz + (1 - t)\epsilon \quad \quad \text{(General case: } x \sim p_t(\cdot | z)\text{)}
    \]
    \STATE Compute loss
    \[
    \mathcal{L}(\theta) = \|u_t^\theta(x) - (z - \epsilon)\|^2 \quad \quad \text{(General case: } = \|u_t^\theta(x) - u_t^{\text{target}}(x|z)\|^2\text{)}
    \]
    \STATE Update $\theta \leftarrow \text{grad\_update}(\mathcal{L}(\theta))$.
\ENDFOR
\end{algorithmic}
\end{algorithm}
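The training procedure can be sketched end to end in a few lines of plain Python. Everything problem-specific here is an assumption made for illustration: 1-D data $z \sim \mathcal{N}(\mu, s^2)$ stands in for the dataset, a three-parameter linear model $u^\theta_t(x) = ax + bt + c$ stands in for the neural network, the batch size is one, and the gradient is written out by hand instead of using autograd.

```python
import random

def train_linear_flow(steps=20000, lr=0.01, mu=3.0, s=0.5, seed=0):
    """SGD on the conditional flow matching loss with a toy linear model.

    Assumptions: z ~ N(mu, s^2) plays the dataset and
    u_theta(x, t) = a*x + b*t + c plays the network.
    """
    rng = random.Random(seed)
    a = b = c = 0.0
    for _ in range(steps):
        z = rng.gauss(mu, s)                   # sample a data example
        t = rng.random()                       # t ~ Unif[0, 1)
        eps = rng.gauss(0.0, 1.0)              # sample noise
        x = t * z + (1.0 - t) * eps            # point on the conditional path
        err = (a * x + b * t + c) - (z - eps)  # prediction minus target
        # Gradient of err^2 w.r.t. (a, b, c) is 2*err*(x, t, 1).
        a -= lr * 2.0 * err * x
        b -= lr * 2.0 * err * t
        c -= lr * 2.0 * err
    return a, b, c

def cfm_loss(params, n=5000, mu=3.0, s=0.5, seed=1):
    """Monte Carlo estimate of the conditional flow matching loss."""
    a, b, c = params
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        z, t, eps = rng.gauss(mu, s), rng.random(), rng.gauss(0.0, 1.0)
        x = t * z + (1.0 - t) * eps
        total += ((a * x + b * t + c) - (z - eps)) ** 2
    return total / n

loss_before = cfm_loss((0.0, 0.0, 0.0))
loss_after = cfm_loss(train_linear_flow())
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

The loss does not go to zero even for a perfect model, since the conditional target $z - \epsilon$ is noisy given $x$; only its conditional mean is learnable.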
\begin{algorithm}
\caption{Sampling from a Flow Model with Euler method}
\begin{algorithmic}
\REQUIRE Neural network vector field $u_t^\theta$, number of steps $n$
\STATE Set $t = 0$
\STATE Set step size $h = \frac{1}{n}$
\STATE Draw a sample $X_0 \sim p_{\text{init}}$ \COMMENT{Random initialization!}
\FOR{$i = 1, \dots, n$}
    \STATE $X_{t+h} = X_t + h u_t^\theta(X_t)$
    \STATE Update $t \leftarrow t + h$
\ENDFOR
\RETURN $X_1$ \COMMENT{Return final point}
\end{algorithmic}
\end{algorithm}
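The sampler itself is short. In the sketch below a closed-form velocity field replaces the trained network so the result can be checked: for 1-D data $\mathcal{N}(\mu, s^2)$, noise $\mathcal{N}(0, 1)$ and the linear path, the exact flow sends $X_0$ to $\mu + s X_0$. That toy setup and the function names are assumptions for illustration. The loop takes $n$ Euler steps of size $h = 1/n$ so that $t$ ends at $1$.

```python
def gaussian_marginal_velocity(x, t, mu=3.0, s=0.5):
    """Exact marginal velocity for 1-D data N(mu, s^2) and noise N(0, 1)
    under the path x_t = t*z + (1-t)*eps (stands in for the trained u_theta)."""
    var_t = (t * s) ** 2 + (1.0 - t) ** 2
    return mu + (t * s * s - (1.0 - t)) / var_t * (x - t * mu)

def euler_sample(velocity, x0, n):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with n Euler steps."""
    x, t, h = x0, 0.0, 1.0 / n
    for _ in range(n):
        x = x + h * velocity(x, t)
        t += h
    return x

# Start from a fixed point instead of a random draw so the result is checkable.
x1 = euler_sample(gaussian_marginal_velocity, x0=1.0, n=1000)
print(x1)
```

For actual sampling, `x0` would be drawn from $p_{\text{init}}$, for example with `random.gauss(0.0, 1.0)`, and `velocity` would be the trained model.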