ELBO

This note comes from a discussion with Claude Opus 4.7 while reading a VAE tutorial.

The universal objective of Variational inference. Given a latent variable model $p_{θ}(x, z) = p_{θ}(x \mid z)\, p(z)$ with intractable posterior $p_{θ}(z \mid x)$, the ELBO (evidence lower bound) is a tractable lower bound on $\log p_{θ}(x)$ that we can actually optimize.

Definition

$$
\mathrm{ELBO}(θ, ϕ; x) = \mathbb{E}_{q_{ϕ}(z \mid x)}[\log p_{θ}(x \mid z)] - D_{\mathrm{KL}}(q_{ϕ}(z \mid x) \,\|\, p(z))
$$

Equivalent rewriting:

$$
\mathrm{ELBO} = \mathbb{E}_{q_{ϕ}(z \mid x)}[\log p_{θ}(x, z) - \log q_{ϕ}(z \mid x)]
$$

(Use $\log p_{θ}(x, z) = \log p_{θ}(x \mid z) + \log p(z)$ and group the $\log p(z) - \log q_{ϕ}(z \mid x)$ terms, whose expectation under $q_{ϕ}$ is the negative KL to the prior.)
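Written out line by line, the regrouping looks like this. It is only a worked version of the parenthetical above and uses no symbols beyond those in the definition:

$$
\begin{aligned}
\mathbb{E}_{q_{ϕ}(z \mid x)}[\log p_{θ}(x, z) - \log q_{ϕ}(z \mid x)]
&= \mathbb{E}_{q_{ϕ}(z \mid x)}[\log p_{θ}(x \mid z)] + \mathbb{E}_{q_{ϕ}(z \mid x)}[\log p(z) - \log q_{ϕ}(z \mid x)] \\
&= \mathbb{E}_{q_{ϕ}(z \mid x)}[\log p_{θ}(x \mid z)] - D_{\mathrm{KL}}(q_{ϕ}(z \mid x) \,\|\, p(z))
\end{aligned}
$$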

The identity

Start from $D_{\mathrm{KL}}(q_{ϕ}(z \mid x) \,\|\, p_{θ}(z \mid x))$ and expand using Bayes’ rule, $\log p_{θ}(z \mid x) = \log p_{θ}(x, z) - \log p_{θ}(x)$:

$$
\begin{aligned}
D_{\mathrm{KL}}(q_{ϕ}(z \mid x) \,\|\, p_{θ}(z \mid x))
&= \mathbb{E}_{q_{ϕ}}[\log q_{ϕ}(z \mid x) - \log p_{θ}(z \mid x)] \\
&= \mathbb{E}_{q_{ϕ}}[\log q_{ϕ}(z \mid x) - \log p_{θ}(x, z)] + \log p_{θ}(x)
\end{aligned}
$$

Rearranging gives the central identity:

$$
\log p_{θ}(x) = \underbrace{\mathbb{E}_{q_{ϕ}}[\log p_{θ}(x, z) - \log q_{ϕ}(z \mid x)]}_{\mathrm{ELBO}} + D_{\mathrm{KL}}(q_{ϕ}(z \mid x) \,\|\, p_{θ}(z \mid x))
$$

Three facts fall out:

Since KL is nonnegative, $\log p_{θ}(x) \geq \mathrm{ELBO}$, hence “lower bound.”
The gap is the KL between the approximate and true posterior.
The bound is tight when $q_{ϕ}(z \mid x) = p_{θ}(z \mid x)$.

This is the form to keep in your head: “log evidence = ELBO + gap” decomposes the intractable left side into one computable piece (the ELBO) and one nonnegative piece (the gap) that vanishes when $q$ matches the true posterior.
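To see the identity with actual numbers, here is a minimal Python sketch on a made-up model where $z$ takes only three values, so the evidence and the true posterior can be enumerated directly. The values in `prior`, `likelihood`, and `q` are arbitrary assumptions for illustration and do not come from the tutorial.

```python
import math

# Hypothetical toy model: latent z has three states, x is one fixed observation.
prior = [0.5, 0.3, 0.2]          # p(z)
likelihood = [0.10, 0.60, 0.30]  # p(x | z) evaluated at the observed x
q = [0.2, 0.5, 0.3]              # an arbitrary approximate posterior q(z | x)

joint = [pz * px for pz, px in zip(prior, likelihood)]  # p(x, z)
evidence = sum(joint)                                   # p(x); enumerable only because z is tiny
posterior = [j / evidence for j in joint]               # true p(z | x)

# ELBO in the "joint minus q" form.
elbo = sum(qz * (math.log(j) - math.log(qz)) for qz, j in zip(q, joint))

# ELBO in the "reconstruction minus KL to the prior" form.
recon = sum(qz * math.log(px) for qz, px in zip(q, likelihood))
kl_to_prior = sum(qz * (math.log(qz) - math.log(pz)) for qz, pz in zip(q, prior))
elbo_alt = recon - kl_to_prior

# The gap: KL between q and the true posterior.
gap = sum(qz * (math.log(qz) - math.log(pz)) for qz, pz in zip(q, posterior))

print("log p(x)   =", math.log(evidence))
print("ELBO + gap =", elbo + gap)
print("two ELBO forms agree:", abs(elbo - elbo_alt) < 1e-12)
```

The first two printed values should agree to floating-point precision, and `gap` stays positive because this `q` differs from the enumerated posterior.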

Alternative derivation: Jensen

A shorter route that gives the bound but doesn’t expose what the gap is:

$$
\log p_{θ}(x) = \log \int p_{θ}(x, z)\, dz = \log \mathbb{E}_{q_{ϕ}(z \mid x)}\!\left[\frac{p_{θ}(x, z)}{q_{ϕ}(z \mid x)}\right]
$$

Apply Jensen ($\log$ is concave, so $\log \mathbb{E}[\cdot] \geq \mathbb{E}[\log \cdot]$):

$$
\log p_{θ}(x) \geq \mathbb{E}_{q_{ϕ}}\!\left[\log \frac{p_{θ}(x, z)}{q_{ϕ}(z \mid x)}\right] = \mathrm{ELBO}
$$

Clean, but no gap term — for the gap, use the KL-identity derivation above.
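A small Python sketch of the Jensen step on a three-state toy model (the numbers are arbitrary assumptions). It computes both sides of the inequality from the ratio $p_{θ}(x, z) / q(z \mid x)$ and shows that the two sides meet when that ratio is constant in $z$, which happens when $q$ is the true posterior.

```python
import math

# Hypothetical toy model: three latent states, one fixed observation x.
prior = [0.5, 0.3, 0.2]          # p(z)
likelihood = [0.10, 0.60, 0.30]  # p(x | z) at the observed x
joint = [pz * px for pz, px in zip(prior, likelihood)]
evidence = sum(joint)

def jensen_sides(q):
    """Return (log E_q[ratio], E_q[log ratio], ratios) for ratio = p(x, z) / q(z)."""
    ratios = [j / qz for j, qz in zip(joint, q)]
    log_of_mean = math.log(sum(qz * r for qz, r in zip(q, ratios)))  # equals log p(x)
    mean_of_log = sum(qz * math.log(r) for qz, r in zip(q, ratios))  # equals the ELBO
    return log_of_mean, mean_of_log, ratios

# An arbitrary q: the inequality is strict.
lhs, rhs, _ = jensen_sides([0.2, 0.5, 0.3])
print("arbitrary q:", lhs, ">=", rhs)

# q equal to the true posterior: every ratio equals p(x), so Jensen is an equality.
posterior = [j / evidence for j in joint]
lhs_tight, rhs_tight, ratios_tight = jensen_sides(posterior)
print("posterior q:", lhs_tight, "vs", rhs_tight)
```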

The gap shrinks for free

The puzzling part: we want to minimize the gap $D_{\mathrm{KL}}(q_{ϕ}(z \mid x) \,\|\, p_{θ}(z \mid x))$, but $p_{θ}(z \mid x)$ is exactly the thing we can’t compute. How does optimization actually close it?

Look at gradients with respect to the encoder parameters $ϕ$ only (decoder $θ$ held fixed). Since $\log p_{θ}(x)$ has no $ϕ$ dependence:

$$
\frac{\partial}{\partial ϕ} \log p_{θ}(x) = 0
$$

From the identity:

$$
\frac{\partial\, \mathrm{ELBO}}{\partial ϕ} = -\frac{\partial\, D_{\mathrm{KL}}(q_{ϕ}(z \mid x) \,\|\, p_{θ}(z \mid x))}{\partial ϕ}
$$

Every gradient step on $ϕ$ that increases the ELBO is exactly a gradient step that decreases the KL gap by the same amount. You never compute the gap, but encoder optimization shrinks it anyway, as far as the family of $q_{ϕ}$ allows. That’s the “for free” part.
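Here is a minimal Python sketch of that argument, under several assumptions of mine: a three-state toy model with arbitrary numbers, $q_{ϕ}$ parameterized as a softmax over logits, and central finite differences standing in for autodiff. The update uses only the ELBO gradient; the true posterior appears solely to measure the gap afterwards.

```python
import math

# Hypothetical toy model (arbitrary numbers), decoder held fixed.
prior = [0.5, 0.3, 0.2]
likelihood = [0.10, 0.60, 0.30]
joint = [pz * px for pz, px in zip(prior, likelihood)]
evidence = sum(joint)
posterior = [j / evidence for j in joint]  # used only to *measure* the gap

def softmax(phi):
    m = max(phi)
    exps = [math.exp(v - m) for v in phi]
    total = sum(exps)
    return [e / total for e in exps]

def elbo(phi):
    q = softmax(phi)
    return sum(qz * (math.log(j) - math.log(qz)) for qz, j in zip(q, joint))

def gap(phi):
    q = softmax(phi)
    return sum(qz * (math.log(qz) - math.log(pz)) for qz, pz in zip(q, posterior))

def grad(f, phi, h=1e-6):
    """Central finite-difference gradient (a stand-in for autodiff)."""
    g = []
    for i in range(len(phi)):
        up, down = list(phi), list(phi)
        up[i] += h
        down[i] -= h
        g.append((f(up) - f(down)) / (2 * h))
    return g

phi = [0.0, 0.0, 0.0]
g_elbo, g_gap = grad(elbo, phi), grad(gap, phi)
print("dELBO/dphi + dKL/dphi:", [a + b for a, b in zip(g_elbo, g_gap)])

# Gradient ascent on the ELBO alone; the gap is recorded but never used in the update.
gaps = [gap(phi)]
for _ in range(200):
    step = grad(elbo, phi)
    phi = [p + 0.5 * s for p, s in zip(phi, step)]
    gaps.append(gap(phi))
print("gap before:", gaps[0], "gap after:", gaps[-1])
```

The gap goes essentially to zero here only because a softmax over three states can represent the true posterior exactly; a restricted family would stall at a positive gap.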

For the decoder parameters $θ$, both terms on the right change when you update: increasing the ELBO over $θ$ raises the lower bound on $\log p_{θ}(x)$, while the gap may shift either way. But in joint optimization, the encoder keeps closing the gap while the decoder improves the fit. This is the same identity that makes EM work; an exact E-step ($q \leftarrow p_{θ}(z \mid x)$) sets the gap to zero, so the ELBO equals the true log-likelihood at the current $θ$. See Variational inference for the EM-as-VI framing in full.

Why reverse KL

The chosen direction $D_{\mathrm{KL}}(q \,\|\, p(z \mid x))$ rather than $D_{\mathrm{KL}}(p(z \mid x) \,\|\, q)$ is forced by tractability; see Forward vs. Reverse KL Geometric Intuition for the mode-seeking vs mass-covering distinction. The short version: expectations under $q$ are computable because we sample from $q$; expectations under $p(z \mid x)$ are not because $p(z \mid x)$ is the intractable object. The forward KL has the right “covering” behavior in principle but is unusable here.
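A minimal Python sketch of “expectations under $q$ are computable because we sample from $q$”. The model is an assumption of mine, chosen because everything is known in closed form: $p(z) = \mathcal{N}(0, 1)$ and $p(x \mid z) = \mathcal{N}(z, 1)$, so $p(x) = \mathcal{N}(0, 2)$ and $p(z \mid x) = \mathcal{N}(x/2, 1/2)$. The estimator itself only draws from $q$ and evaluates the joint density; it never touches the posterior.

```python
import math
import random

def log_normal(value, mean, var):
    """Log density of a univariate Gaussian N(mean, var)."""
    return -0.5 * (math.log(2 * math.pi * var) + (value - mean) ** 2 / var)

def elbo_estimate(x, q_mean, q_std, n_samples=20000, seed=0):
    """Monte Carlo ELBO: average of log p(x, z) - log q(z) over samples z ~ q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(q_mean, q_std)
        log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)  # log p(z) + log p(x | z)
        total += log_joint - log_normal(z, q_mean, q_std ** 2)
    return total / n_samples

x = 1.5
log_evidence = log_normal(x, 0.0, 2.0)  # available only because the toy model is conjugate

loose = elbo_estimate(x, q_mean=0.0, q_std=1.0)               # q set to the prior
tight = elbo_estimate(x, q_mean=x / 2, q_std=math.sqrt(0.5))  # q set to the exact posterior

print("log p(x)          :", log_evidence)
print("ELBO, q = prior    :", loose)
print("ELBO, q = posterior:", tight)
```

With $q$ equal to the exact posterior, every sample contributes exactly $\log p(x)$, so the estimate matches the evidence with no Monte Carlo noise. A forward-KL estimator would instead need samples from $p(z \mid x)$, which is what a real model does not give you.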

Where it shows up

Variational Autoencoder (VAE): $q_{ϕ}$ is a neural encoder, $p_{θ}$ is a neural decoder, the ELBO is the training loss.
Amortized variational inference: the general framework VAE instantiates.
DDPM / diffusion: the noising chain $x_{0} \to x_{1} \to \dots \to x_{T}$ is a hierarchical latent variable model with frozen forward $q$ and learned reverse $p_{θ}$. The training objective is a per-step ELBO that reparameterizes to noise-prediction MSE (Ho et al. 2020). The score-regression view in Score matching is equivalent up to weighting.
Bayesian neural nets: $q_{ϕ}$ is a distribution over weights (mean-field Gaussian, MC dropout, normalizing flow) trained against a weight prior.
LDA / topic models: VI’s pre-deep-learning killer app. $q$ factors over per-document topic mixtures and per-word topic assignments.
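For the VAE entry in the list above, the KL term of the ELBO has a closed form under one common choice that this note does not spell out, so treat it as an assumption: a diagonal Gaussian encoder $q_{ϕ}(z \mid x) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^{2}))$ and a standard normal prior. A minimal Python sketch follows; the function names and the `recon_log_lik` argument (a reconstruction log-likelihood estimate the caller would get from the decoder) are hypothetical.

```python
import math

def kl_diag_gaussian_to_std_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def negative_elbo(recon_log_lik, mu, log_var):
    """Loss to minimize for one datapoint: -(E_q[log p(x | z)] - KL(q || prior)).

    recon_log_lik is assumed to be an estimate of E_q[log p(x | z)],
    e.g. from a single sample z ~ q passed through the decoder.
    """
    return -(recon_log_lik - kl_diag_gaussian_to_std_normal(mu, log_var))

# An encoder output equal to the prior gives zero KL.
kl_at_prior = kl_diag_gaussian_to_std_normal([0.0, 0.0], [0.0, 0.0])

# Shifting one latent mean by 1 costs 0.5 nats of KL.
kl_shifted = kl_diag_gaussian_to_std_normal([1.0, 0.0], [0.0, 0.0])
loss = negative_elbo(recon_log_lik=-3.2, mu=[1.0, 0.0], log_var=[0.0, 0.0])

print(kl_at_prior, kl_shifted, loss)
```

Training code usually minimizes this negative ELBO rather than maximizing the ELBO directly, which is a sign convention and not a different objective.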
