This note is from a discussion with Claude Opus 4.7 while reading a VAE tutorial.


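The ELBO (evidence lower bound) is the universal objective of variational inference. Given a latent variable model $p_\theta(x, z) = p_\theta(x \mid z)\,p(z)$ with intractable posterior $p_\theta(z \mid x)$, the ELBO is a tractable lower bound on $\log p_\theta(x)$ that we can actually optimize.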

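Definition

With approximate posterior $q_\phi(z \mid x)$:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x, z) - \log q_\phi(z \mid x)\big]$$

Equivalent rewriting:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$

(Use $\log p_\theta(x, z) = \log p_\theta(x \mid z) + \log p(z)$ and pull the $\log p(z)$ and $\log q_\phi(z \mid x)$ terms together into the KL.)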
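The two forms of the ELBO can be checked against each other numerically. The following is a minimal sketch on a hypothetical toy model (a latent with three values and one fixed observation); the numbers in `prior`, `lik`, and `q` are made up for illustration and do not come from the tutorial.

```python
import math

# Hypothetical toy model: z takes 3 values, x is a single fixed observation.
prior = [0.5, 0.3, 0.2]    # p(z)
lik = [0.10, 0.40, 0.70]   # p(x | z) evaluated at the observed x
q = [0.2, 0.3, 0.5]        # q(z | x), any valid distribution with full support

def elbo_joint(q, prior, lik):
    # E_q[ log p(x, z) - log q(z | x) ]
    return sum(qz * (math.log(pz * lz) - math.log(qz))
               for qz, pz, lz in zip(q, prior, lik))

def elbo_recon_minus_kl(q, prior, lik):
    # E_q[ log p(x | z) ] - KL( q(z | x) || p(z) )
    recon = sum(qz * math.log(lz) for qz, lz in zip(q, lik))
    kl_to_prior = sum(qz * math.log(qz / pz) for qz, pz in zip(q, prior))
    return recon - kl_to_prior

elbo_a = elbo_joint(q, prior, lik)
elbo_b = elbo_recon_minus_kl(q, prior, lik)
print(elbo_a, elbo_b)  # the two values agree up to floating-point error
```

The second form is the one usually implemented in practice, since it separates a reconstruction term from a regularizer toward the prior.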

The identity

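Start from $\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$ and expand using Bayes’ rule $p_\theta(z \mid x) = p_\theta(x, z) / p_\theta(x)$:

$$
\begin{aligned}
\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)
&= \mathbb{E}_{q_\phi(z \mid x)}\big[\log q_\phi(z \mid x) - \log p_\theta(z \mid x)\big] \\
&= \mathbb{E}_{q_\phi(z \mid x)}\big[\log q_\phi(z \mid x) - \log p_\theta(x, z)\big] + \log p_\theta(x) \\
&= -\mathcal{L}(\theta, \phi; x) + \log p_\theta(x)
\end{aligned}
$$

Rearranging gives the central identity:

$$\log p_\theta(x) = \mathcal{L}(\theta, \phi; x) + \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$$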

Three facts fall out:

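  1. Since KL is nonnegative, $\log p_\theta(x) \ge \mathcal{L}(\theta, \phi; x)$, hence “lower bound.”
  2. The gap is the KL between the approximate and true posterior, $\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$.
  3. The bound is tight when $q_\phi(z \mid x) = p_\theta(z \mid x)$.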

This is the form to keep in your head: “evidence = ELBO + gap” decomposes the intractable left side $\log p_\theta(x)$ into one computable piece (the ELBO) and one nonnegative piece (the gap) that vanishes when $q_\phi(z \mid x)$ matches the true posterior.
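The “evidence = ELBO + gap” decomposition can be verified directly when the latent is small enough to enumerate. This is a sketch under the assumption of a hypothetical three-state latent with made-up probabilities; here the true posterior is computable, which is exactly what is not the case in real models.

```python
import math

# Hypothetical toy model (illustrative numbers only).
prior = [0.5, 0.3, 0.2]    # p(z)
lik = [0.10, 0.40, 0.70]   # p(x | z) at the observed x
joint = [pz * lz for pz, lz in zip(prior, lik)]   # p(x, z)
evidence = sum(joint)                             # p(x), tractable only in toy cases
posterior = [j / evidence for j in joint]         # p(z | x)

def elbo(q):
    # E_q[ log p(x, z) - log q(z) ]
    return sum(qz * math.log(jz / qz) for qz, jz in zip(q, joint))

def kl(q, p):
    # KL(q || p) for discrete distributions with full support
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p))

q = [0.2, 0.3, 0.5]
log_evidence = math.log(evidence)
gap = kl(q, posterior)

print(log_evidence, elbo(q) + gap)   # identity: these agree
print(elbo(posterior))               # tight bound: equals log_evidence
```

Setting `q` to the exact posterior makes the gap zero, so the ELBO coincides with the log evidence.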

Alternative derivation: Jensen

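A shorter route that gives the bound but doesn’t expose what the gap is:

$$\log p_\theta(x) = \log \int p_\theta(x, z)\,dz = \log \mathbb{E}_{q_\phi(z \mid x)}\!\left[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]$$

Apply Jensen ($\log$ is concave, so $\log \mathbb{E}[Y] \ge \mathbb{E}[\log Y]$):

$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] = \mathcal{L}(\theta, \phi; x)$$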

Clean, but no gap term — for the gap, use the KL-identity derivation above.
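The Jensen step can also be seen on samples. The sketch below draws latents from a made-up `q`, forms the importance weights $p(x, z)/q(z)$, and compares the log of their mean (an estimate of the log evidence) with the mean of their logs (an estimate of the ELBO). The toy model, sample size, and seed are all assumptions chosen for illustration.

```python
import math
import random

# Hypothetical toy model (illustrative numbers only).
prior = [0.5, 0.3, 0.2]    # p(z)
lik = [0.10, 0.40, 0.70]   # p(x | z) at the observed x
q = [0.2, 0.3, 0.5]        # q(z | x)
joint = [pz * lz for pz, lz in zip(prior, lik)]

rng = random.Random(0)
samples = rng.choices(range(len(q)), weights=q, k=1000)   # z ~ q(z | x)

# Importance weights w = p(x, z) / q(z | x) for each sampled z.
weights = [joint[z] / q[z] for z in samples]

log_mean_w = math.log(sum(weights) / len(weights))             # estimates log p(x)
mean_log_w = sum(math.log(w) for w in weights) / len(weights)  # estimates the ELBO

print(log_mean_w, mean_log_w)
```

Because Jensen's inequality also holds for the empirical distribution of the samples, `log_mean_w >= mean_log_w` for any draw, not just in expectation.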

The gap shrinks for free

The puzzling part: we want to minimize the gap $\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$, but $p_\theta(z \mid x)$ is exactly the thing we can’t compute. How does optimization actually close it?

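Look at gradients with respect to the encoder parameters $\phi$ only (decoder $\theta$ held fixed). Since $\log p_\theta(x)$ has no $\phi$ dependence:

$$\nabla_\phi \log p_\theta(x) = 0$$

From the identity $\log p_\theta(x) = \mathcal{L}(\theta, \phi; x) + \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$:

$$\nabla_\phi \mathcal{L}(\theta, \phi; x) = -\nabla_\phi \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$$

Every gradient step on $\phi$ that increases the ELBO is exactly a gradient step that decreases the KL gap. You never compute the gap, but encoder optimization closes it anyway. That’s the “for free” part.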
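The gradient relationship can be checked with finite differences. The sketch below assumes a hypothetical encoder that is just a softmax over three logits `phi` (no neural network) and a made-up toy model; `grad` is a simple central-difference helper, not a library function.

```python
import math

# Hypothetical toy model with the decoder held fixed (illustrative numbers only).
prior = [0.5, 0.3, 0.2]
lik = [0.10, 0.40, 0.70]
joint = [pz * lz for pz, lz in zip(prior, lik)]
evidence = sum(joint)
posterior = [j / evidence for j in joint]

def softmax(phi):
    m = max(phi)
    e = [math.exp(v - m) for v in phi]
    s = sum(e)
    return [v / s for v in e]

def elbo(phi):
    q = softmax(phi)   # q_phi(z | x)
    return sum(qz * math.log(jz / qz) for qz, jz in zip(q, joint))

def kl_gap(phi):
    q = softmax(phi)
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, posterior))

def grad(f, phi, eps=1e-5):
    # Central finite differences, one coordinate at a time.
    g = []
    for i in range(len(phi)):
        up = list(phi); up[i] += eps
        dn = list(phi); dn[i] -= eps
        g.append((f(up) - f(dn)) / (2 * eps))
    return g

phi = [0.0, 0.0, 0.0]
g_elbo = grad(elbo, phi)
g_kl = grad(kl_gap, phi)
print(g_elbo, g_kl)   # componentwise negatives of each other

# One small ascent step on the ELBO, computed without ever touching the posterior.
lr = 0.1
phi_new = [p + lr * g for p, g in zip(phi, g_elbo)]
print(kl_gap(phi), kl_gap(phi_new))   # the gap shrinks
```

Only `elbo` is needed to take the step; `kl_gap` is evaluated here purely to confirm that the gap went down.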

For the decoder parameters $\theta$, both terms on the right of the identity change when you update $\theta$. Increasing the ELBO over $\theta$ pushes up the lower bound on $\log p_\theta(x)$ and may shift the gap either way. But in joint optimization, the encoder keeps closing the gap while the decoder improves the fit. This is the same identity that makes EM work: an exact E-step ($q(z) = p_\theta(z \mid x)$) sets the gap to zero, and the ELBO then equals the true log-likelihood at the current $\theta$. See Variational inference for the EM-as-VI framing in full.

Why reverse KL

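The chosen direction $\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$ rather than $\mathrm{KL}\big(p_\theta(z \mid x) \,\|\, q_\phi(z \mid x)\big)$ is forced by tractability; see Forward vs. Reverse KL Geometric Intuition for the mode-seeking vs mass-covering distinction. The short version: expectations under $q_\phi(z \mid x)$ are computable because we sample from $q_\phi(z \mid x)$; expectations under $p_\theta(z \mid x)$ are not because $p_\theta(z \mid x)$ is the intractable object. The forward KL has the right “covering” behavior in principle but is unusable here.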

Where it shows up

  • Variational Autoencoder (VAE): $q_\phi(z \mid x)$ is a neural encoder, $p_\theta(x \mid z)$ is a neural decoder, and the negative ELBO is the training loss.
  • Amortized variational inference: the general framework VAE instantiates.
  • DDPM / diffusion: the noising chain is a hierarchical latent variable model with frozen forward $q(x_t \mid x_{t-1})$ and learned reverse $p_\theta(x_{t-1} \mid x_t)$. The training objective is a per-step ELBO that reparameterizes to noise-prediction MSE (Ho et al. 2020). The score-regression view in Score matching is equivalent up to weighting.
  • Bayesian neural nets: $q_\phi(w)$ is a distribution over weights (mean-field Gaussian, MC dropout, normalizing flow) trained against a weight prior.
  • LDA / topic models: VI’s pre-deep-learning killer app. $q$ factors over per-document topic mixtures and per-word topic assignments.
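For the VAE item in the list above, the KL term of the ELBO has a closed form under a common setup that this note does not spell out: a diagonal Gaussian encoder and a standard normal prior. The sketch below assumes exactly that setup; the function name `kl_gaussian_to_std_normal` and the `log_var` parameterization are illustrative choices, not taken from the tutorial.

```python
import math

def kl_gaussian_to_std_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions.

    Assumes a diagonal Gaussian q and a standard normal prior.
    """
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

# q equal to the prior: no penalty.
kl_zero = kl_gaussian_to_std_normal([0.0, 0.0], [0.0, 0.0])

# Shifting one mean by 1 with unit variance costs 0.5 nats.
kl_shifted = kl_gaussian_to_std_normal([1.0, 0.0], [0.0, 0.0])

print(kl_zero, kl_shifted)
```

With this term available in closed form, only the reconstruction expectation $\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]$ needs to be estimated by sampling.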