Variational Inference

This note is from a discussion with Claude Opus 4.7 while reading a VAE tutorial.

A framework for turning inference into optimization. The cleanest way in is through EM: VI is what EM becomes when you can’t do the exact E-step.

The problem

Given a probabilistic model $p(x, z)$ with observed $x$ and latent $z$, Bayesian inference wants the posterior:

$$p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)} = \frac{p(x \mid z)\, p(z)}{\int p(x \mid z)\, p(z)\, dz}$$

The denominator is intractable for most interesting models — high-dimensional integrals, mixed discrete/continuous structure, sometimes infinite-dimensional latents. Exact inference is off the table.
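To make the denominator concrete, here is a minimal sketch for a case where it is still tractable. It assumes a toy model that is not from the tutorial: one binary latent $z$ with a uniform prior and a unit-variance Gaussian likelihood centered at $-1$ or $+1$. The names `prior`, `means`, and `normal_pdf` are illustrative.

```python
import math

def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

prior = [0.5, 0.5]     # p(z) for z in {0, 1} (assumed toy prior)
means = [-1.0, 1.0]    # p(x | z) = N(means[z], 1) (assumed toy likelihood)
x = 0.3                # the observation

# Numerator of Bayes' rule for each value of z.
joint = [prior[k] * normal_pdf(x, means[k], 1.0) for k in range(2)]

# Denominator: the evidence p(x). With a discrete latent the integral is a sum.
evidence = sum(joint)

posterior = [j / evidence for j in joint]
print(posterior)
```

With one binary latent the sum has two terms. With $K$ binary latents it has $2^K$, which is the kind of blow-up that takes exact inference off the table.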

Detour: EM as coordinate ascent on the ELBO

EM is usually taught as “alternate two steps until convergence,” but the structural reason it works is that both steps are coordinate ascent on the same objective. Recall the identity:

$$\log p_{θ}(x) = \mathrm{ELBO}(q, θ) + D_{\mathrm{KL}}\big(q(z) \,\|\, p_{θ}(z \mid x)\big)$$
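The identity follows in a few lines once the ELBO is written out. As a sketch, assume the standard definition $\mathrm{ELBO}(q, θ) = \mathbb{E}_{q(z)}[\log p_{θ}(x, z) - \log q(z)]$ (the note uses the ELBO without stating it) and use $p_{θ}(x) = p_{θ}(x, z) / p_{θ}(z \mid x)$, which holds for every $z$:

$$
\begin{aligned}
\log p_{θ}(x) &= \mathbb{E}_{q(z)}\big[\log p_{θ}(x)\big] \\
&= \mathbb{E}_{q(z)}\left[\log \frac{p_{θ}(x, z)}{p_{θ}(z \mid x)}\right] \\
&= \mathbb{E}_{q(z)}\left[\log \frac{p_{θ}(x, z)}{q(z)}\right] + \mathbb{E}_{q(z)}\left[\log \frac{q(z)}{p_{θ}(z \mid x)}\right] \\
&= \mathrm{ELBO}(q, θ) + D_{\mathrm{KL}}\big(q(z) \,\|\, p_{θ}(z \mid x)\big)
\end{aligned}
$$

Since the KL term is non-negative, the ELBO is a lower bound on $\log p_{θ}(x)$, with equality exactly when $q(z) = p_{θ}(z \mid x)$.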

EM is coordinate ascent on the right-hand side:

E-step: maximize over $q$ with $θ$ fixed. If $q$ is unconstrained, the max sets $q = p_{θ}(z \mid x)$: the gap drops to zero and $\mathrm{ELBO} = \log p_{θ}(x)$ exactly.
M-step: maximize over $θ$ with $q$ fixed. ELBO goes up (or stays).

This is why EM monotonically improves the likelihood:

After E-step: $\mathrm{ELBO}(q_{new}, θ_{old}) = \log p_{θ_{old}}(x)$ exactly (gap closed).
After M-step: $\mathrm{ELBO}(q_{new}, θ_{new}) \geq \mathrm{ELBO}(q_{new}, θ_{old})$.
Always: $\log p_{θ_{new}}(x) \geq \mathrm{ELBO}(q_{new}, θ_{new})$.

Chained: $\log p_{θ_{new}}(x) \geq \log p_{θ_{old}}(x)$, so the likelihood never decreases. The ELBO is a lower bound that touches the true log-likelihood at every E-step, so M-step improvements transfer directly to the real objective. That’s the structural reason EM works, beyond “alternate A then B.”
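The chain of inequalities can be checked numerically. The sketch below runs EM on a two-component 1D Gaussian mixture and records all four quantities at every iteration. It assumes unit variances held fixed (only the weights and means are learned) and synthetic data. Every name here is illustrative, not from the tutorial.

```python
import math
import random

def log_normal(x, mean):
    """Log density of N(mean, 1) at x (unit variance is an assumption of this sketch)."""
    return -0.5 * (x - mean) ** 2 - 0.5 * math.log(2 * math.pi)

def log_likelihood(data, weights, means):
    return sum(
        math.log(sum(w * math.exp(log_normal(x, m)) for w, m in zip(weights, means)))
        for x in data
    )

def e_step(data, weights, means):
    """Exact posterior over the component of each point (the responsibilities)."""
    resp = []
    for x in data:
        joint = [w * math.exp(log_normal(x, m)) for w, m in zip(weights, means)]
        norm = sum(joint)
        resp.append([j / norm for j in joint])
    return resp

def elbo(data, resp, weights, means):
    """E_q[log p(x, z) - log q(z)], summed over datapoints."""
    total = 0.0
    for x, r in zip(data, resp):
        for k, r_k in enumerate(r):
            if r_k > 0.0:  # the term vanishes as r_k -> 0
                total += r_k * (math.log(weights[k]) + log_normal(x, means[k]) - math.log(r_k))
    return total

def m_step(data, resp):
    """Closed-form maximizer of the ELBO over weights and means with q fixed."""
    counts = [sum(r[k] for r in resp) for k in range(len(resp[0]))]
    weights = [c / len(data) for c in counts]
    means = [sum(r[k] * x for r, x in zip(resp, data)) / counts[k] for k in range(len(counts))]
    return weights, means

rng = random.Random(0)
data = [rng.gauss(-2.0, 1.0) for _ in range(100)] + [rng.gauss(2.0, 1.0) for _ in range(100)]
weights, means = [0.5, 0.5], [-0.5, 0.5]

history = []
for _ in range(20):
    ll_old = log_likelihood(data, weights, means)
    resp = e_step(data, weights, means)
    elbo_after_e = elbo(data, resp, weights, means)   # should equal ll_old
    weights, means = m_step(data, resp)
    elbo_after_m = elbo(data, resp, weights, means)   # should not be lower
    ll_new = log_likelihood(data, weights, means)     # should bound elbo_after_m
    history.append((ll_old, elbo_after_e, elbo_after_m, ll_new))

print(means, history[-1])
```

Each tuple in `history` should read left to right as equal, then non-decreasing, then non-decreasing, up to floating-point error.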

k-means is a degenerate case: EM on an isotropic GMM in the limit $σ \to 0$ . The posterior over cluster assignments collapses to a point mass on the nearest centroid (hard instead of soft assignment), and the M-step becomes “centroid = mean of assigned points.” All of k-means’s pathologies (local minima, init sensitivity) are inherited from EM, with extra rigidity from hard assignments.
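A small sketch of that limit, assuming equal mixing weights and a shared variance $σ^2$; the centroids and the query point are made-up values. As $σ$ shrinks, the soft responsibilities collapse onto the nearest centroid.

```python
import math

def responsibilities(x, centroids, sigma):
    """Posterior over clusters for an isotropic GMM with equal weights.

    r_k is proportional to exp(-(x - c_k)^2 / (2 sigma^2)); subtracting the
    largest logit before exponentiating keeps the computation stable.
    """
    logits = [-((x - c) ** 2) / (2 * sigma ** 2) for c in centroids]
    top = max(logits)
    unnorm = [math.exp(l - top) for l in logits]
    total = sum(unnorm)
    return [u / total for u in unnorm]

centroids = [0.0, 1.0, 3.0]   # hypothetical cluster centers
x = 0.6                       # nearest centroid is 1.0 (index 1)

soft = responsibilities(x, centroids, sigma=1.0)    # EM-style soft assignment
hard = responsibilities(x, centroids, sigma=0.01)   # effectively one-hot
print(soft, hard)
```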

VI as EM with a restricted $q$

The exact E-step requires $p_{θ} (z ∣ x)$ — but that’s the intractable thing. VI’s move: pick a tractable family $Q$ and settle for the best $q$ within it:

$$q^{*}(z) = \arg\min_{q \in Q} D_{\mathrm{KL}}\big(q(z) \,\|\, p_{θ}(z \mid x)\big) = \arg\max_{q \in Q} \mathrm{ELBO}(q, θ)$$

The gap no longer closes in general: unless the true posterior happens to lie in $Q$, there is a residual $D_{\mathrm{KL}}\big(q^{*} \,\|\, p_{θ}(z \mid x)\big) > 0$, so the ELBO stays a strict lower bound. M-step improvements still raise the bound, but the likelihood guarantee weakens: you are now optimizing a surrogate, and an increase in the ELBO no longer implies an increase in $\log p_{θ}(x)$. The richer $Q$ is, the closer this gets to exact EM; the more restricted it is, the more tractable the optimization.

Every variant of VI is a different choice of $Q$ and a different way to do the maximization.
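The residual gap is easy to see on a problem small enough to enumerate. The sketch below assumes a made-up joint table over two binary latents for one fixed observation, takes $Q$ to be products of two Bernoullis, and finds the best member by brute-force grid search (a stand-in for a real optimizer). The table values and all names are illustrative.

```python
import math
from itertools import product

# Hypothetical values of p(x, z1, z2) for one fixed observation x.
joint = {(0, 0): 0.30, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.20}
evidence = sum(joint.values())                        # p(x)
posterior = {z: p / evidence for z, p in joint.items()}

def q_table(a, b):
    """Factorized q(z1, z2) = Bernoulli(a) * Bernoulli(b)."""
    return {
        (z1, z2): (a if z1 else 1 - a) * (b if z2 else 1 - b)
        for z1, z2 in product((0, 1), repeat=2)
    }

def elbo(q):
    return sum(q[z] * (math.log(joint[z]) - math.log(q[z])) for z in q)

def kl(q, p):
    return sum(q[z] * (math.log(q[z]) - math.log(p[z])) for z in q)

# Best q inside the restricted family, by exhaustive search on a grid.
grid = [i / 100 for i in range(1, 100)]
best_a, best_b = max(product(grid, grid), key=lambda ab: elbo(q_table(*ab)))
q_star = q_table(best_a, best_b)

gap = math.log(evidence) - elbo(q_star)   # equals KL(q_star || posterior)
print(best_a, best_b, gap, kl(q_star, posterior))
```

Because this posterior is correlated, no product distribution matches it and `gap` stays strictly positive. Plugging in the unrestricted `posterior` as $q$ gives an ELBO equal to `math.log(evidence)`, which is the exact E-step.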

Variants

Mean-field VI. Factorize $q(z) = \prod_{i} q_{i}(z_{i})$, i.e. assume the latents are independent under $q$. Coordinate-ascent updates have closed forms for conjugate models. Cheap, but the factorization ignores posterior correlations between latents, which tends to underestimate posterior variance.
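The variance underestimate has a closed form when the target is Gaussian. As a sketch, assume the posterior is a 2D Gaussian with correlation `rho` (a made-up target). The optimal mean-field factors are then Gaussians whose variances are the inverse diagonal of the precision matrix, and the coordinate-ascent updates for the means are linear. This is the standard textbook result, not something taken from the tutorial.

```python
rho = 0.9
mu = [1.0, -1.0]                      # true posterior mean (assumed)
cov = [[1.0, rho], [rho, 1.0]]        # true posterior covariance (assumed)

# Invert the 2x2 covariance by hand to get the precision matrix.
det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
precision = [
    [cov[1][1] / det, -cov[0][1] / det],
    [-cov[1][0] / det, cov[0][0] / det],
]

# Optimal mean-field variances: 1 / precision_ii, not the marginal cov_ii.
mean_field_var = [1.0 / precision[0][0], 1.0 / precision[1][1]]
true_marginal_var = [cov[0][0], cov[1][1]]

# Coordinate ascent on the factor means: each update conditions on the
# current mean of the other factor.
m = [0.0, 0.0]
for _ in range(200):
    m[0] = mu[0] - (precision[0][1] / precision[0][0]) * (m[1] - mu[1])
    m[1] = mu[1] - (precision[1][0] / precision[1][1]) * (m[0] - mu[0])

print(m, mean_field_var, true_marginal_var)
```

The means converge to the true means, but each factor's variance is $1 - ρ^2$ instead of $1$, and the shortfall grows as the correlation increases.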

Structured VI. Allow some dependence in $q$ but not full — tree-structured, chain-structured. More expressive than mean-field, more expensive.

Amortized variational inference. Instead of separately fitting $q^{(i)}(z)$ for every datapoint, share a neural network with parameters $ϕ$ that maps $x \mapsto q_{ϕ}(z \mid x)$. The VAE encoder is the canonical example.
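A toy sketch of amortization with a single shared parameter standing in for the network. It assumes a linear-Gaussian model that is not from the tutorial: $z \sim N(0, 1)$ and $x \mid z \sim N(z, s^2)$, whose exact posterior is $N\big(x / (1 + s^2),\; s^2 / (1 + s^2)\big)$. The "encoder" is $q(z \mid x) = N(a x, v)$ with one slope `a` trained on the dataset-average ELBO; `noise_var`, `encode`, and the step size are illustrative choices.

```python
import random

noise_var = 0.5   # s^2, the assumed likelihood variance

def elbo_grad_slope(a, xs):
    """Derivative w.r.t. a of the dataset-average ELBO for q(z | x) = N(a * x, v).

    Only the terms -(x - a x)^2 / (2 s^2) - (a x)^2 / 2 depend on a.
    """
    return sum(x * x * ((1 - a) / noise_var - a) for x in xs) / len(xs)

rng = random.Random(0)
# Under this model the marginal of x is N(0, 1 + s^2).
xs = [rng.gauss(0.0, (1 + noise_var) ** 0.5) for _ in range(500)]

a = 0.0
for _ in range(500):
    a += 0.05 * elbo_grad_slope(a, xs)   # plain gradient ascent on the shared parameter

def encode(x):
    """Amortized posterior for any x, with no per-datapoint optimization."""
    posterior_var = noise_var / (1 + noise_var)   # set to the known optimum for brevity
    return a * x, posterior_var

print(a, encode(3.0))
```

The learned slope matches the exact posterior's $1 / (1 + s^2)$, so `encode` returns the right posterior mean for an unseen `x` in one function call. That is the point of amortization, and a VAE encoder does the same with a neural network in place of `a`.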

Stochastic VI (SVI). Minibatch gradients on the ELBO instead of full-batch coordinate ascent. The thing that made VI work on web-scale topic models.
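What makes minibatching legitimate is that a rescaled minibatch sum is an unbiased estimate of the full-data sum. The sketch below uses random stand-in numbers for the per-datapoint ELBO terms (no actual model is fitted) and checks that the $N / |B|$ scaling averages out to the full objective.

```python
import random

rng = random.Random(0)
# Stand-ins for the per-datapoint terms of a data-dependent ELBO sum.
per_point_terms = [rng.gauss(0.0, 1.0) for _ in range(1000)]
full_sum = sum(per_point_terms)

N, B = len(per_point_terms), 50

def minibatch_estimate(batch):
    """Scale the minibatch sum by N / |B| so its expectation is the full sum."""
    return (N / len(batch)) * sum(batch)

# Averaging the estimate over every batch of a partition recovers the full sum.
batches = [per_point_terms[i:i + B] for i in range(0, N, B)]
estimates = [minibatch_estimate(b) for b in batches]
mean_estimate = sum(estimates) / len(estimates)

print(full_sum, mean_estimate)
```

Any single estimate is noisy, which is why SVI pairs it with a decaying step size or an adaptive optimizer.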

Black-box VI (BBVI). Generic gradient estimators (the score-function estimator or the reparameterization trick) so you don’t need conjugacy. Combined with a neural-network $q$, this is what powers modern deep latent variable models.
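The two estimators can be compared on a problem with a known answer. As a sketch, take $q(z) = N(μ, 1)$ and the test function $f(z) = z^2$ (a stand-in for the ELBO integrand), so that $\nabla_{μ} \mathbb{E}_q[f(z)] = 2μ$. The sample size, seed, and names are arbitrary choices.

```python
import random
import statistics

def f(z):
    return z * z

def df(z):
    return 2 * z

mu = 1.5
rng = random.Random(0)
eps = [rng.gauss(0.0, 1.0) for _ in range(20000)]
zs = [mu + e for e in eps]   # z = mu + eps, i.e. samples from N(mu, 1)

# Score-function estimator: f(z) * d/dmu log q(z), where d/dmu log q(z) = z - mu.
# It only needs to evaluate f, never to differentiate it.
score_samples = [f(z) * (z - mu) for z in zs]

# Reparameterization estimator: differentiate f(mu + eps) through the sample,
# giving f'(z) * dz/dmu = f'(z).
reparam_samples = [df(z) for z in zs]

score_grad = sum(score_samples) / len(score_samples)
reparam_grad = sum(reparam_samples) / len(reparam_samples)

print(score_grad, reparam_grad, 2 * mu)
print(statistics.variance(score_samples), statistics.variance(reparam_samples))
```

Both estimators are unbiased for $2μ$, but on this example the per-sample variance of the score-function estimator is much larger. The trade-off is that reparameterization needs a differentiable $f$ and a sampler that can be written as a differentiable transform of noise.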

Examples beyond VAE

LDA / topic models: mean-field VI over per-document topic mixtures and per-word topic assignments. The application that put VI on the map.
Bayesian neural nets: $q$ over weights (mean-field Gaussian, MC dropout, normalizing flow).
State-space models: VI as an alternative to EKF/particle filters for nonlinear/non-Gaussian dynamics.
Stochastic block models, mixed-membership models, HDPs: pre-deep-learning Bayesian nonparametrics relied heavily on VI.
DDPM / diffusion: the training objective is an ELBO on a hierarchical model with $T$ latents, which decomposes into per-step terms. The equivalent view via score matching regresses the score field directly.

Yanda's Random Notes
