This note comes from a discussion with Claude Opus 4.7 while reading a VAE tutorial.


The reparameterization trick is what makes VAE training work end-to-end: it lets gradients flow through a sampling step.

The problem

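We want to maximize an objective of the form:

$$\mathcal{L}(\phi) = \mathbb{E}_{z \sim q_\phi(z)}\left[f(z)\right]$$

Gradient-based optimization needs $\nabla_\phi \mathcal{L}$. But $\phi$ controls the distribution we sample from, not the function inside, so:

$$\nabla_\phi \, \mathbb{E}_{z \sim q_\phi(z)}\left[f(z)\right] \neq \mathbb{E}_{z \sim q_\phi(z)}\left[\nabla_\phi f(z)\right]$$

The expectation itself depends on $\phi$ through the sampling density, so the gradient doesn’t commute with the expectation.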
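
A small worked example makes the failure concrete. The choice of $f$ and of the Gaussian below is an assumption made purely for illustration, with $\phi$ denoting the mean of the sampling distribution. Take $f(z) = z^2$ and $z \sim \mathcal{N}(\phi, 1)$:

$$\mathbb{E}_{z \sim \mathcal{N}(\phi, 1)}\left[z^2\right] = \phi^2 + 1 \quad\Rightarrow\quad \nabla_\phi \, \mathbb{E}\left[z^2\right] = 2\phi, \qquad \text{but} \qquad \mathbb{E}_{z \sim \mathcal{N}(\phi, 1)}\left[\nabla_\phi z^2\right] = 0$$

The right-hand quantity vanishes because $z^2$ has no explicit dependence on $\phi$ once $z$ is treated as an already-drawn sample; all of the dependence sits in the density.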

The trick

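If $z$ can be written as a deterministic function of $\phi$ and a $\phi$-free noise variable $\epsilon$:

$$z = g_\phi(\epsilon), \qquad \epsilon \sim p(\epsilon)$$

then the expectation rewrites with the fixed base distribution outside:

$$\mathbb{E}_{z \sim q_\phi(z)}\left[f(z)\right] = \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[f(g_\phi(\epsilon))\right]$$

Now the gradient passes through cleanly:

$$\nabla_\phi \, \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[f(g_\phi(\epsilon))\right] = \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[\nabla_\phi f(g_\phi(\epsilon))\right]$$

A Monte Carlo estimate: sample $\epsilon \sim p(\epsilon)$, compute $\nabla_\phi f(g_\phi(\epsilon))$ via autograd, done.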
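
The recipe can be checked numerically with a minimal sketch. Everything specific here is an assumption for illustration: the toy function f(z) = z^2, the Gaussian with parameters (mu, sigma), and the hand-written chain rule standing in for autograd so that the example runs on the standard library alone.

```python
import random

def pathwise_grad(mu, sigma, n_samples=100_000, seed=0):
    """Monte Carlo pathwise gradient of E[z^2] for z = mu + sigma * eps."""
    rng = random.Random(seed)
    g_mu = 0.0
    g_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)   # parameter-free noise
        z = mu + sigma * eps        # deterministic function of (mu, sigma, eps)
        df_dz = 2.0 * z             # derivative of the toy f(z) = z^2
        g_mu += df_dz * 1.0         # chain rule: dz/dmu = 1
        g_sigma += df_dz * eps      # chain rule: dz/dsigma = eps
    return g_mu / n_samples, g_sigma / n_samples

mu, sigma = 1.5, 0.8
g_mu, g_sigma = pathwise_grad(mu, sigma)
# Analytic reference: E[z^2] = mu^2 + sigma^2, so the gradient is (2*mu, 2*sigma).
print("d/dmu   estimate vs exact:", g_mu, 2 * mu)
print("d/dsigma estimate vs exact:", g_sigma, 2 * sigma)
```

With an autograd framework the two chain-rule lines disappear: the sample is built from the parameters and the noise, and the backward pass does the rest.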

Gaussian case (the VAE one)

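For $q_\phi(z \mid x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$:

$$z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

The encoder outputs $\mu$ and $\log \sigma^2$ (logvar, for numerical stability: exponentiating keeps $\sigma$ positive without constraints). The sample $z$ is a deterministic function of $\mu$, $\sigma$, $\epsilon$; backprop flows through $\mu$ and $\sigma$ into the encoder.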

This is why VAE encoder heads output two things rather than a sample.
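
A minimal sketch of the Gaussian case with the logvar parameterization follows. The function names, the two-dimensional example values, and the toy downstream loss L = sum(z_i^2) are all hypothetical; a real VAE would get the backward pass from autograd rather than from the hand-written chain rule used here.

```python
import math
import random

def reparameterize(mu, logvar, eps):
    """z = mu + exp(0.5 * logvar) * eps, elementwise over plain Python lists."""
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, logvar, eps)]

def reparameterize_backward(grad_z, logvar, eps):
    """Chain rule by hand: given dL/dz, return (dL/dmu, dL/dlogvar)."""
    grad_mu = list(grad_z)  # dz/dmu = 1
    grad_logvar = [
        g * 0.5 * math.exp(0.5 * lv) * e  # dz/dlogvar = 0.5 * sigma * eps
        for g, lv, e in zip(grad_z, logvar, eps)
    ]
    return grad_mu, grad_logvar

rng = random.Random(0)
mu, logvar = [0.3, -1.2], [-0.5, 0.1]     # hypothetical encoder outputs
eps = [rng.gauss(0.0, 1.0) for _ in mu]   # noise is drawn independently of the parameters
z = reparameterize(mu, logvar, eps)
grad_z = [2.0 * zi for zi in z]           # gradient of the toy loss L = sum(z_i^2)
grad_mu, grad_logvar = reparameterize_backward(grad_z, logvar, eps)
print("z:", z)
print("dL/dmu:", grad_mu)
print("dL/dlogvar:", grad_logvar)
```

Because sigma is computed as exp(0.5 * logvar), the encoder head can emit any real number for logvar and the standard deviation stays positive.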

Versus the score-function estimator

The score-function (REINFORCE) estimator works for any $q_\phi$ with a differentiable log-density, without requiring a reparameterization:

$$\nabla_\phi \, \mathbb{E}_{z \sim q_\phi(z)}\left[f(z)\right] = \mathbb{E}_{z \sim q_\phi(z)}\left[f(z) \, \nabla_\phi \log q_\phi(z)\right]$$

It’s universal but high-variance: $f(z)$ multiplies the score, so noise in $f(z)$ is amplified in the gradient estimate. Variance reduction (baselines, control variates) helps but rarely closes the gap.

Reparameterization is typically lower-variance because it uses gradient information from $f$ directly (the pathwise derivative carries shape information about $f$, not just scalar values). When applicable, prefer it.
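
The variance gap can be seen on a toy problem. The sketch below is an illustration under assumed settings (f(z) = z^2, z drawn from a Gaussian with mean 1.5 and standard deviation 0.8, gradient taken with respect to the mean only); it makes no claim about how large the gap is in other problems.

```python
import random
import statistics

def grad_mu_samples(mu, sigma, n, seed=0):
    """Per-sample estimates of d/dmu E[z^2] under z ~ N(mu, sigma^2)."""
    rng = random.Random(seed)
    pathwise, score = [], []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps
        pathwise.append(2.0 * z)                     # f'(z) * dz/dmu
        score.append(z * z * (z - mu) / sigma ** 2)  # f(z) * d/dmu log q(z)
    return pathwise, score

pathwise, score = grad_mu_samples(mu=1.5, sigma=0.8, n=50_000)
print("exact gradient:", 2 * 1.5)
print("pathwise mean, variance:", statistics.fmean(pathwise), statistics.variance(pathwise))
print("score    mean, variance:", statistics.fmean(score), statistics.variance(score))
```

Both estimators average to the same gradient; the difference shows up in the per-sample variance, which determines how many samples are needed for a usable estimate.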

Other applications

The same pathwise-gradient construction appears wherever a network needs to inject learnable-scale noise and still get gradients through the noise scale:

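  • Noisy top-k gating in Mixture of Experts (Shazeer et al. 2017): $H(x)_i = (x W_g)_i + \epsilon_i \cdot \operatorname{softplus}((x W_{\text{noise}})_i)$ with $\epsilon_i \sim \mathcal{N}(0, 1)$. Structurally identical to $z = \mu + \sigma \odot \epsilon$ but for gating logits rather than posterior samples. Purpose is exploration and load balancing across experts, not posterior approximation, but the mechanism is the same pathwise gradient through $\sigma$.
  • Stochastic policies in continuous-control RL (SAC): action $a = f_\phi(\epsilon; s)$, for example $a = \tanh(\mu_\phi(s) + \sigma_\phi(s) \odot \epsilon)$ with $\epsilon \sim \mathcal{N}(0, I)$; gradients flow into the policy network through the action. Replaces high-variance score-function policy gradients with low-variance pathwise gradients of the Q-value, one of the reasons SAC works well at scale.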
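
The noisy-gating case can be sketched in a few lines. This is an illustration, not the implementation from the paper: the lists `clean` and `raw_noise` are hypothetical stand-ins for the two linear projections of the input, the helper names are made up, and the load-balancing losses are left out.

```python
import math
import random

def softplus(x):
    return math.log1p(math.exp(x))

def noisy_gate_logits(clean, raw_noise, eps):
    """H_i = clean_i + eps_i * softplus(raw_noise_i)."""
    return [c + e * softplus(r) for c, r, e in zip(clean, raw_noise, eps)]

def top_k_softmax(logits, k):
    """Keep the k largest logits, softmax over them, zero everywhere else."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in keep)
    exps = {i: math.exp(logits[i] - m) for i in keep}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]

rng = random.Random(0)
clean = [0.2, 1.0, -0.5, 0.7]        # stands in for the clean gating projection
raw_noise = [0.0, -1.0, 0.5, 0.3]    # stands in for the noise-scale projection
eps = [rng.gauss(0.0, 1.0) for _ in clean]
h = noisy_gate_logits(clean, raw_noise, eps)
gates = top_k_softmax(h, k=2)
# Pathwise derivative of each noisy logit with respect to its noise-scale input:
# dH_i / draw_noise_i = eps_i * sigmoid(raw_noise_i)
dh_draw = [e / (1.0 + math.exp(-r)) for r, e in zip(raw_noise, eps)]
print("gates:", gates)
print("dH/draw_noise:", dh_draw)
```

The softplus plays the role that exp(0.5 * logvar) plays in the VAE: it maps an unconstrained network output to a positive noise scale.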

Limitations

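Reparameterization requires $z$ to be a differentiable function of $\phi$ given $\epsilon$. This rules out:

  • Discrete latents: can’t differentiate through a categorical sample. Workarounds: Gumbel-Softmax (continuous relaxation, biased but low-variance), straight-through estimator (exact discrete value in the forward pass, biased gradient in the backward pass; the choice in VQ-VAE), or fall back to score-function gradients. See also DPO for discrete sampling.
  • Distributions without nice reparameterizations: gamma, Dirichlet, etc. Several extensions cover these cases: generalized reparameterization (Ruiz et al. 2016) uses transformations that remove only part of the dependence on $\phi$ and adds a score-function correction term; reparameterization through rejection samplers (Naesseth et al. 2017) differentiates through the accepted proposal; implicit reparameterization (Figurnov et al. 2018) differentiates the CDF implicitly, so no closed-form inverse CDF is needed.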
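
The Gumbel-Softmax workaround from the first bullet is itself a reparameterization: the noise is parameter-free, and the relaxed sample is a smooth function of the logits. A minimal sketch, with hypothetical logits and temperatures chosen only for illustration:

```python
import math
import random

def gumbel_softmax_sample(logits, temperature, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    # Gumbel(0, 1) noise via the inverse CDF: g = -log(-log(u)), u ~ Uniform(0, 1).
    # The bounds are clipped slightly so that log(0) can never be hit.
    gumbels = [
        -math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12))) for _ in logits
    ]
    scores = [(l + g) / temperature for l, g in zip(logits, gumbels)]
    m = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
logits = [1.0, 0.5, -0.5]
soft = gumbel_softmax_sample(logits, temperature=1.0, rng=rng)
sharp = gumbel_softmax_sample(logits, temperature=0.05, rng=rng)
print("temperature 1.0 :", soft)
print("temperature 0.05:", sharp)
```

Lower temperatures push samples toward one-hot vectors, which reduces the bias of the relaxation but makes the gradients with respect to the logits larger and noisier.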

The discrete case is exactly what motivates VQ-VAE’s codebook + straight-through gradient design.
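
A stripped-down sketch of that design: the forward pass snaps the encoder output to its nearest codebook vector, and the backward pass copies the gradient straight across as if quantization were the identity. The codebook, the vectors, and the function names are hypothetical, and the codebook and commitment loss terms that VQ-VAE also uses are omitted.

```python
def quantize(z_e, codebook):
    """Forward pass: replace z_e with its nearest codebook vector (squared L2)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    index = min(range(len(codebook)), key=lambda k: sq_dist(z_e, codebook[k]))
    return index, list(codebook[index])

def straight_through_backward(grad_z_q):
    """Backward pass: treat quantization as identity, so dL/dz_e := dL/dz_q."""
    return list(grad_z_q)

codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]  # hypothetical 3-entry codebook
z_e = [0.8, 1.3]                                   # hypothetical encoder output
index, z_q = quantize(z_e, codebook)
grad_z_q = [0.1, -0.4]                             # stands in for dL/dz_q from the decoder
grad_z_e = straight_through_backward(grad_z_q)
print("chosen code:", index, z_q)
print("gradient passed to encoder:", grad_z_e)
```

The copied gradient is biased because the true derivative of the nearest-neighbor lookup is zero almost everywhere; the estimator trades that bias for a usable training signal.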