Reparameterization trick

This note comes from a discussion with Claude Opus 4.7 while reading a VAE tutorial.

The reparameterization trick is what makes VAE training work end-to-end: it lets gradients flow through a sampling step.

The problem

We want to maximize an objective of the form:

$$L(ϕ) = \mathbb{E}_{q_{ϕ}(z)}[f(z)]$$

Gradient-based optimization needs $\nabla_{ϕ} L$. But $ϕ$ controls the distribution we sample from, not the function inside, so:

$$\nabla_{ϕ} \mathbb{E}_{q_{ϕ}(z)}[f(z)] \neq \mathbb{E}_{q_{ϕ}(z)}[\nabla_{ϕ} f(z)]$$

The expectation itself depends on $ϕ$ through the sampling density. The gradient doesn’t commute with the expectation.
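
To see where the naive interchange fails, it may help to write the expectation as an integral. This is a sketch that assumes $q_{ϕ}$ is a density smooth enough in $ϕ$ to differentiate under the integral sign, which the note does not state explicitly:

$$
\nabla_{ϕ} \mathbb{E}_{q_{ϕ}(z)}[f(z)] = \nabla_{ϕ} \int q_{ϕ}(z) \, f(z) \, dz = \int f(z) \, \nabla_{ϕ} q_{ϕ}(z) \, dz
$$

The derivative lands on the density, not on $f$. Since $f$ has no direct dependence on $ϕ$, $\nabla_{ϕ} f(z) = 0$ and the naive right-hand side would be identically zero. And because $\nabla_{ϕ} q_{ϕ}(z)$ is not itself a density, the last integral cannot be estimated by simply averaging over samples from $q_{ϕ}$.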

The trick

If $z$ can be written as a deterministic function of $ϕ$ and a $ϕ$-free noise variable:

$$z = g_{ϕ}(ϵ), \quad ϵ \sim p(ϵ)$$

then the expectation can be rewritten over the fixed base distribution, which no longer depends on $ϕ$:

$$\mathbb{E}_{q_{ϕ}(z)}[f(z)] = \mathbb{E}_{p(ϵ)}[f(g_{ϕ}(ϵ))]$$

Now the gradient passes through cleanly:

$$\nabla_{ϕ} \mathbb{E}_{p(ϵ)}[f(g_{ϕ}(ϵ))] = \mathbb{E}_{p(ϵ)}[\nabla_{ϕ} f(g_{ϕ}(ϵ))]$$

A Monte Carlo estimate is then straightforward: sample $ϵ$, compute $\nabla_{ϕ} f(g_{ϕ}(ϵ))$ via autograd, and average over samples.
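
A minimal sketch of that estimate in plain Python, with the chain rule written out by hand instead of using autograd. The test function $f(z) = z^{2}$, the scalar Gaussian $z = μ + σϵ$, and the name `pathwise_grad` are all assumptions chosen for illustration, picked because the exact answer is known in closed form:

```python
import random
import statistics

def pathwise_grad(mu, sigma, n_samples=50_000, seed=0):
    """Monte Carlo pathwise gradient of E[f(z)] for f(z) = z**2, z = mu + sigma * eps."""
    rng = random.Random(seed)
    grads_mu, grads_sigma = [], []
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)        # phi-free noise from the base distribution
        z = mu + sigma * eps             # z = g_phi(eps)
        df_dz = 2.0 * z                  # f'(z) for the assumed f(z) = z**2
        grads_mu.append(df_dz * 1.0)     # chain rule: dz/dmu = 1
        grads_sigma.append(df_dz * eps)  # chain rule: dz/dsigma = eps
    return statistics.fmean(grads_mu), statistics.fmean(grads_sigma)

g_mu, g_sigma = pathwise_grad(mu=1.5, sigma=0.7)
# Closed form: E[z^2] = mu^2 + sigma^2, so the exact gradient is (2*mu, 2*sigma).
print(g_mu, g_sigma)
```

With an autograd framework, the two hand-written chain-rule lines would be replaced by a backward pass through `z`.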

Gaussian case (the VAE one)

For $q_{ϕ}(z \mid x) = \mathcal{N}(z; μ_{ϕ}(x), σ_{ϕ}(x)^{2} I)$:

$$z = μ_{ϕ}(x) + σ_{ϕ}(x) ⊙ ϵ, \quad ϵ \sim \mathcal{N}(0, I)$$

The encoder outputs $μ$ and $\log σ^{2}$ (logvar, for numerical stability: exponentiating keeps $σ^{2}$ positive without constraints). The sample $z$ is a deterministic function of $μ$, $σ$, and $ϵ$, so backprop flows through $μ$ and $σ$ into the encoder.

This is why VAE encoder heads output two quantities, $μ$ and $\log σ^{2}$, rather than a sample.
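
The Gaussian sampling step can be sketched as follows. This is an illustration in plain Python rather than any particular framework's API: the function names `reparameterize` and `local_grads` and the example values for `mu` and `logvar` are hypothetical, and the derivatives are written by hand to show what autograd would propagate.

```python
import math
import random

def reparameterize(mu, logvar, eps):
    """Return z = mu + sigma * eps elementwise, with sigma = exp(0.5 * logvar)."""
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, logvar, eps)]

def local_grads(logvar, eps):
    """Elementwise derivatives of z for a fixed eps.

    dz/dmu = 1 and dz/dlogvar = 0.5 * sigma * eps; these are the factors that
    carry the gradient back into the encoder's two output heads.
    """
    dz_dmu = [1.0 for _ in logvar]
    dz_dlogvar = [0.5 * math.exp(0.5 * lv) * e for lv, e in zip(logvar, eps)]
    return dz_dmu, dz_dlogvar

rng = random.Random(0)
mu = [0.3, -1.2]       # hypothetical mean head output for one input x
logvar = [-0.5, 0.8]   # hypothetical log-variance head output
eps = [rng.gauss(0.0, 1.0) for _ in mu]  # noise drawn independently of the encoder

z = reparameterize(mu, logvar, eps)
dz_dmu, dz_dlogvar = local_grads(logvar, eps)
print(z, dz_dmu, dz_dlogvar)
```

Note that `eps` is sampled outside `reparameterize`, so for a fixed draw the map from `(mu, logvar)` to `z` is an ordinary differentiable function.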

Versus the score-function estimator

The score-function (REINFORCE) estimator works for any $q_{ϕ}$ without requiring a reparameterization:

$$\nabla_{ϕ} \mathbb{E}_{q_{ϕ}}[f(z)] = \mathbb{E}_{q_{ϕ}}[f(z) \nabla_{ϕ} \log q_{ϕ}(z)]$$

It’s universal but high-variance — $f (z)$ multiplies the score, so noise in $f$ amplifies into the gradient estimate. Variance reduction (baselines, control variates) helps but rarely closes the gap.

Reparameterization is typically lower-variance because it uses gradient information from $f$ directly: the pathwise derivative carries shape information about $f$, not just scalar values. When applicable, prefer it.
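
The variance gap can be checked numerically on a toy problem. The sketch below assumes $f(z) = z^{2}$ and a scalar Gaussian $q_{ϕ} = \mathcal{N}(μ, σ^{2})$, and compares per-sample gradient estimates with respect to $μ$ only; the setup and the name `per_sample_grads` are illustrative choices, not taken from the discussion above.

```python
import random
import statistics

def per_sample_grads(mu, sigma, n_samples=50_000, seed=0):
    """Per-sample estimates of d/dmu E[z**2] under z ~ N(mu, sigma**2)."""
    rng = random.Random(seed)
    pathwise, score = [], []
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps
        f = z * z
        pathwise.append(2.0 * z)               # f'(z) * dz/dmu, with dz/dmu = 1
        score.append(f * (z - mu) / sigma**2)  # f(z) * d/dmu log N(z; mu, sigma^2)
    return pathwise, score

pathwise, score = per_sample_grads(mu=1.5, sigma=0.7)
mean_path, mean_score = statistics.fmean(pathwise), statistics.fmean(score)
var_path, var_score = statistics.pvariance(pathwise), statistics.pvariance(score)
# Both estimators target the same exact gradient, 2 * mu = 3.0,
# but their per-sample variances differ.
print(mean_path, mean_score, var_path, var_score)
```

Both means should land near the exact gradient $2μ$, while the score-function samples are spread much more widely for this particular $f$. The size of the gap depends on $f$ and on the distribution, so this is one data point, not a general guarantee.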

Other applications

The same pathwise-gradient construction appears wherever a network needs to inject learnable-scale noise and still get gradients through the noise scale:

Noisy top-k gating in Mixture of Experts (Shazeer et al. 2017): $H(x)_{i} = (x W_{g})_{i} + ϵ \cdot \operatorname{softplus}((x W_{n})_{i})$ with $ϵ \sim \mathcal{N}(0, 1)$. Structurally identical to $z = μ + σ ⊙ ϵ$, but for gating logits rather than posterior samples. The purpose is exploration and load balancing across experts, not posterior approximation, but the mechanism is the same pathwise gradient through $W_{n}$.
Stochastic policies in continuous-control RL (SAC): the action is $a = μ_{ϕ}(s) + σ_{ϕ}(s) ⊙ ϵ$, so gradients flow into the policy network through the action. This replaces high-variance score-function policy gradients with lower-variance pathwise gradients of the Q-value, one of the reasons SAC works well at scale.
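
The noisy-gating formula can be sketched in the same hand-differentiated style. This is only an illustration: `clean_logits` and `noise_logits` are hypothetical stand-ins for the precomputed values $(x W_{g})_{i}$ and $(x W_{n})_{i}$, the noise is assumed to be drawn independently per expert, and the top-k selection step is left out.

```python
import math
import random

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def noisy_logits(clean_logits, noise_logits, eps):
    """H_i = clean_i + eps_i * softplus(noise_i), one value per expert."""
    return [c + e * softplus(n) for c, n, e in zip(clean_logits, noise_logits, eps)]

def grad_wrt_noise_logits(noise_logits, eps):
    """dH_i/dnoise_i = eps_i * sigmoid(noise_i), since softplus'(x) = sigmoid(x)."""
    return [e * sigmoid(n) for n, e in zip(noise_logits, eps)]

rng = random.Random(0)
clean_logits = [1.0, -0.5, 0.2]   # hypothetical (x W_g)_i for three experts
noise_logits = [0.1, 0.7, -1.0]   # hypothetical (x W_n)_i for three experts
eps = [rng.gauss(0.0, 1.0) for _ in clean_logits]

h_values = noisy_logits(clean_logits, noise_logits, eps)
dh_dnoise = grad_wrt_noise_logits(noise_logits, eps)
print(h_values, dh_dnoise)
```

Here `softplus(noise_i)` plays the role of $σ$ and `clean_i` the role of $μ$, so the gradient with respect to the noise scale is again the noise draw times a local derivative.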

Limitations

Reparameterization requires $z$ to be a differentiable function of $ϕ$ given $ϵ$. This rules out, or at least complicates:

Discrete latents: can’t differentiate through a categorical sample. Workarounds: Gumbel-Softmax (continuous relaxation, biased but low-variance), straight-through estimator (exact discrete value in the forward pass, biased gradient in the backward pass; the choice in VQ-VAE), or fall back to score-function gradients. See also DPO for discrete sampling.
Distributions without a simple closed-form reparameterization: gamma, Dirichlet, etc. Generalized reparameterization (Ruiz et al. 2016) and implicit reparameterization (Figurnov et al. 2018) extend the trick to such cases; the latter differentiates the CDF implicitly rather than requiring an explicit inverse, and related approaches differentiate through rejection samplers.

The discrete case is exactly what motivates VQ-VAE’s codebook + straight-through gradient design.
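
For the Gumbel-Softmax workaround mentioned above, a minimal forward-pass sketch might look like this. It assumes the standard construction (add Gumbel(0, 1) noise to the logits, divide by a temperature, apply softmax); the function name, the example probabilities, and the temperatures are illustrative.

```python
import math
import random

def gumbel_softmax_sample(logits, temperature, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    # Gumbel(0, 1) noise via the inverse CDF; max() guards against log(0).
    gumbels = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in logits]
    scaled = [(l + g) / temperature for l, g in zip(logits, gumbels)]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
logits = [math.log(p) for p in (0.2, 0.3, 0.5)]  # hypothetical class probabilities
soft = gumbel_softmax_sample(logits, temperature=1.0, rng=rng)
sharp = gumbel_softmax_sample(logits, temperature=0.1, rng=rng)
print(soft, sharp)
```

For a fixed noise draw the output is a smooth function of `logits`, which is what lets a pathwise gradient through. Lower temperatures push the sample toward a one-hot vector, at the cost of larger gradients; the bias comes from using the relaxed vector in place of a true discrete sample.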

Yanda's Random Notes
