Generated via Claude 4.6 Opus; the result of a conversation.
The score function is the gradient of the log-probability with respect to the parameters:

$$\nabla_\theta \log p_\theta(x)$$
It appears across statistics, machine learning, and physics — not by coincidence, but because it encodes the geometry of how probability distributions change with their parameters.
Why Log?
Probabilities have multiplicative structure (independent events multiply), but calculus works in additive spaces. The logarithm bridges the two: it is the unique function (up to scale) that converts multiplicative structure to additive structure while respecting independence. This is the same reason $-\log p(x)$ appears as the optimal code length in coding theory.
For exponential families (the most natural parametric distributions), the log-probability is linear in the natural parameters, up to the log-partition function:

$$\log p_\theta(x) = \theta^\top T(x) - A(\theta) + \log h(x)$$
So the score function is especially clean here: $\nabla_\theta \log p_\theta(x) = T(x) - \nabla_\theta A(\theta) = T(x) - \mathbb{E}_{p_\theta}[T(X)]$, the sufficient statistic minus its expectation.
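A minimal numerical sanity check of this (a NumPy sketch; the Bernoulli-in-natural-parameters setup and the helper names are illustrative assumptions, not from the original): the analytic score $x - \sigma(\theta)$ should agree with a finite-difference derivative of the log-probability.

```python
import numpy as np

# Bernoulli in its natural parameterization:
# log p(x) = theta*x - A(theta), with T(x) = x and A(theta) = log(1 + e^theta)
def log_prob(theta, x):
    return theta * x - np.log1p(np.exp(theta))

def score(theta, x):
    # T(x) - E[T(X)] = x - sigmoid(theta)
    return x - 1.0 / (1.0 + np.exp(-theta))

theta, x, eps = 0.7, 1.0, 1e-6
finite_diff = (log_prob(theta + eps, x) - log_prob(theta - eps, x)) / (2 * eps)
print(finite_diff, score(theta, x))  # the two values should agree to ~1e-6
```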
Two Key Properties
Property 1: Zero Mean
The normalization constraint $\int p_\theta(x)\,dx = 1$ holds for all $\theta$. Differentiating both sides:

$$0 = \nabla_\theta \int p_\theta(x)\,dx = \int p_\theta(x)\,\nabla_\theta \log p_\theta(x)\,dx = \mathbb{E}_{p_\theta}\!\left[\nabla_\theta \log p_\theta(x)\right]$$

So $\mathbb{E}_{p_\theta}\!\left[\nabla_\theta \log p_\theta(x)\right] = 0$. This is not an accident; it is a geometric consequence of staying on the probability simplex. Any direction that preserves normalization must have zero expected score.
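A quick Monte Carlo check of the zero-mean property (a sketch, assuming a 1-D Gaussian $N(\mu, \sigma^2)$, whose score with respect to $\mu$ is $(x - \mu)/\sigma^2$):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.0
x = rng.normal(mu, sigma, size=1_000_000)

score_mu = (x - mu) / sigma**2   # d/dmu log N(x; mu, sigma^2)
print(score_mu.mean())           # ~0, up to O(1/sqrt(N)) Monte Carlo noise
```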
Property 2: Tangent Vectors on the Probability Manifold
The score function components $\partial_i \log p_\theta(x)$ form the natural basis for the tangent space at $p_\theta$ on the Statistical Manifold. The inner product of two tangent vectors under the distribution $p_\theta$ defines the Fisher Information matrix:

$$F_{ij}(\theta) = \mathbb{E}_{p_\theta}\!\left[\partial_i \log p_\theta(x)\,\partial_j \log p_\theta(x)\right]$$
This gives the Statistical Manifold its Riemannian structure.
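To make the outer-product definition concrete, here is a sketch that estimates the Fisher matrix as the empirical second moment of the score, assuming a Gaussian $N(\mu, \sigma^2)$ parameterized by $(\mu, \sigma)$, whose analytic Fisher matrix is $\mathrm{diag}(1/\sigma^2,\ 2/\sigma^2)$:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.5, 2.0
x = rng.normal(mu, sigma, size=500_000)

# Score components of N(mu, sigma^2) with respect to (mu, sigma)
s_mu = (x - mu) / sigma**2
s_sigma = (x - mu)**2 / sigma**3 - 1.0 / sigma

scores = np.stack([s_mu, s_sigma], axis=1)
F_hat = scores.T @ scores / len(x)              # empirical E[score score^T]
F_true = np.diag([1 / sigma**2, 2 / sigma**2])  # analytic Fisher matrix
print(F_hat)
print(F_true)
```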
The Log-Derivative Trick
The identity $\nabla_\theta p_\theta(x) = p_\theta(x)\,\nabla_\theta \log p_\theta(x)$, which is just the chain rule applied to $\log p_\theta(x)$, converts derivatives of probabilities into expectations:

$$\nabla_\theta\, \mathbb{E}_{p_\theta}[f(x)] = \int f(x)\,\nabla_\theta p_\theta(x)\,dx = \mathbb{E}_{p_\theta}\!\left[f(x)\,\nabla_\theta \log p_\theta(x)\right]$$
This matters because we can now estimate the gradient by sampling from $p_\theta$; we never need to differentiate through the sampling process itself. This is the core of:
- Policy Gradient (REINFORCE): $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[R(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\right]$
- Black-box variational inference: gradient estimation when reparameterization is unavailable
- Evolution strategies and related black-box optimization methods
The zero-mean property (Property 1) is separately useful for variance reduction: since $\mathbb{E}_{p_\theta}\!\left[\nabla_\theta \log p_\theta(x)\right] = 0$, subtracting any constant baseline $b$ from $f(x)$ does not change the expected gradient, but it can dramatically reduce variance.
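A small sketch of the score-function estimator with a constant baseline, assuming the toy problem $x \sim N(\theta, 1)$ and $f(x) = x^2$, so that $\mathbb{E}[f] = \theta^2 + 1$ and the true gradient is $2\theta$ (the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n = 1.5, 100_000

x = rng.normal(theta, 1.0, size=n)
f = x**2
score = x - theta                # d/dtheta log N(x; theta, 1)

grad_naive = f * score           # unbiased, but high variance
b = theta**2 + 1.0               # constant baseline (here, the analytic E[f])
grad_baseline = (f - b) * score  # same expected value, much lower variance

print("true gradient:", 2 * theta)
print("naive:   ", grad_naive.mean(), " var:", grad_naive.var())
print("baseline:", grad_baseline.mean(), " var:", grad_baseline.var())
```

Using the analytic $\mathbb{E}[f]$ as the baseline is a convenience of this toy problem; in practice REINFORCE typically uses an estimated value-function baseline.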
Score w.r.t. Data: Diffusion Models
In score-based generative models, the relevant object is the score with respect to the data rather than the parameters:

$$\nabla_x \log p(x)$$

This is a vector field pointing toward higher-density regions of $p(x)$. The reverse-time SDE (Anderson, 1982) uses this score to denoise:

$$dx = \left[f(x, t) - g(t)^2\,\nabla_x \log p_t(x)\right] dt + g(t)\,d\bar{w}$$

where $f(x, t)$ is the forward drift, $g(t)$ is the diffusion coefficient, and $d\bar{w}$ is a reverse-time Wiener increment. The score function $\nabla_x \log p_t(x)$ is estimated by a neural network trained via denoising score matching.
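As a toy stand-in for that pipeline, the sketch below assumes a 1-D two-component Gaussian mixture whose score can be computed directly (playing the role of the learned score network) and runs unadjusted Langevin dynamics along that score to draw samples:

```python
import numpy as np

rng = np.random.default_rng(3)

def log_p(x):
    # Mixture 0.5*N(-2, 1) + 0.5*N(2, 1), up to an additive constant
    return np.logaddexp(-0.5 * (x + 2)**2, -0.5 * (x - 2)**2)

def score(x, eps=1e-4):
    # Central finite difference of log p; in a diffusion model this would be
    # the learned, time-conditioned score network
    return (log_p(x + eps) - log_p(x - eps)) / (2 * eps)

x = rng.normal(0.0, 1.0, size=5_000)   # start from noise
step = 0.05
for _ in range(2_000):                  # unadjusted Langevin dynamics
    x = x + step * score(x) + np.sqrt(2 * step) * rng.normal(size=x.shape)

print(np.mean(x < 0), np.mean(x > 0))   # roughly 0.5 / 0.5 across the two modes
```

In an actual diffusion model the score would be a time-conditioned network plugged into the reverse-time SDE above; the Langevin loop here only illustrates "follow the score uphill, plus noise."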
Same formula, different spaces
The parameter score and the data score are both “gradient of log-probability,” but they live in completely different spaces and serve different purposes. The parameter score is a vector in parameter space that defines the Fisher Information metric and enables the Policy Gradient trick. The data score is a vector field in data space used for denoising via Langevin dynamics and reverse-time SDEs. The information geometry story (Fisher metric, natural gradient, KL curvature) does not carry over to the data score. What they share is that $\log p$ is the canonical object for doing calculus with distributions, so $\nabla \log p$ is the natural “direction” to write down; but the deep reasons each is useful are different (Tweedie’s formula and reverse-time SDEs for diffusion; the normalization constraint and tangent space geometry for everything else).
Connections
The score function ties together several ideas:
- It defines the tangent space of the Statistical Manifold
- Its outer product gives the Fisher Information matrix (the Riemannian metric)
- Its zero-mean property enables the Policy Gradient trick
- The Natural Policy Gradient uses $F^{-1}\nabla_\theta J$ to move in the geometry the score defines (see the sketch after this list)
- The local quadratic approximation of KL Divergence is expressed through the Fisher metric built from score functions
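As a closing sketch that ties the gradient estimator, the Fisher matrix, and the Natural Policy Gradient together, the example below assumes the toy objective $J(\mu, \sigma) = \mathbb{E}_{x \sim N(\mu, \sigma^2)}[-(x - 3)^2]$ and reuses the same Gaussian score components as in the Fisher example above:

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n = 0.0, 2.0, 200_000
x = rng.normal(mu, sigma, size=n)
f = -(x - 3.0)**2                 # reward; J is maximized at mu = 3

# Score components of N(mu, sigma^2) with respect to (mu, sigma)
s_mu = (x - mu) / sigma**2
s_sigma = (x - mu)**2 / sigma**3 - 1.0 / sigma
scores = np.stack([s_mu, s_sigma], axis=1)

g = (scores * (f - f.mean())[:, None]).mean(axis=0)  # score-function gradient (with baseline)
F = scores.T @ scores / n                             # Fisher matrix from score outer products
nat_g = np.linalg.solve(F, g)                         # natural gradient F^{-1} g

print("vanilla gradient:", g)      # ~ (6, -4): dJ/d(mu, sigma) at (0, 2)
print("natural gradient:", nat_g)  # the same update, preconditioned by the Fisher metric
```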