Generated via Claude 4.6 Opus; the result of a conversation.
The score function is the gradient of the log-probability with respect to the parameters:

$$\nabla_\theta \log p_\theta(x)$$
It appears across statistics, machine learning, and physics — not by coincidence, but because it encodes the geometry of how probability distributions change with their parameters.
Why Log?
Probabilities have multiplicative structure (independent events multiply), but calculus works in additive spaces. The logarithm bridges the two: it is the unique function (up to scale) that converts multiplicative structure to additive structure while respecting independence. This is the same reason $-\log p(x)$ appears as the optimal code length in coding theory.
For exponential families (the most natural parametric distributions), the log-probability is linear in the natural parameters, up to the log-partition function:

$$\log p_\theta(x) = \theta^\top T(x) - A(\theta) + \log h(x)$$
So the score function is especially clean here: $\nabla_\theta \log p_\theta(x) = T(x) - \nabla_\theta A(\theta) = T(x) - \mathbb{E}_{p_\theta}[T(X)]$, the sufficient statistic minus its expectation.
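A minimal numerical sanity check of this (a NumPy sketch; the Bernoulli-in-natural-parameters setup and the helper names are illustrative assumptions, not from the original): the analytic score $x - \sigma(\theta)$ should agree with a finite-difference derivative of the log-probability.

```python
import numpy as np

# Bernoulli in its natural parameterization:
# log p(x) = theta*x - A(theta), with T(x) = x and A(theta) = log(1 + e^theta)
def log_prob(theta, x):
    return theta * x - np.log1p(np.exp(theta))

def score(theta, x):
    # T(x) - E[T(X)] = x - sigmoid(theta)
    return x - 1.0 / (1.0 + np.exp(-theta))

theta, x, eps = 0.7, 1.0, 1e-6
finite_diff = (log_prob(theta + eps, x) - log_prob(theta - eps, x)) / (2 * eps)
print(finite_diff, score(theta, x))  # the two values should agree to ~1e-6
```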
Two Key Properties
Property 1: Zero Mean
The normalization constraint $\int p_\theta(x)\,dx = 1$ holds for all $\theta$. Differentiating both sides:

$$0 = \nabla_\theta \int p_\theta(x)\,dx = \int p_\theta(x)\,\nabla_\theta \log p_\theta(x)\,dx = \mathbb{E}_{p_\theta}\!\left[\nabla_\theta \log p_\theta(x)\right]$$

So $\mathbb{E}_{p_\theta}\!\left[\nabla_\theta \log p_\theta(x)\right] = 0$. This is not an accident; it is a geometric consequence of staying on the probability simplex. Any direction that preserves normalization must have zero expected score.
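A quick Monte Carlo check of the zero-mean property (a sketch, assuming a 1-D Gaussian $N(\mu, \sigma^2)$, whose score with respect to $\mu$ is $(x - \mu)/\sigma^2$):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.0
x = rng.normal(mu, sigma, size=1_000_000)

score_mu = (x - mu) / sigma**2   # d/dmu log N(x; mu, sigma^2)
print(score_mu.mean())           # ~0, up to O(1/sqrt(N)) Monte Carlo noise
```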
Property 2: Tangent Vectors on the Probability Manifold
The score function components $\partial_i \log p_\theta(x)$ form the natural basis for the tangent space at $p_\theta$ on the Statistical Manifold. The inner product of two tangent vectors under the distribution $p_\theta$ defines the Fisher Information matrix:

$$F_{ij}(\theta) = \mathbb{E}_{p_\theta}\!\left[\partial_i \log p_\theta(x)\,\partial_j \log p_\theta(x)\right]$$
This gives the Statistical Manifold its Riemannian structure.
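To make the outer-product definition concrete, here is a sketch that estimates the Fisher matrix as the empirical second moment of the score, assuming a Gaussian $N(\mu, \sigma^2)$ parameterized by $(\mu, \sigma)$, whose analytic Fisher matrix is $\mathrm{diag}(1/\sigma^2,\ 2/\sigma^2)$:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.5, 2.0
x = rng.normal(mu, sigma, size=500_000)

# Score components of N(mu, sigma^2) with respect to (mu, sigma)
s_mu = (x - mu) / sigma**2
s_sigma = (x - mu)**2 / sigma**3 - 1.0 / sigma

scores = np.stack([s_mu, s_sigma], axis=1)
F_hat = scores.T @ scores / len(x)              # empirical E[score score^T]
F_true = np.diag([1 / sigma**2, 2 / sigma**2])  # analytic Fisher matrix
print(F_hat)
print(F_true)
```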
The Log-Derivative Trick
The identity $\nabla_\theta p_\theta(x) = p_\theta(x)\,\nabla_\theta \log p_\theta(x)$, which is just the chain rule applied to $\log p_\theta(x)$, converts derivatives of probabilities into expectations:

$$\nabla_\theta\, \mathbb{E}_{p_\theta}[f(x)] = \int f(x)\,\nabla_\theta p_\theta(x)\,dx = \mathbb{E}_{p_\theta}\!\left[f(x)\,\nabla_\theta \log p_\theta(x)\right]$$
This matters because we can now estimate the gradient by sampling from $p_\theta$; we never need to differentiate through the sampling process itself. This is the core of:
- Policy Gradient (REINFORCE): $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[R(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\right]$
- Black-box variational inference: gradient estimation when reparameterization is unavailable
- Evolution strategies and related black-box optimization methods
The zero-mean property (Property 1) is separately useful for variance reduction: since $\mathbb{E}_{p_\theta}\!\left[\nabla_\theta \log p_\theta(x)\right] = 0$, subtracting any constant baseline $b$ from $f(x)$ does not change the expected gradient, but it can dramatically reduce variance.
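A small sketch of the score-function estimator with a constant baseline, assuming the toy problem $x \sim N(\theta, 1)$ and $f(x) = x^2$, so that $\mathbb{E}[f] = \theta^2 + 1$ and the true gradient is $2\theta$ (the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n = 1.5, 100_000

x = rng.normal(theta, 1.0, size=n)
f = x**2
score = x - theta                # d/dtheta log N(x; theta, 1)

grad_naive = f * score           # unbiased, but high variance
b = theta**2 + 1.0               # constant baseline (here, the analytic E[f])
grad_baseline = (f - b) * score  # same expected value, much lower variance

print("true gradient:", 2 * theta)
print("naive:   ", grad_naive.mean(), " var:", grad_naive.var())
print("baseline:", grad_baseline.mean(), " var:", grad_baseline.var())
```

Using the analytic $\mathbb{E}[f]$ as the baseline is a convenience of this toy problem; in practice REINFORCE typically uses an estimated value-function baseline.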
Score w.r.t. Data: Diffusion Models
In score-based generative models, the relevant object is the score with respect to the data rather than the parameters:

$$\nabla_x \log p(x)$$

This is a vector field pointing toward higher-density regions of $p(x)$. The reverse-time SDE (Anderson, 1982) uses this score to denoise:

$$dx = \left[f(x, t) - g(t)^2\,\nabla_x \log p_t(x)\right] dt + g(t)\,d\bar{w}$$

where $f(x, t)$ is the forward drift, $g(t)$ is the diffusion coefficient, and $d\bar{w}$ is a reverse-time Wiener increment. The score function $\nabla_x \log p_t(x)$ is estimated by a neural network trained via denoising score matching.
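As a toy stand-in for that pipeline, the sketch below assumes a 1-D two-component Gaussian mixture whose score can be computed directly (playing the role of the learned score network) and runs unadjusted Langevin dynamics along that score to draw samples:

```python
import numpy as np

rng = np.random.default_rng(3)

def log_p(x):
    # Mixture 0.5*N(-2, 1) + 0.5*N(2, 1), up to an additive constant
    return np.logaddexp(-0.5 * (x + 2)**2, -0.5 * (x - 2)**2)

def score(x, eps=1e-4):
    # Central finite difference of log p; in a diffusion model this would be
    # the learned, time-conditioned score network
    return (log_p(x + eps) - log_p(x - eps)) / (2 * eps)

x = rng.normal(0.0, 1.0, size=5_000)   # start from noise
step = 0.05
for _ in range(2_000):                  # unadjusted Langevin dynamics
    x = x + step * score(x) + np.sqrt(2 * step) * rng.normal(size=x.shape)

print(np.mean(x < 0), np.mean(x > 0))   # roughly 0.5 / 0.5 across the two modes
```

In an actual diffusion model the score would be a time-conditioned network plugged into the reverse-time SDE above; the Langevin loop here only illustrates "follow the score uphill, plus noise."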
Same formula, different spaces
The parameter score and the data score are both “gradient of log-probability,” but they live in completely different spaces and serve different purposes. The parameter score is a vector in parameter space that defines the Fisher Information metric and enables the Policy Gradient trick. The data score is a vector field in data space used for denoising via Langevin dynamics and reverse-time SDEs. The information geometry story (Fisher metric, natural gradient, KL curvature) does not carry over to the data score. What they share is that $\log p$ is the canonical object for doing calculus with distributions, so $\nabla \log p$ is the natural “direction” to write down; but the deep reasons each is useful are different (Tweedie’s formula and reverse-time SDEs for diffusion; the normalization constraint and tangent space geometry for everything else).
Connections
The score function ties together several ideas:
- It defines the tangent space of the Statistical Manifold
- Its outer product gives the Fisher Information matrix (the Riemannian metric)
- Its zero-mean property enables the Policy Gradient trick
- The Natural Policy Gradient uses $F^{-1}\nabla_\theta J$ to move in the geometry the score defines (see the sketch after this list)
- The local quadratic approximation of KL Divergence is expressed through the Fisher metric built from score functions
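As a closing sketch that ties the gradient estimator, the Fisher matrix, and the Natural Policy Gradient together, the example below assumes the toy objective $J(\mu, \sigma) = \mathbb{E}_{x \sim N(\mu, \sigma^2)}[-(x - 3)^2]$ and reuses the same Gaussian score components as in the Fisher example above:

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n = 0.0, 2.0, 200_000
x = rng.normal(mu, sigma, size=n)
f = -(x - 3.0)**2                 # reward; J is maximized at mu = 3

# Score components of N(mu, sigma^2) with respect to (mu, sigma)
s_mu = (x - mu) / sigma**2
s_sigma = (x - mu)**2 / sigma**3 - 1.0 / sigma
scores = np.stack([s_mu, s_sigma], axis=1)

g = (scores * (f - f.mean())[:, None]).mean(axis=0)  # score-function gradient (with baseline)
F = scores.T @ scores / n                             # Fisher matrix from score outer products
nat_g = np.linalg.solve(F, g)                         # natural gradient F^{-1} g

print("vanilla gradient:", g)      # ~ (6, -4): dJ/d(mu, sigma) at (0, 2)
print("natural gradient:", nat_g)  # the same update, preconditioned by the Fisher metric
```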