Generated via Claude 4.6 Opus, from a conversation.
The Fisher information matrix measures how sensitively a probability distribution responds to changes in its parameters. It is defined as the covariance of the Score Function $\nabla_\theta \log p(x \mid \theta)$:

$$F(\theta) = \mathbb{E}_{x \sim p(\cdot \mid \theta)}\!\left[\nabla_\theta \log p(x \mid \theta)\, \nabla_\theta \log p(x \mid \theta)^\top\right]$$
Since $\mathbb{E}_{x \sim p(\cdot \mid \theta)}\!\left[\nabla_\theta \log p(x \mid \theta)\right] = 0$ (the score has zero mean), this is both the second moment and the covariance of the score.
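As a quick numerical sketch of the definition (the Bernoulli model and sample size here are chosen purely for illustration): draw samples, compute the score of each, and average the squared scores; the result should approach the closed form $1/(\theta(1-\theta))$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3                                  # Bernoulli parameter, chosen for illustration
x = rng.binomial(1, theta, size=200_000)

# Score of log p(x | theta) = x*log(theta) + (1 - x)*log(1 - theta)
score = x / theta - (1 - x) / (1 - theta)

fisher_mc = np.mean(score**2)                # second moment of the score (its mean is zero)
fisher_exact = 1.0 / (theta * (1 - theta))   # closed form for the Bernoulli
print(fisher_mc, fisher_exact)               # ~4.76 vs 4.7619...
```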
Equivalent Form
Under regularity conditions (exchange of differentiation and integration), there is an equivalent expression as the negative expected Hessian of the log-likelihood:

$$F(\theta) = -\mathbb{E}_{x \sim p(\cdot \mid \theta)}\!\left[\nabla_\theta^2 \log p(x \mid \theta)\right]$$
The equivalence follows from differentiating the zero-mean identity once more with respect to $\theta$. This form is often more convenient for computation, especially in exponential families, where the Hessian of the log-likelihood has a clean closed form.
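The same Bernoulli sketch as above, now through the negative-Hessian form: the second derivative of the log-likelihood is $-x/\theta^2 - (1-x)/(1-\theta)^2$, and its negated average gives the same number.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
x = rng.binomial(1, theta, size=200_000)

# Second derivative of the Bernoulli log-likelihood: -x/theta^2 - (1-x)/(1-theta)^2
hessian = -x / theta**2 - (1 - x) / (1 - theta)**2
print(-np.mean(hessian))                     # ~4.76, same as 1/(theta*(1-theta))
```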
As a Riemannian Metric
The Fisher information matrix is the unique (up to scale) Riemannian metric on the Statistical Manifold that is invariant under sufficient statistics — this is Čencov’s theorem.
Concretely, it defines the infinitesimal distance between nearby distributions:

$$ds^2 = d\theta^\top F(\theta)\, d\theta$$
This tells you that some parameter directions change the distribution a lot (large eigenvalues of $F$) while others barely affect it (small eigenvalues). Euclidean distance in parameter space ignores this entirely.
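For a concrete instance, take $\mathcal{N}(\mu, \sigma^2)$ (its Fisher matrix is worked out in the Examples section below). A pure mean step $d\theta = (d\mu, 0)$ has Fisher length

$$ds^2 = \frac{d\mu^2}{\sigma^2},$$

so the same Euclidean step $d\mu$ is a large move in distribution space when $\sigma$ is small and a negligible one when $\sigma$ is large.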
The connection to KL Divergence is direct: for nearby distributions $p_\theta$ and $p_{\theta + d\theta}$, the Taylor expansion of KL divergence gives

$$D_{\mathrm{KL}}\!\left(p_\theta \,\|\, p_{\theta + d\theta}\right) = \tfrac{1}{2}\, d\theta^\top F(\theta)\, d\theta + O(\|d\theta\|^3)$$
The first two terms of the expansion vanish (the zeroth because $D_{\mathrm{KL}}(p_\theta \,\|\, p_\theta) = 0$, the first because the score has zero mean). So the Fisher matrix is the Hessian of KL divergence at coincidence — it is the local curvature of KL divergence.
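A small numeric check of this expansion (a sketch using the closed-form Gaussian KL and the mean-variance Fisher matrix listed in the Examples section; the step size is arbitrary):

```python
import numpy as np

def kl_gauss(mu0, v0, mu1, v1):
    """Closed-form KL(N(mu0, v0) || N(mu1, v1)) for scalar Gaussians."""
    return 0.5 * (np.log(v1 / v0) + (v0 + (mu0 - mu1)**2) / v1 - 1.0)

mu, v = 0.0, 1.0
d = np.array([0.01, 0.02])                   # small step in (mean, variance)
F = np.diag([1 / v, 1 / (2 * v**2)])         # Fisher matrix in (mean, variance) coordinates

kl_exact = kl_gauss(mu, v, mu + d[0], v + d[1])
kl_quad = 0.5 * d @ F @ d                    # half the Fisher quadratic form
print(kl_exact, kl_quad)                     # ~1.46e-4 vs 1.5e-4
```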
Natural Gradient
Standard gradient descent treats all parameter directions equally — it uses the Euclidean metric (the identity matrix). But on the Statistical Manifold, the natural metric is the Fisher matrix. The natural gradient corrects for this:

$$\tilde{\nabla}_\theta L = F(\theta)^{-1}\, \nabla_\theta L$$
This is the steepest ascent direction in the geometry of distributions rather than in parameter space. It is invariant to reparameterization: if you change coordinates $\theta \to \phi(\theta)$, the natural gradient update produces the same change in the distribution.
This is exactly what the Natural Policy Gradient uses, and it motivates TRPO, which enforces a trust region directly in KL divergence.
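A minimal sketch of a single natural-gradient step for a univariate Gaussian parameterized by mean and variance, using its closed-form Fisher matrix; the data, learning rate, and initialization are placeholders. Solving a linear system avoids forming $F^{-1}$ explicitly.

```python
import numpy as np

def nll_grad(mu, v, x):
    """Gradient of the average negative log-likelihood of N(mu, v) w.r.t. (mu, v)."""
    d_mu = -np.mean(x - mu) / v
    d_v = 0.5 / v - np.mean((x - mu)**2) / (2 * v**2)
    return np.array([d_mu, d_v])

rng = np.random.default_rng(1)
x = rng.normal(2.0, 0.5, size=10_000)        # data from N(2, 0.25)
mu, v = 0.0, 1.0                             # initial mean and variance
lr = 0.1

g = nll_grad(mu, v, x)
F = np.diag([1 / v, 1 / (2 * v**2)])         # Fisher matrix at the current (mu, v)
nat_step = np.linalg.solve(F, g)             # natural gradient F^{-1} g via a linear solve
mu, v = np.array([mu, v]) - lr * nat_step
print(mu, v)                                 # parameters after one natural-gradient step
```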
Cramér-Rao Bound
For any unbiased estimator $\hat{\theta}$ of $\theta$, the covariance of the estimator is bounded below:

$$\operatorname{Cov}\!\left(\hat{\theta}\right) \succeq F(\theta)^{-1}$$
in the positive semidefinite sense. This means the Fisher information quantifies the best possible precision of any unbiased estimator. High Fisher information at $\theta$ means the data is informative about $\theta$ — the distribution changes rapidly there, so observations can pin down the parameter.
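A quick illustration with assumed numbers: for $n$ i.i.d. $\mathcal{N}(\mu, \sigma^2)$ samples the sample mean is unbiased, the total Fisher information about $\mu$ is $n/\sigma^2$, and the empirical variance of the sample mean sits essentially at the bound $\sigma^2/n$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma, trials = 50, 2.0, 20_000

samples = rng.normal(0.0, sigma, size=(trials, n))
sample_means = samples.mean(axis=1)

crb = sigma**2 / n                           # inverse of the total Fisher information n / sigma^2
print(sample_means.var(), crb)               # both ~0.08: the bound is attained here
```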
Connection to the Information Filter
The “information” in the Kalman Filter’s information form is genuinely Fisher information. For a Gaussian with known covariance $\Sigma$, the Fisher information about the mean from a single observation is $\Sigma^{-1}$ — the precision matrix.
In the information filter’s natural parameterization ($\Lambda = P^{-1}$, $\eta = P^{-1}\hat{x}$), the measurement update for an observation $z = Hx + v$, $v \sim \mathcal{N}(0, R)$, becomes:

$$\Lambda^{+} = \Lambda^{-} + H^\top R^{-1} H, \qquad \eta^{+} = \eta^{-} + H^\top R^{-1} z$$
The term $H^\top R^{-1} H$ is exactly the Fisher information that observation $z$ provides about the state $x$. Conditioning on a new measurement means adding its Fisher information to the current precision. This is why the natural parameterization makes the update additive — Fisher information from independent observations adds.
The Kalman filter’s covariance achieves the Cramér-Rao bound: it is the minimum-variance estimator for the linear-Gaussian setting, and $P^{-1}$ is the total accumulated Fisher information from all observations.
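A minimal sketch of the additive measurement update (function and variable names here are illustrative, not tied to any particular library): each observation contributes $H^\top R^{-1} H$ to the information matrix and $H^\top R^{-1} z$ to the information vector.

```python
import numpy as np

def information_update(Lam, eta, H, R, z):
    """Information-filter measurement update: add the observation's Fisher information."""
    R_inv = np.linalg.inv(R)
    Lam_new = Lam + H.T @ R_inv @ H          # precision += Fisher information H^T R^{-1} H
    eta_new = eta + H.T @ R_inv @ z          # information vector += H^T R^{-1} z
    return Lam_new, eta_new

# Example: 2D state, scalar measurement of the first component
Lam = np.eye(2) * 0.1                        # weak prior precision
eta = np.zeros(2)
H = np.array([[1.0, 0.0]])
R = np.array([[0.5]])

Lam, eta = information_update(Lam, eta, H, R, np.array([3.0]))
x_hat = np.linalg.solve(Lam, eta)            # recover the mean estimate from (Lam, eta)
print(x_hat)                                 # first component pulled toward the measurement
```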
Examples
Gaussian with parameters $(\mu, \sigma^2)$:

$$F(\mu, \sigma^2) = \begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 1/(2\sigma^4) \end{pmatrix}$$

The off-diagonal is zero because mean and variance are informationally orthogonal. Estimating the mean is easier (its Cramér-Rao bound scales as $\sigma^2/n$) than estimating the variance (whose bound scales as $2\sigma^4/n$). The natural gradient would take larger steps in the variance direction to compensate.
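As with the Bernoulli check above, this closed form can be verified by averaging score outer products (a sketch with arbitrary values for the mean and variance):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, v = 0.0, 2.0                             # assumed values for the check
x = rng.normal(mu, np.sqrt(v), size=500_000)

# Scores of log N(x | mu, v) with respect to (mu, v)
s_mu = (x - mu) / v
s_v = -0.5 / v + (x - mu)**2 / (2 * v**2)
scores = np.stack([s_mu, s_v], axis=1)

F_mc = scores.T @ scores / len(x)            # empirical covariance of the score
print(F_mc.round(3))                         # ~diag(1/v, 1/(2 v^2)) = diag(0.5, 0.125)
```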
Categorical distribution with probabilities $(p_1, \dots, p_k)$: the Fisher metric is $ds^2 = \sum_i dp_i^2 / p_i$, which is the metric that makes the Statistical Manifold of categorical distributions a sphere under the square-root parameterization $q_i = \sqrt{p_i}$ (there $ds^2 = 4 \sum_i dq_i^2$, restricted to the unit sphere $\sum_i q_i^2 = 1$).
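A small check of the square-root picture (a sketch with an arbitrary 3-category distribution and a tiny tangent step): the Fisher length on the simplex matches four times the Euclidean length of the corresponding step in $q$ coordinates.

```python
import numpy as np

p = np.array([0.2, 0.3, 0.5])
dp = np.array([0.001, -0.0004, -0.0006])     # tangent to the simplex: entries sum to zero

ds2_fisher = np.sum(dp**2 / p)               # Fisher length: sum_i dp_i^2 / p_i

q, q_new = np.sqrt(p), np.sqrt(p + dp)       # square-root coordinates q_i = sqrt(p_i)
ds2_sphere = 4 * np.sum((q_new - q)**2)      # 4x Euclidean length on the unit sphere

print(ds2_fisher, ds2_sphere)                # agree to first order in dp
```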