Generated via Claude 4.6 Opus, from a conversation.


The Kullback-Leibler divergence measures how one probability distribution $P$ differs from a reference distribution $Q$:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$

It has several independent derivations. The most concrete starts from coding theory.


Coding Theory Motivation

Why start from coding theory for a quantity about probability distributions?

A probability distribution assigns beliefs over outcomes. The coding question asks: if you had to act on those beliefs — commit resources under uncertainty — how costly would it be? Code length is the most stripped-down version of that: allocating a finite resource (bits) based on how likely you think each outcome is.

$-\log_2 p(x)$ keeps appearing not because we care about bits per se, but because the logarithm is the unique function that converts multiplicative probability structure into additive cost structure while respecting independence. Coding theory is just the cleanest context where that becomes concrete — you can point at a literal string of bits and say “this is what your wrong beliefs cost you.”

You could skip coding entirely: define $D_{\mathrm{KL}}(P \,\|\, Q)$ by formula, prove non-negativity, show uniqueness from axioms. That’s mathematically complete but unmotivated. The coding interpretation gives physical intuition for why each piece is there — why the $P(x)$ weights, why the $\log$, why the ratio $P(x)/Q(x)$.

The Setup: Encoding Messages

Suppose we need to transmit a sequence of symbols (events, classes, outcomes) over a channel. Each symbol is drawn from some distribution $P$. We want to assign a binary code to each symbol — shorter codes for frequent symbols, longer codes for rare ones.

For example, if we have four symbols with equal probability $1/4$, we need 2 bits each: 00, 01, 10, 11. But if symbol A occurs 50% of the time, we’d prefer to give it a shorter code.
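
To make the tradeoff concrete, here is a minimal sketch in Python (the skewed distribution and the specific codewords are illustrative choices, not taken from the text above):

```python
# Four symbols with skewed probabilities and a prefix code matched to them:
# more probable symbols get shorter codewords.
probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
code = {"A": "0", "B": "10", "C": "110", "D": "111"}

expected_length = sum(p * len(code[s]) for s, p in probs.items())
print(expected_length)  # 1.75 bits/symbol, vs. 2 bits/symbol for a fixed-length code
```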

The Kraft Inequality: What Constrains Code Lengths

For a prefix code (no codeword is a prefix of another) — and hence a uniquely decodable one — the code lengths $\ell_i$ must satisfy:

$$\sum_i 2^{-\ell_i} \le 1$$

This is the Kraft inequality. It’s a hard constraint from the structure of binary prefix codes — it limits how many short codewords you can have. If you make one codeword shorter, others must get longer.
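
A quick numerical illustration (the length sets are arbitrary examples):

```python
# Kraft sums for two candidate sets of code lengths.
def kraft_sum(lengths):
    return sum(2 ** -l for l in lengths)

print(kraft_sum([1, 2, 3, 3]))  # 1.0  -> a prefix code with these lengths exists (0, 10, 110, 111)
print(kraft_sum([1, 1, 2]))     # 1.25 -> violates Kraft: no such prefix code exists
```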

Optimal Code Lengths → Entropy

Given this constraint, what code lengths minimize the expected cost $\sum_i p_i \ell_i$? This is a constrained optimization problem. Using Lagrange multipliers, the solution is:

$$\ell_i^* = -\log_2 p_i$$
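
A sketch of that Lagrange-multiplier step, treating the lengths as continuous and the Kraft constraint as tight (the standard relaxation):

$$\mathcal{L} = \sum_i p_i \ell_i + \lambda \Big(\sum_i 2^{-\ell_i} - 1\Big), \qquad \frac{\partial \mathcal{L}}{\partial \ell_i} = p_i - \lambda \ln 2 \cdot 2^{-\ell_i} = 0 \;\Rightarrow\; 2^{-\ell_i} \propto p_i.$$

The constraint $\sum_i 2^{-\ell_i} = 1$ then fixes the constant, giving $2^{-\ell_i} = p_i$, i.e. $\ell_i^* = -\log_2 p_i$.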

Plugging $\ell_i^* = -\log_2 p_i$ back in, the minimum achievable expected code length is:

$$H(P) = -\sum_i p_i \log_2 p_i$$

This is entropy — not defined axiomatically, but derived as the solution to “minimize expected code length subject to Kraft.” That’s Shannon’s source coding theorem: no lossless code beats $H(P)$, and codes exist that come arbitrarily close.

Note

Fractional bits: $-\log_2 p_i$ is generally not an integer. Practical codes like Huffman coding round to integers (achieving average length between $H(P)$ and $H(P) + 1$). Arithmetic coding gets arbitrarily close to $H(P)$ by encoding sequences jointly rather than symbol-by-symbol. The theorem is an asymptotic statement about what’s achievable in principle.
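
A quick empirical check of that gap, using a hypothetical five-symbol distribution and a standard Huffman construction (sketch; the helper names are mine):

```python
# Compare entropy with the average codeword length of a binary Huffman code.
import heapq, math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

def huffman_lengths(probs):
    """Codeword lengths of a binary Huffman code, tracked without building the tree."""
    # Heap items: (probability, tiebreak id, {symbol: current code length})
    heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)  # two least probable groups
        p2, _, d2 = heapq.heappop(heap)
        merged = {s: l + 1 for s, l in {**d1, **d2}.items()}  # one more bit for each symbol
        heapq.heappush(heap, (p1 + p2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

probs = {"A": 0.45, "B": 0.25, "C": 0.15, "D": 0.10, "E": 0.05}
lengths = huffman_lengths(probs)
avg = sum(probs[s] * lengths[s] for s in probs)
print(f"entropy = {entropy(probs):.3f} bits")  # ~1.977
print(f"huffman = {avg:.3f} bits")             # 2.000, inside [H, H+1)
```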

Using the Wrong Code → Cross-Entropy

Now suppose reality follows $P$, but we design our code for a different distribution $Q$. By the same Kraft argument, the optimal code lengths for $Q$ are $\ell_i = -\log_2 q_i$. But symbols still occur with frequency $p_i$. The expected code length becomes:

$$H(P, Q) = -\sum_i p_i \log_2 q_i$$

This is cross-entropy. The $p_i$ weighting doesn’t change — reality is still $P$ — we’ve just plugged in the wrong code lengths.

The Excess Cost → KL Divergence

The difference between what we pay and what we could have paid:

$$D_{\mathrm{KL}}(P \,\|\, Q) = H(P, Q) - H(P) = \sum_i p_i \log_2 \frac{p_i}{q_i}$$

KL divergence is the extra bits per symbol you pay for using $Q$ instead of $P$.
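
A minimal numerical check of the identity $D_{\mathrm{KL}}(P \,\|\, Q) = H(P, Q) - H(P)$, with two illustrative distributions:

```python
# Entropy, cross-entropy, and KL divergence (all in bits).
import math

p = [0.5, 0.25, 0.125, 0.125]  # what reality does
q = [0.25, 0.25, 0.25, 0.25]   # what our code assumes

H_p = -sum(pi * math.log2(pi) for pi in p)               # 1.75: optimal bits/symbol
H_pq = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))  # 2.00: bits/symbol with q's code
kl = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

print(H_p, H_pq, kl)  # 1.75  2.0  0.25
assert abs(kl - (H_pq - H_p)) < 1e-12  # extra bits per symbol = KL
```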

"Surprise" is an interpretation, not the foundation

$-\log_2 p(x)$ is often introduced as the “surprise” of event $x$, with entropy defined as “expected surprise.” This is a useful interpretation — the quantity is indeed zero for certain events, large for rare ones, and additive for independent events. But these properties aren’t axioms chosen because “surprise should work this way.” They’re consequences of $-\log_2 p(x)$ being the optimal code length under the Kraft inequality. “Surprise” is a name we give to the result after the fact.


Other Derivations

Hypothesis Testing

Given i.i.d. samples from $P$, the expected log-likelihood ratio between $P$ and $Q$ is:

$$\mathbb{E}_{x \sim P}\!\left[\log \frac{P(x)}{Q(x)}\right] = D_{\mathrm{KL}}(P \,\|\, Q)$$

KL divergence is the expected evidence per sample in favor of $P$ over $Q$ when $P$ is true. This connects to Stein’s lemma: $D_{\mathrm{KL}}(P \,\|\, Q)$ governs the exponential decay rate of Type II error in hypothesis testing.
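
A small simulation of that statement, using two hypothetical three-outcome distributions (working in nats rather than bits):

```python
# The sample average of log(P(x)/Q(x)) under x ~ P converges to KL(P || Q).
import math, random

random.seed(0)
p = [0.6, 0.3, 0.1]
q = [0.2, 0.5, 0.3]

kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))  # analytic value, ~0.396 nats

n = 200_000
samples = random.choices(range(len(p)), weights=p, k=n)   # i.i.d. draws from P
avg_llr = sum(math.log(p[x] / q[x]) for x in samples) / n

print(kl, avg_llr)  # agree to a few decimal places
```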

Axiomatic Uniqueness

KL divergence can be characterized as essentially the unique divergence satisfying:

  • $D_{\mathrm{KL}}(P \,\|\, Q) \ge 0$, with equality iff $P = Q$
  • Additivity over independent variables
  • Invariance under sufficient statistics (the data processing inequality)

This is formalized through Csiszár’s work on $f$-divergences.
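
For reference, the family being invoked (a standard definition, not spelled out above): a Csiszár $f$-divergence is, for convex $f$ with $f(1) = 0$,

$$D_f(P \,\|\, Q) = \sum_x Q(x)\, f\!\left(\frac{P(x)}{Q(x)}\right),$$

and KL divergence is the member with $f(t) = t \log t$.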


Connection to Riemannian Geometry

For two nearby distributions $p_\theta$ and $p_{\theta + d\theta}$ on the Statistical Manifold, KL divergence to second order is:

$$D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta + d\theta}) \approx \frac{1}{2}\, d\theta^\top F(\theta)\, d\theta$$

where $F(\theta)$ is the Fisher Information matrix. The Fisher metric is the local quadratic approximation of KL divergence — this is how information geometry derives the Riemannian structure of probability space from KL divergence, not the other way around.

This is why the Natural Gradient — which uses $F(\theta)^{-1}$ to precondition the gradient — is the steepest descent direction in the geometry induced by KL divergence rather than Euclidean distance.
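
A numerical sanity check of the second-order approximation for a one-parameter family (Bernoulli, chosen purely for illustration):

```python
# For Bernoulli(theta), Fisher information is F(theta) = 1 / (theta * (1 - theta)),
# and KL(p_theta || p_{theta+d}) ≈ 0.5 * F(theta) * d^2 for small d.
import math

def kl_bernoulli(a, b):
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

theta, d = 0.3, 1e-3
fisher = 1.0 / (theta * (1.0 - theta))

exact = kl_bernoulli(theta, theta + d)
approx = 0.5 * fisher * d ** 2

print(exact, approx)  # agree to leading order in d
```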


Properties

  • Non-negative: $D_{\mathrm{KL}}(P \,\|\, Q) \ge 0$ (Gibbs’ inequality), with equality iff $P = Q$
  • Asymmetric: $D_{\mathrm{KL}}(P \,\|\, Q) \ne D_{\mathrm{KL}}(Q \,\|\, P)$ in general — it is not a true metric
  • Not a distance: violates triangle inequality and symmetry
  • Additive: for independent variables, $D_{\mathrm{KL}}(P_1 \times P_2 \,\|\, Q_1 \times Q_2) = D_{\mathrm{KL}}(P_1 \,\|\, Q_1) + D_{\mathrm{KL}}(P_2 \,\|\, Q_2)$

See Also