Generated via Claude 4.6 Opus, resulting from a conversation.
Cross-entropy loss is the standard loss function for classification. It can be derived from multiple perspectives — maximum likelihood estimation, information theory, and KL divergence — all of which converge on the same formula. This page walks through the derivation and then consolidates the views.
Derivation from Maximum Likelihood
Assuming i.i.d. data, the likelihood of the dataset is $\prod_{i=1}^{N} q_\theta(y_i \mid x_i)$, where $q_\theta(y \mid x)$ is the model’s predicted probability of label $y$ given input $x$. Taking the log converts the product to a sum, and flipping the sign (so we can minimize) gives the Negative Log-Likelihood (NLL):

$$\mathrm{NLL}(\theta) = -\sum_{i=1}^{N} \log q_\theta(y_i \mid x_i)$$
For a single sample with true class $c$, this is simply $-\log q(c)$, where $q(c)$ abbreviates the model’s predicted probability of class $c$ for that sample.
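For concreteness, here is a minimal NumPy sketch of the per-sample NLL; the 3-class probabilities and the class index are made-up values, not anything from the derivation above.

```python
import numpy as np

# Hypothetical softmax output of a 3-class model for one sample (made-up numbers).
q = np.array([0.1, 0.7, 0.2])
c = 1  # index of the true class

# Per-sample NLL: negative log of the probability assigned to the true class.
nll = -np.log(q[c])
print(nll)  # ~0.357
```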
No $p(x)$ here
Notice that this derivation never introduces a weighting by $p(x)$. It only says: for each sample, penalize the negative log of the predicted probability at the correct class. The connection to the full cross-entropy formula only becomes clear when we aggregate samples — see below.
Cross-Entropy from Information Theory
For two distributions $p$ (truth) and $q$ (prediction) over events $x$, cross-entropy is defined as:

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
The $p(x)$ weighting comes from taking an expectation under the true distribution. The quantity $-\log q(x)$ is the “surprise” or information content of seeing event $x$ under model $q$. Cross-entropy is the expected surprise when reality follows $p$ but you’re using $q$:

$$H(p, q) = \mathbb{E}_{x \sim p}\left[-\log q(x)\right]$$
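The expectation form can be illustrated numerically. The sketch below uses made-up distributions $p$ and $q$ and checks that the Monte Carlo average of the surprise $-\log q(x)$, with events sampled from $p$, approaches the cross-entropy from the definition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up true distribution p and model distribution q over 3 events.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

# Exact cross-entropy from the definition: H(p, q) = -sum_x p(x) log q(x).
H_exact = -np.sum(p * np.log(q))

# Expected-surprise view: sample events according to p (reality),
# pay the surprise -log q(x) for each, and average.
samples = rng.choice(len(p), size=100_000, p=p)
H_monte_carlo = np.mean(-np.log(q[samples]))

print(H_exact, H_monte_carlo)  # the Monte Carlo average approaches the exact value
```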
Why weight by $p(x)$ and not $q(x)$?
Because we’re asking: “how costly is it to use code $q$ when reality is $p$?” In coding theory, $-\log q(x)$ is the code length assigned to event $x$, and $p(x)$ is how often that event actually occurs. The expected message length under reality is what matters.
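To make the coding-theory reading concrete, here is a small sketch with made-up distributions, treating $-\log_2 q(x)$ as an idealized code length in bits and comparing the expected message length for a mismatched code versus one matched to reality.

```python
import numpy as np

# Made-up distributions: p is how often each symbol actually occurs;
# q is the distribution the code was designed for.
p = np.array([0.7, 0.2, 0.1])
q = np.array([1/3, 1/3, 1/3])  # a code built for a uniform source

# Idealized code length for symbol x: -log2 q(x) bits.
lengths = -np.log2(q)

# Expected message length per symbol, weighted by how often symbols occur in reality.
expected_mismatched = np.sum(p * lengths)       # cross-entropy H(p, q) in bits
expected_matched = np.sum(p * -np.log2(p))      # entropy H(p): code matched to reality

print(expected_mismatched, expected_matched)    # the mismatched code is longer on average
```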
Collapse to NLL in Classification
In supervised classification, the true label is one-hot: $p(x) = 1$ for the correct class $c$, $p(x) = 0$ otherwise. The sum collapses:

$$H(p, q) = -\sum_{x} p(x) \log q(x) = -\log q(c)$$
This is exactly the per-sample NLL from the likelihood derivation.
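A short sketch (again with made-up numbers) confirming that the full cross-entropy sum with a one-hot $p$ collapses to the per-sample NLL $-\log q(c)$:

```python
import numpy as np

# Hypothetical predicted distribution over 4 classes and a one-hot true label.
q = np.array([0.05, 0.10, 0.80, 0.05])
c = 2                      # true class index
p = np.zeros(4)
p[c] = 1.0                 # one-hot encoding of the label

cross_entropy = -np.sum(p * np.log(q))   # full sum over classes
nll = -np.log(q[c])                      # per-sample NLL

print(cross_entropy, nll)  # identical: every term with p(x) = 0 vanishes
```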
Why These Views Agree
The agreement is not a coincidence. All three framings answer the same question: how well does $q$ approximate $p$?
The $p(x)$ weighting emerges from data
In the empirical setting, we don’t know $p$ analytically — we have samples. The empirical distribution is $\hat{p}(c) = n_c / N$, where $n_c$ counts class $c$. The average NLL naturally becomes:

$$-\frac{1}{N}\sum_{i=1}^{N} \log q(y_i) = -\sum_{c} \frac{n_c}{N} \log q(c) = -\sum_{c} \hat{p}(c) \log q(c)$$
The $p(x)$ weighting is not an axiom introduced from information theory — it falls out of aggregating per-sample log-likelihoods by class frequency.
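This aggregation can be checked numerically. The sketch below uses a made-up label vector and, for simplicity, a single fixed predicted distribution $q$ shared by all samples, so the per-sample average and the class-frequency-weighted sum can be compared directly.

```python
import numpy as np

# Made-up labels over 3 classes and a fixed predicted distribution q.
labels = np.array([0, 0, 0, 1, 1, 2])   # n_0 = 3, n_1 = 2, n_2 = 1, N = 6
q = np.array([0.5, 0.3, 0.2])

# Average per-sample NLL: -(1/N) * sum_i log q(y_i).
avg_nll = -np.mean(np.log(q[labels]))

# Class-frequency weighting: -sum_c p_hat(c) log q(c), with p_hat(c) = n_c / N.
counts = np.bincount(labels, minlength=len(q))
p_hat = counts / len(labels)
weighted = -np.sum(p_hat * np.log(q))

print(avg_nll, weighted)  # identical: the p(x) weighting is just class frequency
```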
KL divergence as the unifying frame
Both views are special cases of minimizing the KL Divergence between the true and predicted distributions:

$$D_{\mathrm{KL}}(p \parallel q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} = H(p, q) - H(p)$$
Since $H(p)$ is constant w.r.t. model parameters, minimizing KL divergence $\Leftrightarrow$ minimizing cross-entropy $\Leftrightarrow$ minimizing NLL.
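A quick numerical check of the decomposition $D_{\mathrm{KL}}(p \parallel q) = H(p, q) - H(p)$, once more with made-up distributions:

```python
import numpy as np

# Made-up true and predicted distributions.
p = np.array([0.6, 0.3, 0.1])
q = np.array([0.4, 0.4, 0.2])

H_pq = -np.sum(p * np.log(q))    # cross-entropy H(p, q)
H_p = -np.sum(p * np.log(p))     # entropy H(p), constant w.r.t. the model
kl = np.sum(p * np.log(p / q))   # KL divergence D_KL(p || q)

print(np.isclose(kl, H_pq - H_p))  # True: they differ only by the constant H(p)
```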
Summary
| Perspective | What you’re doing | Formula (single sample) |
|---|---|---|
| Maximum Likelihood | Maximize probability of observed data | $-\log q(c)$ |
| Information Theory | Minimize expected surprise under true dist. | $H(p, q) = -\sum_x p(x) \log q(x)$ |
| KL Divergence | Minimize divergence from true to predicted | $D_{\mathrm{KL}}(p \parallel q) = H(p, q) - H(p)$, constant shift from above |
For one-hot labels, all three reduce to $-\log q(c)$.