A good paper. Even the “Preliminaries” section is interesting enough that it might warrant a separate note; it’s now in RLHF.
The following note was generated from one of my discussions with Claude Sonnet 4.6.
DPO eliminates the explicit reward model and RL loop from RLHF by reparameterizing the reward in terms of the policy itself. The key insight: the policy is the reward model, via a log-ratio with the reference.
Setup
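Standard RLHF maximizes a KL-regularized reward objective:
$$\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D}} \Big[ \mathbb{E}_{y \sim \pi(\cdot \mid x)} \big[ r(x, y) \big] - \beta \, D_{\mathrm{KL}}\big( \pi(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \big) \Big]$$
where $x$ is the prompt and $y$ is the full generated response (a complete token sequence). The KL penalty, weighted by $\beta$, keeps the policy $\pi$ from drifting too far from the reference (SFT) model $\pi_{\text{ref}}$.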
Note that this is the reverse KL divergence $D_{\mathrm{KL}}(\pi \,\|\, \pi_{\text{ref}})$, which is mode-seeking around the reference. The direction matters: the closed-form exponential-family solution below comes from the variational characterization where you minimize $D_{\mathrm{KL}}(\pi \,\|\, \pi^*)$ against the optimal policy $\pi^*$, and that decomposition requires this KL direction. Forward KL wouldn’t give the same clean result.
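The decomposition referred to above can be written out for a fixed prompt $x$. This is a sketch that borrows the optimal policy $\pi^*(y \mid x) = \pi_{\text{ref}}(y \mid x) \exp(r(x,y)/\beta) / Z(x)$ and the partition function $Z(x)$ from Step 1 below:

$$
\begin{aligned}
\mathbb{E}_{y \sim \pi}\big[ r(x,y) \big] - \beta \, D_{\mathrm{KL}}\big( \pi \,\|\, \pi_{\text{ref}} \big)
&= -\beta \, \mathbb{E}_{y \sim \pi}\left[ \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x) \exp\big( r(x,y)/\beta \big)} \right] \\
&= -\beta \, \mathbb{E}_{y \sim \pi}\left[ \log \frac{\pi(y \mid x)}{\pi^*(y \mid x) \, Z(x)} \right] \\
&= -\beta \, D_{\mathrm{KL}}\big( \pi \,\|\, \pi^* \big) + \beta \log Z(x)
\end{aligned}
$$

Since $\beta \log Z(x)$ does not depend on $\pi$, maximizing the objective is the same as minimizing $D_{\mathrm{KL}}(\pi \,\|\, \pi^*)$, which is zero exactly when $\pi = \pi^*$.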
Derivation
Step 1: Closed-form optimal policy
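The KL-regularized objective has an analytic solution for any reward $r$:
$$\pi^*(y \mid x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y \mid x) \exp\left( \frac{1}{\beta} r(x, y) \right)$$
where $Z(x) = \sum_{y} \pi_{\text{ref}}(y \mid x) \exp\left( \frac{1}{\beta} r(x, y) \right)$ is the partition function.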
$Z(x)$ is not just expensive: it's intractable
$Z(x)$ sums over all possible token sequences of all lengths, a combinatorially explosive space. This is the same fundamental intractability as in energy-based models and undirected graphical models. It cannot be evaluated exactly, period. This is why the cancellation in Step 3 is essential, not merely convenient.
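On a toy problem where the response space is small enough to enumerate, the closed form can be checked numerically. The sketch below assumes a single prompt with only three candidate responses; the numbers and the helper names `optimal_policy` and `objective` are illustrative, not from the paper. With real token sequences the sum inside `optimal_policy` is exactly the part that cannot be computed.

```python
import math

def optimal_policy(ref_probs, rewards, beta):
    """Return (pi_star, Z) where pi_star(y) = pi_ref(y) * exp(r(y) / beta) / Z."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)  # partition function; enumerable only in this toy setting
    return [w / z for w in weights], z

def objective(policy, ref_probs, rewards, beta):
    """E_pi[r] - beta * KL(pi || pi_ref) for a single prompt."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, ref_probs) if p > 0)
    return expected_reward - beta * kl

# Hypothetical toy values: three candidate responses for one prompt.
ref_probs = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 0.5

pi_star, z = optimal_policy(ref_probs, rewards, beta)
best = objective(pi_star, ref_probs, rewards, beta)
at_reference = objective(ref_probs, ref_probs, rewards, beta)
at_other = objective([0.1, 0.2, 0.7], ref_probs, rewards, beta)

print("pi_star:", pi_star)
print("objective at pi_star:", best, "vs beta * log Z:", beta * math.log(z))
print("objective at pi_ref:", at_reference, "and at another policy:", at_other)
```

The objective value at `pi_star` coincides with `beta * log Z`, and any other policy tried here scores lower.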
Step 2: Invert — express reward in terms of policy
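Instead of reward → policy, flip it. Take logs of the optimal policy equation and rearrange for $r$:
$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$
The reward is a log-ratio of optimal policy to reference, plus a ==term that depends only on $x$, not $y$==.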
Step 3: Plug into Bradley-Terry preference model
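Human preferences are modeled as:
$$p(y_w \succ y_l \mid x) = \sigma\big( r(x, y_w) - r(x, y_l) \big)$$
where $y_w$ is the preferred completion, $y_l$ the dispreferred one, and $\sigma$ the logistic sigmoid. Substituting the reparameterized reward, the $\beta \log Z(x)$ terms cancel (same $x$, same prompt):
$$p(y_w \succ y_l \mid x) = \sigma\left( \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right)$$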
The cancellation depends structurally on pairwise comparisons of completions from the same prompt. Two pieces have to line up:
- Same prompt → same $Z(x)$ → cancels in the difference.
- Pairwise (not scalar) → the comparison takes a difference of rewards, which is what eliminates $Z(x)$. A single-completion scalar-reward objective leaves $Z(x)$ intact and DPO doesn’t apply.
This is why BT preference data fits the trick so naturally. Listwise rankings within a prompt work too (pairwise decomposition), but cross-prompt comparisons or absolute-score targets do not.
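The cancellation can be seen numerically on a toy response space where $Z(x)$ is still computable. The sketch below uses hypothetical values and illustrative helper names: it builds the optimal policy from known rewards, then recovers the same Bradley-Terry probability from policy log-ratios alone.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def bt_from_rewards(r_w, r_l):
    """Bradley-Terry preference probability from explicit rewards."""
    return sigmoid(r_w - r_l)

def bt_from_policy(pi_star, pi_ref, w, l, beta):
    """Same probability from beta * log-ratios only; Z never appears."""
    h_w = beta * math.log(pi_star[w] / pi_ref[w])
    h_l = beta * math.log(pi_star[l] / pi_ref[l])
    return sigmoid(h_w - h_l)

# Hypothetical toy values: three candidate responses for one prompt.
pi_ref = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 0.5

weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
z = sum(weights)
pi_star = [wt / z for wt in weights]

p_reward = bt_from_rewards(rewards[2], rewards[1])
p_policy = bt_from_policy(pi_star, pi_ref, 2, 1, beta)

# A single log-ratio is offset from the true reward by beta * log Z ...
single_log_ratio = beta * math.log(pi_star[2] / pi_ref[2])
offset = rewards[2] - single_log_ratio

print("BT from rewards:", p_reward, "BT from policy:", p_policy)
print("per-response offset:", offset, "beta * log Z:", beta * math.log(z))
```

A single log-ratio differs from the true reward by `beta * log Z`, which is why scalar targets do not work, while the pairwise difference matches exactly.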
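Step 4: MLE with trainable $\pi_\theta$
Step 3 gives a statistical model for preference probabilities parameterized by the policy. Replace $\pi^*$ with trainable $\pi_\theta$ and do MLE on the preference dataset $\mathcal{D}$:
$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]$$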
This is binary cross-entropy. No reward model, no RL rollouts, no PPO.
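As a minimal sketch of that loss, assuming the sequence log-probabilities (summed over tokens) of each chosen and rejected completion have already been computed under both the policy and the reference model: the function name `dpo_loss`, its argument names, and the numbers below are illustrative, and a real implementation would operate on framework tensors so that gradients flow into the policy.

```python
import math

def log_sigmoid(t):
    """Numerically stable log(sigmoid(t))."""
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))

def dpo_loss(policy_logps_w, policy_logps_l, ref_logps_w, ref_logps_l, beta=0.1):
    """Mean of -log sigmoid(beta * (log-ratio of chosen - log-ratio of rejected))."""
    total = 0.0
    for pw, pl, rw, rl in zip(policy_logps_w, policy_logps_l, ref_logps_w, ref_logps_l):
        margin = beta * ((pw - rw) - (pl - rl))
        total += -log_sigmoid(margin)
    return total / len(policy_logps_w)

# Hypothetical sequence log-probabilities for two preference pairs.
ref_w = [-11.0, -8.5]
ref_l = [-11.0, -8.5]
policy_w = [-10.0, -8.0]   # policy raised the chosen completions
policy_l = [-12.0, -9.0]   # and lowered the rejected ones

loss_at_reference = dpo_loss(ref_w, ref_l, ref_w, ref_l)
loss_after_shift = dpo_loss(policy_w, policy_l, ref_w, ref_l)
print(loss_at_reference, loss_after_shift)
```

When the policy equals the reference, every margin is zero and the loss is $\log 2$; moving probability mass toward the chosen completions lowers it.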
Why can we substitute $\pi_\theta$ for $\pi^*$?
This is just MLE, not a policy iteration argument. You have a parameterized family of distributions over preference pairs, indexed by $\pi_\theta$. You assume $\pi^*$ lies within (or is well-approximated by) the family $\{\pi_\theta\}$. MLE on a well-specified model recovers the true parameters given enough data. The sophistication was entirely in showing preferences can be written as policy log-ratios; the optimization step is standard.
DPO in the discrete-sampling taxonomy
See also Why Not Just Backprop? The Discrete Sampling Problem.
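Step back. The objective DPO and PPO both face is:
$$\max_{\theta} \; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \big[ r(x, y) \big]$$
(with KL regularization in both cases). The gradient $\nabla_\theta \, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}[r(x, y)]$ has the same problem as the VAE gradient $\nabla_\phi \, \mathbb{E}_{z \sim q_\phi(z \mid x)}[\log p(x \mid z)]$: the parameters sit inside the sampling distribution, and $y$ is a discrete token sequence, so the Reparameterization trick doesn’t apply. Two strategies exist: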
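Fight through it. Use the Score function estimator $\nabla_\theta \, \mathbb{E}_{y \sim \pi_\theta}[r(x, y)] = \mathbb{E}_{y \sim \pi_\theta}\big[ r(x, y) \, \nabla_\theta \log \pi_\theta(y \mid x) \big]$. PPO is the canonical example: REINFORCE-style gradients with clipping and importance ratios as variance reduction. It plays the same role as Gumbel-softmax / straight-through for discrete VAE latents (as in VQ-VAE): a specific trick to make gradients through discrete sampling workable.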
Avoid it. Reformulate so sampling never appears in the loss. DPO is the cleanest example: the algebraic chain (closed-form optimum → reward reparameterization → BT cancellation) replaces the entire expectation with log-probabilities of pre-collected completions $(y_w, y_l)$. KTO, IPO, and SimPO follow the same template with different preference assumptions.
So DPO is not “RLHF without the RL part” in some superficial sense — it’s a fundamentally different strategy for the same underlying problem. PPO computes a noisy estimator of an intractable gradient; DPO algebraically transforms the problem into one where no such gradient appears.
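To make the "noisy estimator" side concrete, here is a sketch of the plain score function (REINFORCE) estimator on a toy one-step categorical policy, compared with the exact gradient that is available only because the toy action space can be enumerated. This is not PPO: there is no clipping, baseline, or importance ratio, and the logits, rewards, and function names are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def exact_gradient(logits, rewards):
    """d/d logit_k of E_pi[r] = pi_k * (r_k - E_pi[r]) for a softmax policy."""
    probs = softmax(logits)
    mean_reward = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - mean_reward) for p, r in zip(probs, rewards)]

def score_function_gradient(logits, rewards, num_samples, rng):
    """Monte Carlo average of r(y) * grad log pi(y) over sampled y."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        y = rng.choices(range(len(probs)), weights=probs)[0]
        for k in range(len(logits)):
            # d/d logit_k of log pi(y) = 1[k == y] - pi_k
            indicator = 1.0 if k == y else 0.0
            grad[k] += rewards[y] * (indicator - probs[k])
    return [g / num_samples for g in grad]

# Hypothetical toy values: three discrete "responses".
logits = [0.2, -0.1, 0.5]
rewards = [0.0, 1.0, 2.0]
rng = random.Random(0)

exact = exact_gradient(logits, rewards)
estimate = score_function_gradient(logits, rewards, 20000, rng)
print("exact:   ", exact)
print("estimate:", estimate)
```

The estimate only approaches the exact gradient as the sample count grows, and each sample requires drawing from the current policy. The DPO loss needs neither.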
Key Intuitions
- The BT model connection is central — DPO takes it seriously as a latent variable model where the latent is the policy itself
- RLHF trained a separate reward model as an intermediate (BT regression with a scalar head), then ran PPO against it. DPO collapses both stages by exploiting the closed-form KL-constrained solution
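- $\pi_\theta$ implicitly represents the reward through its log-ratio with $\pi_{\text{ref}}$: $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$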
The General Pattern
The DPO trick generalizes wherever you see:
- A latent quantity (reward, value, energy) you don’t want to model explicitly
- A KL-regularized objective with a closed-form optimum of the form $\pi^* \propto \pi_{\text{ref}} \exp(r / \beta)$
- Pairwise comparisons that let the intractable normalizer cancel
Broader applicability
Any exponential family model with a KL constraint has a closed-form optimum that looks like $\pi^* \propto \pi_{\text{ref}} \exp(r / \beta)$, so wherever you see KL-regularized optimization + pairwise comparisons, DPO-style reasoning likely applies.
The same exponential-family algebra appears in the EM E-step (see Variational inference): the unconstrained $q$ maximizing the ELBO gives $q^*(z) \propto p(z) \, p(x \mid z)$, i.e. the posterior $p(z \mid x)$. Same closed-form structure, different surface application. KL-regularized reward maximization is essentially variational inference with $\pi_{\text{ref}}$ playing the role of the prior $p(z)$ and $r(x, y) / \beta$ playing the role of the log-likelihood $\log p(x \mid z)$. The intractable $Z(x)$ in DPO is the intractable evidence $p(x)$ in VI under another name.