DPO

A good paper. Even the “Preliminaries” part is interesting enough that I felt it warranted a separate note; it’s now in RLHF.

The following note was generated from one of my discussions with Claude Sonnet 4.6.

DPO eliminates the explicit reward model and RL loop from RLHF by reparameterizing the reward in terms of the policy itself. The key insight: the policy is the reward model, via a log-ratio with the reference.

Setup

Standard RLHF maximizes a KL-regularized reward objective:

$$\max_{π} \; E_{x \sim D,\, y \sim π} [r(x, y)] - β\, D_{KL}[π ∥ π_{ref}]$$

where $x$ is the prompt and $y$ is the full generated response (a complete token sequence). The KL penalty keeps the policy from drifting too far from the reference (SFT) model.

Note this is the reverse KL divergence $D_{KL}(π ∥ π_{ref})$, which is mode-seeking around the reference. The direction matters: the closed-form exponential-family solution below comes from the variational characterization in which you minimize $D_{KL}(π ∥ π_{ref} \cdot \exp(r / β) / Z)$, and that decomposition requires this KL direction. Forward KL wouldn’t give the same clean result.

Derivation

Step 1: Closed-form optimal policy

The KL-regularized objective has an analytic solution for any reward $r$ :

$$π^{*}(y ∣ x) = \frac{1}{Z(x)}\, π_{ref}(y ∣ x)\, \exp\!\left(\frac{1}{β} r(x, y)\right)$$

where $Z(x) = \sum_{y} π_{ref}(y ∣ x) \exp\!\left(\frac{r(x, y)}{β}\right)$ is the partition function.
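One way to see where this closed form comes from, sketched for a fixed prompt $x$ (a standard rearrangement written out as a reading aid, not quoted from the paper):

$$
\begin{aligned}
E_{y \sim π}[r(x, y)] - β\, D_{KL}[π ∥ π_{ref}]
&= -β\, E_{y \sim π}\left[\log \frac{π(y ∣ x)}{π_{ref}(y ∣ x)} - \frac{1}{β} r(x, y)\right] \\
&= -β\, E_{y \sim π}\left[\log \frac{π(y ∣ x)}{\frac{1}{Z(x)}\, π_{ref}(y ∣ x) \exp\left(\frac{1}{β} r(x, y)\right)}\right] + β \log Z(x) \\
&= -β\, D_{KL}[π ∥ π^{*}] + β \log Z(x)
\end{aligned}
$$

Since $β \log Z(x)$ does not depend on $π$, and the KL term is non-negative and zero only at $π = π^{*}$, the maximizer is $π^{*}$ and the maximum value is $β \log Z(x)$.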

$Z (x)$ is not just expensive — it's intractable

$Z (x)$ sums over all possible token sequences of all lengths — a combinatorially infinite space. This is the same fundamental intractability as in energy-based models and undirected graphical models. It cannot be evaluated, period. This is why the cancellation in Step 3 is essential, not merely convenient.
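On a toy response space small enough to enumerate, the closed form can be checked directly. This is a minimal sketch: the three-response "vocabulary", the numbers in `pi_ref` and `reward`, and the helper names are all made up for illustration. With real token sequences the sum inside `optimal_policy` is exactly the part that cannot be computed.

```python
import math

def kl_objective(pi, pi_ref, reward, beta):
    """E_pi[r] - beta * KL(pi || pi_ref) over a finite response set."""
    expected_reward = sum(p * r for p, r in zip(pi, reward))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    return expected_reward - beta * kl

def optimal_policy(pi_ref, reward, beta):
    """pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z, with Z summed by brute force."""
    weights = [q * math.exp(r / beta) for q, r in zip(pi_ref, reward)]
    z = sum(weights)
    return [w / z for w in weights], z

# Hypothetical toy problem: three candidate responses to a single prompt.
pi_ref = [0.5, 0.3, 0.2]
reward = [1.0, 0.0, 2.0]
beta = 0.5

pi_star, z = optimal_policy(pi_ref, reward, beta)
best = kl_objective(pi_star, pi_ref, reward, beta)

# The optimum value equals beta * log Z, and other policies score no higher.
print(best, beta * math.log(z))
print(kl_objective(pi_ref, pi_ref, reward, beta) <= best)
print(kl_objective([0.1, 0.1, 0.8], pi_ref, reward, beta) <= best)
```

Lowering `beta` pushes `pi_star` toward the highest-reward response; raising it pulls `pi_star` back toward `pi_ref`.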

Step 2: Invert — express reward in terms of policy

Instead of reward → policy, flip it. Take logs of the optimal policy equation and rearrange for $r$ :

$$r(x, y) = β \log \frac{π^{*}(y ∣ x)}{π_{ref}(y ∣ x)} + β \log Z(x)$$

The reward is a log-ratio of optimal policy to reference, plus a ==term that depends only on $x$, not $y$==.

Step 3: Plug into Bradley-Terry preference model

Human preferences are modeled as:

$$p(y_{w} \succ y_{l} ∣ x) = σ\!\left(r(x, y_{w}) - r(x, y_{l})\right)$$

Substituting the reparameterized reward, the $β \log Z(x)$ terms cancel (same $x$, same prompt):

$$p(y_{w} \succ y_{l} ∣ x) = σ\!\left(β \log \frac{π^{*}(y_{w} ∣ x)}{π_{ref}(y_{w} ∣ x)} - β \log \frac{π^{*}(y_{l} ∣ x)}{π_{ref}(y_{l} ∣ x)}\right)$$
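The cancellation can be checked numerically on an enumerable toy space. A minimal sketch, with made-up numbers and hypothetical helper names: the preference probability computed from the true reward matches the one computed from policy log-ratios alone, even though each implicit reward is off by the constant $β \log Z(x)$.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical toy problem: three candidate responses to a single prompt.
pi_ref = [0.5, 0.3, 0.2]
reward = [1.0, 0.0, 2.0]
beta = 0.5

# Optimal policy pi* = pi_ref * exp(r / beta) / Z (Z is computable only because the space is tiny).
weights = [q * math.exp(r / beta) for q, r in zip(pi_ref, reward)]
z = sum(weights)
pi_star = [w / z for w in weights]

def implicit_reward(i):
    """beta * log(pi*(y_i) / pi_ref(y_i)); equals r(y_i) - beta * log Z."""
    return beta * math.log(pi_star[i] / pi_ref[i])

w, l = 2, 0  # indices of the preferred and dispreferred responses
p_from_reward = sigmoid(reward[w] - reward[l])
p_from_policy = sigmoid(implicit_reward(w) - implicit_reward(l))

print(p_from_reward, p_from_policy)                # identical up to rounding
print(reward[w] - implicit_reward(w), beta * math.log(z))  # the offset that cancels
```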

The cancellation depends structurally on pairwise comparisons of completions from the same prompt. Two pieces have to line up:

Same prompt $x$ → same $Z (x)$ → cancels in the difference.
Pairwise (not scalar) → the comparison takes a difference of rewards, which is what eliminates $Z (x)$ . A single-completion scalar-reward objective leaves $Z (x)$ intact and DPO doesn’t apply.

This is why within-prompt comparison data fits the trick so well. Listwise rankings within a prompt work too (ranking models such as Plackett-Luce also depend only on reward differences, so $Z(x)$ still cancels), but cross-prompt comparisons or absolute-score targets do not.

Step 4: MLE with trainable $π_{θ}$

Step 3 gives a statistical model for preference probabilities parameterized by the policy. Replace $π^{*}$ with trainable $π_{θ}$ and do MLE on the preference dataset:

$$L_{DPO}(π_{θ}) = -E_{(x, y_{w}, y_{l})}\left[\log σ\!\left(β \log \frac{π_{θ}(y_{w} ∣ x)}{π_{ref}(y_{w} ∣ x)} - β \log \frac{π_{θ}(y_{l} ∣ x)}{π_{ref}(y_{l} ∣ x)}\right)\right]$$

This is binary cross-entropy. No reward model, no RL rollouts, no PPO.
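As a rough sketch of what the loss looks like in code, assuming the sequence log-probabilities (the sum of token log-probs of each full response under the policy and under the frozen reference) have already been computed. The function names and the tuple layout are my own choices, not an API from the paper or any library:

```python
import math

def log_sigmoid(t):
    """Numerically stable log(sigmoid(t))."""
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))

def dpo_loss(batch, beta):
    """Mean DPO loss over a batch.

    Each item is (policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l),
    where every entry is log pi(y | x) for a full response y.
    """
    total = 0.0
    for pol_w, pol_l, ref_w, ref_l in batch:
        # beta * (log-ratio of the chosen response minus log-ratio of the rejected one)
        margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
        total += -log_sigmoid(margin)
    return total / len(batch)

# Hypothetical numbers: the policy starts equal to the reference, so the margin is 0.
at_init = dpo_loss([(-10.0, -12.0, -10.0, -12.0)], beta=0.1)
# After the policy raises the chosen response's log-prob, the loss drops.
after_update = dpo_loss([(-8.0, -12.0, -10.0, -12.0)], beta=0.1)
print(at_init, after_update)
```

At initialization ($π_{θ} = π_{ref}$) every margin is zero, so the loss starts at $\log 2$ regardless of the data. In a real training setup the same arithmetic would run on tensors, with gradients flowing only through the policy log-probs.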

Why can we substitute $π_{θ}$ for $π^{*}$ ?

This is just MLE, not a policy iteration argument. You have a parameterized family of distributions over preference pairs, and you assume $π^{*}$ lies within (or is well-approximated by) the family $\{π_{θ}\}$. MLE on a well-specified model recovers the true parameters in the large-sample limit. The sophistication was entirely in showing preferences can be written as policy log-ratios; the optimization step is standard.

DPO in the discrete-sampling taxonomy

Step back. The objective DPO and PPO both face is:

$$\max_{θ} \; E_{y \sim π_{θ}(\cdot ∣ x)} [r(x, y)]$$

(with KL regularization in both cases). The gradient $\nabla_{θ} E_{y \sim π_{θ}} [r(x, y)]$ has the same structure as the VAE gradient $\nabla_{ϕ} E_{q_{ϕ}} [f(z)]$: the distribution being sampled from depends on the parameters being differentiated. Here $y$ is a discrete token sequence, so the reparameterization trick doesn’t apply. Two strategies exist:

Fight through it. Use the score function estimator $\nabla_{θ} E[r] = E[r \nabla_{θ} \log π_{θ}]$. PPO is the canonical example: REINFORCE-style gradients, with importance ratios, clipping, and a baseline to keep variance and step size under control. It plays the same role that Gumbel-softmax and straight-through estimators play for discrete VAE latents (as in VQ-VAE), although those are relaxed or biased pathwise estimators rather than score-function ones: a specific trick to make an otherwise unworkable gradient usable.

Avoid it. Reformulate so sampling never appears in the loss. DPO is the cleanest example — the algebraic chain (closed-form optimum → reward reparameterization → BT cancellation) replaces the entire expectation $E_{y \sim π_{θ}} [\cdot]$ with log-probabilities of pre-collected completions $(y_{w}, y_{l})$ . KTO, IPO, and SimPO follow the same template with different preference assumptions.

So DPO is not “RLHF without the RL part” in some superficial sense — it’s a fundamentally different strategy for the same underlying problem. PPO computes a noisy estimator of an intractable gradient; DPO algebraically transforms the problem into one where no such gradient appears.
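To make the "fight through it" route concrete, here is a small check of the score function identity for a softmax policy over three outcomes. It is a sketch under simplifying assumptions: the expectation is enumerated exactly rather than sampled (PPO-style methods sample, which is where the variance comes from), and the logits, rewards, and function names are invented for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def expected_reward(logits, reward):
    return sum(p * r for p, r in zip(softmax(logits), reward))

def score_function_gradient(logits, reward):
    """Exact E_y[r(y) * d/dtheta log pi(y)], by enumerating y.

    For a softmax policy, d log pi(y) / d theta_k = 1[y == k] - pi(k).
    """
    pi = softmax(logits)
    grad = [0.0] * len(logits)
    for y, (p_y, r_y) in enumerate(zip(pi, reward)):
        for k in range(len(logits)):
            score = (1.0 if y == k else 0.0) - pi[k]
            grad[k] += p_y * r_y * score
    return grad

def finite_difference_gradient(logits, reward, eps=1e-6):
    """Central-difference gradient of E[r], used only as a reference value."""
    grad = []
    for k in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[k] += eps
        down[k] -= eps
        diff = expected_reward(up, reward) - expected_reward(down, reward)
        grad.append(diff / (2 * eps))
    return grad

logits = [0.2, -0.4, 1.0]   # hypothetical policy parameters
reward = [1.0, 0.0, 2.0]    # hypothetical rewards per outcome
sf = score_function_gradient(logits, reward)
fd = finite_difference_gradient(logits, reward)
print(sf)
print(fd)
```

The DPO loss never needs this quantity: its gradient flows through $\log π_{θ}$ of fixed, pre-collected responses, so ordinary backpropagation suffices.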

Key Intuitions

The BT model connection is central — DPO takes it seriously as a latent variable model where the latent is the policy itself
RLHF trained a separate reward model as an intermediate (BT regression with a scalar head), then ran PPO against it. DPO collapses both stages by exploiting the closed-form KL-constrained solution
$π_{θ}$ implicitly represents the reward through its log-ratio with $π_{ref}$

The General Pattern

The DPO trick generalizes wherever you see:

A latent quantity (reward, value, energy) you don’t want to model explicitly
A KL-regularized objective with a closed-form optimum of the form $π_{ref} \cdot \exp(something / β)$
Pairwise comparisons that let the intractable normalizer cancel

Broader applicability

Any objective of the form "expected score minus $β$ times KL to a reference" has a closed-form optimum that looks like $π_{ref} \cdot \exp(\cdot)$, an exponential tilting of the reference. So wherever you see KL-regularized optimization + pairwise comparisons, DPO-style reasoning likely applies.

The same exponential-family algebra appears in the EM E-step (see Variational inference): the unconstrained $q$ that maximizes the ELBO is $q^{*}(z) = p(z ∣ x) \propto p(x ∣ z) p(z)$, the same closed-form structure under a different surface application. KL-regularized reward maximization is essentially variational inference with $π_{ref}$ playing the role of the prior and $r / β$ playing the role of $\log p(x ∣ z)$. The intractable $Z(x)$ in DPO is the intractable evidence $p(x)$ in VI under another name.
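Laid side by side, the correspondence reads as follows. This is a sketch: dividing the RLHF objective by $β$ is a normalization chosen here to make the two rows line up, and note that $x$ is the observation in the VI row but the prompt in the RLHF row.

$$
\begin{aligned}
\text{VI:} \quad & \max_{q}\; E_{z \sim q}[\log p(x ∣ z)] - D_{KL}[q(z) ∥ p(z)], & q^{*}(z) &= \frac{p(z)\, p(x ∣ z)}{p(x)} \\
\text{RLHF:} \quad & \max_{π}\; E_{y \sim π}\left[\tfrac{1}{β} r(x, y)\right] - D_{KL}[π ∥ π_{ref}], & π^{*}(y ∣ x) &= \frac{π_{ref}(y ∣ x)\, \exp\left(\tfrac{1}{β} r(x, y)\right)}{Z(x)}
\end{aligned}
$$

In both rows the normalizer ($p(x)$ or $Z(x)$) is the sum of the numerator over the whole latent or response space, which is what makes it intractable.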

Yanda's Random Notes
