These notes actually come from reading the Preliminaries section of the DPO paper, not the more commonly cited InstructGPT paper.


The following is from my conversation with Claude Sonnet 4.6, with my modifications.


RLHF setup

In the second phase the SFT model is prompted with prompts $x$ to produce pairs of answers $(y_1, y_2) \sim \pi^{\text{SFT}}(y \mid x)$. These are then presented to human labelers who express preferences for one answer, denoted as $y_w \succ y_l \mid x$ where $y_w$ and $y_l$ denotes the preferred and dispreferred completion amongst $(y_1, y_2)$ respectively. The preferences are assumed to be generated by some latent reward model $r^*(x, y)$, which we do not have access to.

dpo, page 3

So basically we go from pairwise preferences to a scalar reward to learn, assuming the preferences were generated by a “reward model”, and then we optimize our model to maximize that reward in some way.

Bradley-Terry Loss

Step 1: what signal do humans actually provide?

Human labels are relative (“$y_w$ is better than $y_l$”), not absolute scores. So we want to model a pairwise preference probability, not regress to absolute values. A pointwise scheme (label 1 for $y_w$, label 0 for $y_l$) would work but forces the model to hit specific absolute values, which isn’t what the data says.

Step 2: turn the reward difference into a probability

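We want $P(y_w \succ y_l \mid x)$ as a function of $r(x, y_w)$ and $r(x, y_l)$. The natural choice: take the difference $r(x, y_w) - r(x, y_l)$ and squash it into $(0, 1)$ with a sigmoid. This is the Bradley-Terry model:

$$P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big) = \frac{\exp(r(x, y_w))}{\exp(r(x, y_w)) + \exp(r(x, y_l))}$$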

The sigmoid does exactly one job: reduce the difference to a probability. A consequence is that the absolute level becomes irrelevant: writing $r_w = r(x, y_w)$ and $r_l = r(x, y_l)$, the pairs $(r_w, r_l)$ and $(r_w + c, r_l + c)$ give the same difference, same probability, same loss for any constant $c$. Only the gap matters, which is faithful to what humans actually told us.
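A minimal sketch of the shift invariance, assuming scalar rewards are already available. The function names and the reward values are made up for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bt_preference_prob(r_w, r_l):
    """P(y_w preferred over y_l) under Bradley-Terry, given two scalar rewards."""
    return sigmoid(r_w - r_l)


# Hypothetical rewards: both pairs have a gap of 2.0 but very different levels.
p_small = bt_preference_prob(2.0, 0.0)
p_large = bt_preference_prob(102.0, 100.0)
print(p_small, p_large)
```

Both calls see the same difference, so they return the same probability. Equal rewards give exactly 0.5.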

Step 3: MLE on this probability

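Maximize the likelihood of the observed preferences over the dataset $\mathcal{D}$ for a parametrized reward model $r_\phi$ → take log → negate:

$$\mathcal{L}_R(r_\phi, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]$$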
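A minimal sketch of the resulting loss, i.e. the mean of $-\log\sigma(r_w - r_l)$ over pairs. It assumes the reward model has already scored each completion, so a "dataset" here is just a list of hypothetical (chosen, rejected) reward pairs.

```python
import math


def log_sigmoid(z):
    """Stable log(sigmoid(z)), avoiding overflow for large |z|."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def bt_loss(reward_pairs):
    """Mean negative log-likelihood over (r_chosen, r_rejected) pairs."""
    total = sum(log_sigmoid(r_w - r_l) for r_w, r_l in reward_pairs)
    return -total / len(reward_pairs)


# Hypothetical reward-model outputs.
pairs_good = [(2.0, 0.0), (1.5, -0.5)]   # chosen scored higher: low loss
pairs_bad = [(0.0, 2.0), (-0.5, 1.5)]    # chosen scored lower: high loss
print(bt_loss(pairs_good), bt_loss(pairs_bad))
```

A reward model that cannot tell the two completions apart (gap 0) sits at a loss of $\log 2$; anything below that means it agrees with the labelers more often than not.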

Relation to logistic regression

With $z = r_\phi(x, y_w) - r_\phi(x, y_l)$ this is $-\log \sigma(z)$, identical to logistic regression on a positive example. But unlike standard logistic regression, there are no explicit negative labels: the negative signal is implicit in the pair. When you train on $(y_w, y_l)$, you simultaneously push $r_\phi(x, y_w)$ up and $r_\phi(x, y_l)$ down relative to each other. Each pair provides one positive and one negative signal jointly, through the single scalar difference.
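To make the "joint push" concrete, here is a small sketch of the gradients of $-\log\sigma(r_w - r_l)$ with respect to the two rewards, worked out by hand. The function names are illustrative only.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bt_pair_loss(r_w, r_l):
    """Loss for one preference pair: -log sigmoid(r_w - r_l)."""
    return -math.log(sigmoid(r_w - r_l))


def bt_pair_grads(r_w, r_l):
    """Analytic gradients of the pair loss w.r.t. r_w and r_l."""
    g = -(1.0 - sigmoid(r_w - r_l))  # d loss / d z, with z = r_w - r_l
    return g, -g                     # (d loss / d r_w, d loss / d r_l)


grad_w, grad_l = bt_pair_grads(0.3, 0.1)
print(grad_w, grad_l)
```

The two gradients are equal and opposite: a gradient-descent step raises the chosen reward and lowers the rejected one by the same amount, and the push shrinks as the gap grows.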

Common ancestor

BT, logistic regression, and cross-entropy are all siblings: children of Bernoulli MLE. The sigmoid appears in all of them as the inverse of the canonical (logit) link, mapping $\mathbb{R} \to (0, 1)$, in the exponential family sense.


RL Fine-Tuning Objective

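The objective is to maximize the learned reward while staying close to the reference policy:

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\big[r_\phi(x, y)\big] - \beta \, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x)\big]$$

In practice the KL term is folded into the reward as:

$$r(x, y) = r_\phi(x, y) - \beta \big(\log \pi_\theta(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x)\big)$$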
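A minimal sketch of folding the KL penalty into the reward, assuming the sequence-level form $r_\phi(x, y) - \beta(\log \pi_\theta(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x))$ and that per-token log-probs from both models are already available. All names and numbers are hypothetical.

```python
def sequence_logprob(token_logprobs):
    """Sum per-token log-probs into a sequence-level log-prob."""
    return sum(token_logprobs)


def kl_shaped_reward(rm_score, policy_token_logprobs, ref_token_logprobs, beta):
    """Reward-model score minus beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    log_ratio = (sequence_logprob(policy_token_logprobs)
                 - sequence_logprob(ref_token_logprobs))
    return rm_score - beta * log_ratio


# Hypothetical sample: the policy is more confident than the reference,
# so the log-ratio is positive and the shaped reward drops below the raw score.
policy_lp = [-0.2, -0.1, -0.3]
ref_lp = [-0.5, -0.4, -0.6]
shaped = kl_shaped_reward(1.0, policy_lp, ref_lp, beta=0.1)
print(shaped)
```

Averaged over samples from the policy, the log-ratio is a Monte Carlo estimate of the KL term, which is why the penalty can be applied per sample.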

Two different KLs.

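| | PPO/TRPO KL | This KL |
|---|---|---|
| Between | old policy $\pi_{\theta_{\text{old}}}$ vs new policy $\pi_\theta$ | $\pi_\theta$ vs fixed $\pi_{\mathrm{ref}}$ |
| Purpose | Trust region: stable optimization steps | Semantic anchor: prevent mode collapse, stay in-distribution for reward model |
| Lifetime | Resets each update | Persistent throughout all training |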

The entropy bonus in PPO ($H[\pi_\theta]$) is also related but weaker: it just says “be spread out.” The KL to $\pi_{\mathrm{ref}}$ says “be spread out in the same way the reference model is”: it anchors where the mass goes, not just that it’s distributed.

Is $\beta$ static?

Yes, typically a fixed hyperparameter — and suspicious for good reason:

  • Early training: policy ≈ ref, KL ≈ 0, so the $\beta$ term barely matters
  • Late training: policy has drifted, so the KL term dominates

Some works anneal or tune it adaptively, but static is the norm. DPO sidesteps this by baking $\beta$ into the closed-form solution rather than driving a live RL loop.
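A rough sketch of what "tune it adaptively" could look like: a proportional controller that nudges $\beta$ up when the measured KL overshoots a target and down when it undershoots. The function name, target, gain and clip range are all illustrative assumptions, not taken from a specific implementation.

```python
def update_beta(beta, observed_kl, target_kl, gain=0.1, clip=0.2):
    """One controller step: scale beta by the clipped relative KL error."""
    error = observed_kl / target_kl - 1.0
    error = max(-clip, min(clip, error))  # limit how fast beta can move
    return beta * (1.0 + gain * error)


# Hypothetical run: the policy drifts too far, then comes back under target.
beta = 0.1
for observed in [12.0, 9.0, 6.0, 3.0]:
    beta = update_beta(beta, observed, target_kl=6.0)
    print(observed, beta)
```

The clip keeps one noisy KL estimate from swinging $\beta$ too hard in either direction.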


Why Not Just Backprop? The Discrete Sampling Problem

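The objective is $\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}[r(x, y)]$. The only $\theta$-dependence is in the sampling $y \sim \pi_\theta(\cdot \mid x)$, but $y$ is a discrete sample: once you commit to a token sequence, there’s no gradient flowing back through that choice.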

The Reparameterization trick (VAE analogy)

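In a VAE, the encoder outputs $(\mu, \sigma)$ and you need to sample $z \sim \mathcal{N}(\mu, \sigma^2)$:

  • ❌ Naive: $z \sim \mathcal{N}(\mu, \sigma^2)$, not differentiable
  • ✅ Reparametrized: $z = \mu + \sigma \odot \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$, differentiable, randomness pushed into parameter-free $\epsilon$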
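A small sketch of why the reparametrized form is differentiable, using a toy objective $\mathbb{E}[z^2]$ in one dimension (my choice, because its true gradients $2\mu$ and $2\sigma$ are known). With $z = \mu + \sigma\epsilon$, the chain rule goes straight through $z$; the sample count and seed are arbitrary.

```python
import random


def pathwise_grads(mu, sigma, n_samples=100_000, seed=0):
    """Monte Carlo estimate of d/dmu and d/dsigma of E[z^2], z = mu + sigma * eps."""
    rng = random.Random(seed)
    g_mu = 0.0
    g_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)  # parameter-free noise
        z = mu + sigma * eps       # deterministic in (mu, sigma) given eps
        df_dz = 2.0 * z            # derivative of f(z) = z^2
        g_mu += df_dz * 1.0        # dz/dmu = 1
        g_sigma += df_dz * eps     # dz/dsigma = eps
    return g_mu / n_samples, g_sigma / n_samples


grad_mu, grad_sigma = pathwise_grads(mu=1.5, sigma=0.5)
print(grad_mu, grad_sigma)  # should land near the analytic values 3.0 and 1.0
```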

For discrete tokens there’s no equivalent — you can’t write “token 42” as a differentiable function of logits.

The differentiable generator case: LPIPS

In image generation, the analogous problem is solved trivially — the VAE decoder is differentiable, so you can backprop a learned perceptual reward directly into the generator. LPIPS (Zhang et al. 2018) does exactly this: freeze a pretrained VGG, learn only a tiny linear weighting over its feature layers from human perceptual judgments, use it as a loss. No RL needed. The discrete token sampling problem is precisely what makes RLHF complicated where LPIPS is simple. See LDM.

The MoE / Gumbel-Softmax connection

The discrete version of the reparametrization trick (Gumbel-Softmax) is the same idea as the original MoE routing trick: replace a hard discrete choice with a soft weighted sum. The forward pass can still be hard (argmax, for efficiency) while the backward pass uses the soft (softmax) version, i.e. the straight-through estimator.

For LLM token generation you could apply this: instead of sampling one token, take a weighted sum over all token embeddings weighted by $\pi_\theta(\cdot \mid x, y_{<t})$. But this breaks down because:

  1. The blended embedding is out-of-distribution — the model was trained on one-hot inputs
  2. Errors compound across steps — by step 5 you’re on a completely different manifold
  3. The reward model was trained on real text, not blended pseudo-sequences
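A toy sketch of the first point: the probability-weighted blend of embeddings is a vector that matches no row of the embedding table, so the next layer sees an input it never saw in training. The 3-token, 2-dimensional table and the probabilities are made up.

```python
def hard_embed(token_id, table):
    """Normal lookup: one sampled token, one row of the table."""
    return table[token_id]


def soft_embed(probs, table):
    """Blend all rows, weighted by the policy's token probabilities."""
    dim = len(table[0])
    return [sum(p * row[d] for p, row in zip(probs, table)) for d in range(dim)]


# Hypothetical embedding table: 3 tokens x 2 dimensions.
embedding = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
probs = [0.5, 0.3, 0.2]

blended = soft_embed(probs, embedding)
print(blended)                     # a point between the rows, not a real token
print(hard_embed(0, embedding))    # what the model was actually trained on
```

Only a one-hot `probs` reproduces a real token embedding; anything softer lands off the table.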

The key distinction vs images

Mixup works in image classification because it’s a single-step operation — there’s no compounding. And the label is also mixed consistently. For autoregressive generation, what’s the “mixed target” for a mixed embedding? The supervision is only defined on real completed sequences.

Why Policy Gradient works

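The log-derivative trick sidesteps differentiating through the sample entirely:

$$\nabla_\theta \, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y) \, \nabla_\theta \log \pi_\theta(y \mid x)\big]$$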

Roll out real tokens → get real reward → weight the log-prob gradient by reward. No backprop through sampling needed. The critic in actor-critic is just variance reduction on this (advantage $A = r(x, y) - V(x)$ instead of raw $r(x, y)$).
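A minimal REINFORCE sketch on a toy problem, to show the mechanics: a softmax policy over three "tokens", a fixed reward per token, and a running-mean baseline standing in for the critic. The rewards, learning rate, step count and baseline rate are all assumptions for illustration; a real setup would roll out whole sequences from an LLM.

```python
import math
import random


def softmax(logits):
    """Stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce(rewards, steps=2000, lr=0.1, seed=0):
    """Train a softmax policy over len(rewards) tokens with REINFORCE."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    baseline = 0.0  # running mean reward, a crude stand-in for a critic
    for _ in range(steps):
        probs = softmax(logits)
        # Sample a real discrete token; no gradient flows through this line.
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        advantage = reward - baseline
        # grad of log pi(action) w.r.t. the logits is one_hot(action) - probs.
        for i in range(len(logits)):
            grad_logp = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad_logp
        baseline += 0.05 * (reward - baseline)
    return softmax(logits)


# Hypothetical rewards: token 1 is the one the "reward model" likes most.
final_probs = reinforce([0.0, 1.0, 0.2])
print(final_probs)
```

The update only ever needs the reward as a number and the gradient of the log-prob of the token that was actually sampled, which is exactly what the discrete setting allows.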