These notes come from my reading of the Preliminaries section of the DPO paper, not the more commonly cited InstructGPT paper.
The following is from my conversation with Claude Sonnet 4.6, with my modifications.
RLHF setup
In the second phase the SFT model is prompted with prompts $x$ to produce pairs of answers $(y_1, y_2) \sim \pi^{\text{SFT}}(y \mid x)$. These are then presented to human labelers who express preferences for one answer, denoted as $y_w \succ y_l \mid x$ where $y_w$ and $y_l$ denotes the preferred and dispreferred completion amongst $(y_1, y_2)$ respectively. The preferences are assumed to be generated by some latent reward model $r^*(y, x)$, which we do not have access to.
In other words, we go from pairwise preferences to a scalar reward: we assume the preferences were generated by a latent “reward model”, fit a model of that reward, and then optimize the policy to maximize it.
Bradley-Terry Loss
Step 1: what signal do humans actually provide?
Human labels are relative (“$y_w$ is better than $y_l$”), not absolute scores. So we want to model a pairwise preference probability, not regress to absolute values. A pointwise scheme (label 1 for $y_w$, label 0 for $y_l$) could be trained, but it forces the model to hit specific absolute values, which isn’t what the data says.
Step 2: turn the reward difference into a probability
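We want $p(y_w \succ y_l \mid x)$ as a function of $r(x, y_w)$ and $r(x, y_l)$. The natural choice: take the difference $r(x, y_w) - r(x, y_l)$ and squash it into $(0, 1)$ with a sigmoid. This is the Bradley-Terry model:
$$p^*(y_w \succ y_l \mid x) = \sigma\big(r^*(x, y_w) - r^*(x, y_l)\big) = \frac{\exp(r^*(x, y_w))}{\exp(r^*(x, y_w)) + \exp(r^*(x, y_l))}$$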
The sigmoid does exactly one job: reduce the difference to a probability. A consequence is that the absolute level of the rewards becomes irrelevant: $(r_w, r_l)$ and $(r_w + c, r_l + c)$ give the same difference, same probability, same loss for any constant $c$. Only the gap matters, which is faithful to what humans actually told us.
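A minimal sketch of the shift invariance described above, using only the standard library. The function names and the reward values are hypothetical, chosen just for illustration; the Bradley-Terry probability is taken to be the sigmoid of the reward gap.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def bt_prob(r_w: float, r_l: float) -> float:
    """Bradley-Terry probability that the completion scored r_w beats the one scored r_l."""
    return sigmoid(r_w - r_l)

# Hypothetical reward values: same gap, very different absolute level.
p_small = bt_prob(2.0, 1.0)
p_shifted = bt_prob(102.0, 101.0)

print(p_small, p_shifted)  # the two probabilities coincide
```

Because only `r_w - r_l` enters the sigmoid, a reward model trained this way is identified only up to an additive constant per prompt.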
Step 3: MLE on this probability
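Maximize the likelihood of the observed preferences over the dataset $\mathcal{D}$ → take log → negate:
$$\mathcal{L}_R(r_\phi, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\big]$$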
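As a rough sketch of that negative log-likelihood, assuming the reward model has already produced a scalar score for each completion in a pair. `bt_loss` and the example scores are hypothetical; a real implementation would compute this on batched tensors with autograd.

```python
import math

def log_sigmoid(z: float) -> float:
    """Stable log(sigmoid(z)), avoiding overflow for large |z|."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def bt_loss(reward_pairs) -> float:
    """Mean negative log-likelihood over (r_w, r_l) pairs, preferred score first."""
    total = sum(log_sigmoid(r_w - r_l) for r_w, r_l in reward_pairs)
    return -total / len(reward_pairs)

# Hypothetical reward-model scores for three labeled pairs.
pairs = [(1.5, 0.5), (0.2, 0.9), (3.0, -1.0)]
loss = bt_loss(pairs)
print(loss)
```

A pair with zero gap contributes `log 2`, which is the loss of a model that has no opinion; correctly ordered pairs with a large gap contribute almost nothing.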
Relation to logistic regression
With $z = r_\phi(x, y_w) - r_\phi(x, y_l)$ this is $-\log \sigma(z)$, identical to the logistic regression loss on a positive example. But unlike standard logistic regression, there are no explicit negative labels: the negative signal is implicit in the pair. When you train on $(y_w, y_l)$, you simultaneously push $r_\phi(x, y_w)$ up and $r_\phi(x, y_l)$ down relative to each other. Each pair provides one positive and one negative signal jointly, through the single scalar difference.
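To make the "joint positive and negative signal" concrete, here is the per-pair gradient worked out, writing $r_w$ and $r_l$ as shorthand (my notation) for the two reward-model outputs, $z = r_w - r_l$ for the gap, and $\ell = -\log \sigma(z)$ for the per-pair loss:

$$\frac{\partial \ell}{\partial z} = -\big(1 - \sigma(z)\big) = -\sigma(-z), \qquad \frac{\partial \ell}{\partial r_w} = -\sigma(-z), \qquad \frac{\partial \ell}{\partial r_l} = +\sigma(-z)$$

The two gradients are equal and opposite, so a gradient-descent step raises $r_w$ and lowers $r_l$ by the same amount, and the push vanishes as the pair becomes confidently ordered ($z \to \infty$).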
Common ancestor
BT, logistic regression, and binary cross-entropy are all siblings, children of Bernoulli MLE. The sigmoid appears in all of them as the inverse of the canonical (logit) link, mapping $\mathbb{R} \to (0, 1)$ in the exponential family sense.
RL Fine-Tuning Objective
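The learned reward $r_\phi$ is then used to optimize the policy under a KL penalty toward the reference model:
$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\big[r_\phi(x, y)\big] - \beta\, \mathbb{D}_{\text{KL}}\big[\pi_\theta(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x)\big]$$
In practice the KL term is folded into the reward as:
$$r(x, y) = r_\phi(x, y) - \beta\big(\log \pi_\theta(y \mid x) - \log \pi_{\text{ref}}(y \mid x)\big)$$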
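A small sketch of the folding, assuming the shaped reward is the reward-model score minus $\beta$ times the log-ratio of policy to reference on the sampled completion (the form given in the DPO paper's preliminaries). The function name and all numbers are hypothetical.

```python
def kl_shaped_reward(r_phi: float, logp_policy: float, logp_ref: float, beta: float) -> float:
    """Reward-model score minus beta times the sampled log-ratio log(pi_theta / pi_ref)."""
    return r_phi - beta * (logp_policy - logp_ref)

# Hypothetical values: the policy assigns this completion more log-probability
# than the reference does, so the shaped reward is lower than the raw score.
shaped = kl_shaped_reward(r_phi=1.0, logp_policy=-12.0, logp_ref=-15.0, beta=0.1)
print(shaped)  # approximately 0.7
```

Averaged over completions sampled from the policy, the log-ratio term is exactly the KL divergence, which is why the penalty can live inside the reward.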
Two different KLs.
| | PPO/TRPO KL | This KL |
|---|---|---|
| Between | old policy vs new policy | $\pi_\theta$ vs fixed $\pi_{\text{ref}}$ |
| Purpose | Trust region — stable optimization steps | Semantic anchor — prevent mode collapse, stay in-distribution for reward model |
| Lifetime | Resets each update | Persistent throughout all training |
The entropy bonus in PPO ($\mathcal{H}[\pi_\theta]$) is also related but weaker: it just says “be spread out.” The KL to $\pi_{\text{ref}}$ says “be spread out in the same way the reference model is”: it anchors where the mass goes, not just that it’s distributed.
Is $\beta$ static?
Yes, typically a fixed hyperparameter, and there is good reason to be suspicious of that:
- Early training: policy ≈ ref, KL ≈ 0, $\beta$ barely matters
- Late training: policy has drifted, the KL penalty dominates
Some works anneal or tune it adaptively, but static is the norm. DPO sidesteps this by baking $\beta$ into the closed-form solution rather than driving a live RL loop.
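For the adaptive case, one option is a proportional controller that nudges the KL coefficient toward a target KL. This is only a sketch of that general idea, not a specific paper's or library's implementation; `update_beta`, the gain, and the clipping range are all assumptions.

```python
def update_beta(beta: float, observed_kl: float, target_kl: float,
                gain: float = 0.1, clip: float = 0.2) -> float:
    """Raise beta when the measured KL overshoots the target, lower it when it undershoots."""
    error = observed_kl / target_kl - 1.0
    error = max(-clip, min(clip, error))  # clip so one noisy batch cannot swing beta wildly
    return beta * (1.0 + gain * error)

beta = 0.1
beta_after_overshoot = update_beta(beta, observed_kl=12.0, target_kl=6.0)
beta_after_undershoot = update_beta(beta, observed_kl=3.0, target_kl=6.0)
print(beta_after_overshoot, beta_after_undershoot)
```

The multiplicative update keeps the coefficient positive, and the clipping bounds how fast it can move per step.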
Why Not Just Backprop? The Discrete Sampling Problem
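The reward term of the objective is $\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}[r_\phi(x, y)]$. The only $\theta$-dependence is in the sampling $y \sim \pi_\theta$, but $y$ is a discrete sample: once you commit to a token sequence, there’s no gradient flowing back through that choice.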
The Reparameterization trick (VAE analogy)
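In a VAE, the encoder outputs $(\mu, \sigma)$ and you need to sample $z$:
- ❌ Naive: $z \sim \mathcal{N}(\mu, \sigma^2)$, not differentiable
- ✅ Reparametrized: $z = \mu + \sigma \odot \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$, which is differentiable, with the randomness pushed into the parameter-free $\epsilon$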
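A tiny numeric check of the reparametrized form, assuming the usual Gaussian case where the sample is `mu + sigma * eps` with `eps` drawn from a standard normal. Once `eps` is fixed, the sample is an ordinary function of the parameters, so finite differences recover its derivatives. The scalar setup and names are illustrative.

```python
import random

def sample_z(mu: float, sigma: float, eps: float) -> float:
    """Reparametrized sample: deterministic in (mu, sigma) once eps is fixed."""
    return mu + sigma * eps

random.seed(0)
eps = random.gauss(0.0, 1.0)  # parameter-free noise, drawn once
mu, sigma, h = 0.5, 2.0, 1e-6

# Central finite differences of the sample with respect to each parameter.
dz_dmu = (sample_z(mu + h, sigma, eps) - sample_z(mu - h, sigma, eps)) / (2 * h)
dz_dsigma = (sample_z(mu, sigma + h, eps) - sample_z(mu, sigma - h, eps)) / (2 * h)

print(dz_dmu, dz_dsigma, eps)  # dz/dmu is 1 and dz/dsigma equals eps, up to rounding
```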
For discrete tokens there’s no equivalent — you can’t write “token 42” as a differentiable function of logits.
The differentiable generator case: LPIPS
In image generation, the analogous problem is solved trivially — the VAE decoder is differentiable, so you can backprop a learned perceptual reward directly into the generator. LPIPS (Zhang et al. 2018) does exactly this: freeze a pretrained VGG, learn only a tiny linear weighting over its feature layers from human perceptual judgments, use it as a loss. No RL needed. The discrete token sampling problem is precisely what makes RLHF complicated where LPIPS is simple. See LDM.
The MoE / Gumbel-Softmax connection
The discrete counterpart of the reparametrization trick (Gumbel-Softmax) is the same idea as the original MoE routing trick: replace a hard discrete choice with a soft weighted sum. The forward pass can still be hard (argmax, for efficiency) while the backward pass uses the soft (softmax) version; this is the straight-through estimator.
For LLM token generation you could apply this: instead of sampling one token, take a weighted sum over all token embeddings weighted by $\pi_\theta(\cdot \mid x, y_{<t})$. But this breaks down because:
- The blended embedding is out-of-distribution — the model was trained on one-hot inputs
- Errors compound across steps — by step 5 you’re on a completely different manifold
- The reward model was trained on real text, not blended pseudo-sequences
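To illustrate the first bullet, here is a toy comparison of a hard token choice with a softmax-weighted blend of embeddings. The 3-token vocabulary, the 2-d embedding table, and the logits are all made up; the point is only that the blended vector is not any row of the table.

```python
import math

def softmax(logits):
    """Softmax with max-subtraction for numerical stability."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical embedding table (one row per token) and next-token logits.
embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
logits = [2.0, 1.0, 0.1]

probs = softmax(logits)
hard_token = max(range(len(probs)), key=probs.__getitem__)  # argmax: a real token id
hard_input = embeddings[hard_token]                          # a row the model has seen
soft_input = [sum(p * e[d] for p, e in zip(probs, embeddings)) for d in range(2)]

print(hard_input, soft_input)
```

Feeding `soft_input` back into the model at every step is what compounds: each subsequent distribution is conditioned on vectors that never occurred during training.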
The key distinction vs images
Mixup works in image classification because it’s a single-step operation — there’s no compounding. And the label is also mixed consistently. For autoregressive generation, what’s the “mixed target” for a mixed embedding? The supervision is only defined on real completed sequences.
Why Policy Gradient works
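The log-derivative trick sidesteps differentiating through the sample entirely:
$$\nabla_\theta \, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x)\big]$$
Roll out real tokens → get real reward → weight the log-prob gradient by reward. No backprop through sampling needed. The critic in actor-critic is just variance reduction on this (advantage $A = r - V$ instead of raw $r$, with $V$ the critic's value estimate).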
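A toy check of the score-function (REINFORCE) estimator, under strong simplifying assumptions: the "policy" is a softmax over three whole completions instead of an autoregressive model, and the reward is a fixed lookup table. All names and numbers are hypothetical. The estimate uses only sampled completions and their log-prob gradients, yet it matches the closed-form gradient of the expected reward.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def exact_grad(logits, rewards):
    """Closed-form gradient of E_{y~pi}[r(y)] with respect to the logits."""
    p = softmax(logits)
    mean_r = sum(pi * ri for pi, ri in zip(p, rewards))
    return [pi * (ri - mean_r) for pi, ri in zip(p, rewards)]

def reinforce_grad(logits, rewards, n_samples, rng, baseline=0.0):
    """Monte Carlo average of (r(y) - baseline) * grad log pi(y)."""
    p = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(n_samples):
        y = rng.choices(range(len(p)), weights=p)[0]  # discrete sample, nothing flows through it
        for j in range(len(logits)):
            # For a softmax policy, d log pi(y) / d logit_j = 1[j == y] - p_j.
            grad[j] += (rewards[y] - baseline) * ((1.0 if j == y else 0.0) - p[j])
    return [g / n_samples for g in grad]

logits = [0.2, -0.1, 0.5]   # hypothetical policy parameters
rewards = [1.0, 0.0, 2.0]   # hypothetical reward per completion
rng = random.Random(0)
estimate = reinforce_grad(logits, rewards, 50_000, rng, baseline=1.0)
exact = exact_grad(logits, rewards)
print(estimate, exact)
```

The constant `baseline` plays the role of the critic here: subtracting it leaves the expected gradient unchanged and only affects the variance of the estimate.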