RLHF

This note actually comes from reading the Preliminaries section of the DPO paper, not the more commonly cited InstructGPT paper.

The following is from my conversation with Claude Sonnet 4.6, with my modifications.

RLHF setup

In the second phase the SFT model is prompted with prompts $x$ to produce pairs of answers $(y_{1}, y_{2}) \sim π^{SFT} (y ∣ x)$ . These are then presented to human labelers who express preferences for one answer, denoted as $y_{w} ≻ y_{l} ∣ x$ where $y_{w}$ and $y_{l}$ denotes the preferred and dispreferred completion amongst $(y_{1}, y_{2})$ respectively. The preferences are assumed to be generated by some latent reward model $r^{*} (y, x)$ , which we do not have access to.

DPO paper, page 3

So basically we go from pairwise preferences to a scalar reward: we assume the preferences were generated by a latent “reward model”, fit a model of that reward, and then optimize our policy to maximize the learned reward, in some way.

Bradley-Terry Loss

Step 1: what signal do humans actually provide?

Human labels are relative (“$y_{w}$ is better than $y_{l}$”), not absolute scores. So we want to model a pairwise preference probability, not regress to absolute values. A pointwise scheme (label 1 for $y_{w}$, label 0 for $y_{l}$) would work but forces the model to hit specific absolute values, which isn’t what the data says.

Step 2: turn the reward difference into a probability

We want $P(y_{w} ≻ y_{l} ∣ x)$ as a function of $r_{ϕ}$. The natural choice: take the difference $r_{w} - r_{l} \in \mathbb{R}$ (shorthand for $r_{ϕ}(x, y_{w}) - r_{ϕ}(x, y_{l})$) and squash it into $(0, 1)$ with a sigmoid. This is the Bradley-Terry model:

$$
P(y_{w} ≻ y_{l} ∣ x) = σ(r_{ϕ}(x, y_{w}) - r_{ϕ}(x, y_{l}))
$$

The sigmoid does exactly one job: reduce the difference to a probability. A consequence is that absolute scale becomes irrelevant — $(100, 99)$ and $(1, 0)$ give the same difference, same probability, same loss. Only the gap matters, which is faithful to what humans actually told us.
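The scale-invariance claim is easy to check numerically. A minimal sketch, using plain floats as stand-ins for reward-model outputs (the function names are my own, not from the paper):

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def bt_preference_prob(r_w: float, r_l: float) -> float:
    """P(y_w preferred over y_l) under Bradley-Terry: sigmoid of the reward gap."""
    return sigmoid(r_w - r_l)

# Same gap of 1.0 at two very different absolute scales.
p_large = bt_preference_prob(100.0, 99.0)
p_small = bt_preference_prob(1.0, 0.0)
# Equal rewards mean the model is indifferent between the two answers.
p_tie = bt_preference_prob(0.3, 0.3)
```

Both `p_large` and `p_small` equal $σ(1) ≈ 0.731$, and `p_tie` is exactly 0.5.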

Step 3: MLE on this probability

Maximize the likelihood of the observed preferences over the dataset → take log → negate:

$$
L_{R}(r_{ϕ}, D) = -\mathbb{E}_{(x, y_{w}, y_{l}) \sim D}[\log σ(r_{ϕ}(x, y_{w}) - r_{ϕ}(x, y_{l}))]
$$
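As a sanity check on this loss, here is a small sketch that averages $-\log σ(r_{w} - r_{l})$ over a batch. The reward values are made up for illustration; in a real setup they would come from $r_{ϕ}$ evaluated on (prompt, completion) pairs.

```python
import math

def log_sigmoid(z: float) -> float:
    """Stable log(sigmoid(z)), avoiding overflow for large |z|."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def bt_loss(reward_pairs):
    """Mean negative log-likelihood over (r_w, r_l) reward pairs."""
    total = sum(log_sigmoid(r_w - r_l) for r_w, r_l in reward_pairs)
    return -total / len(reward_pairs)

# Hypothetical reward-model outputs for three labeled (chosen, rejected) pairs.
pairs = [(2.0, 0.5), (0.1, 0.4), (1.0, 1.0)]
loss = bt_loss(pairs)

# Shifting every reward by a constant leaves the loss unchanged.
shifted_loss = bt_loss([(r_w + 10.0, r_l + 10.0) for r_w, r_l in pairs])
```

A tied pair contributes exactly $\log 2$, the loss of a coin flip, and a mis-ranked pair such as `(0.1, 0.4)` contributes more than that.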

Relation to logistic regression

With $z := r_{w} - r_{l}$ this is $-\log σ(z)$, identical to the logistic regression loss on a positive example. But unlike standard logistic regression, there are no explicit negative labels: the negative signal is implicit in the pair. When you train on $(y_{w}, y_{l})$, you simultaneously push $r_{w}$ up and $r_{l}$ down relative to each other. Each pair provides one positive and one negative signal jointly, through the single scalar difference.
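To make the push-up/push-down claim concrete, here is the gradient of the per-pair loss with respect to the two rewards. It is a short derivation that uses only $σ'(z) = σ(z)(1 - σ(z))$:

$$
\frac{\partial}{\partial r_{w}}\bigl[-\log σ(z)\bigr] = -(1 - σ(z)), \qquad \frac{\partial}{\partial r_{l}}\bigl[-\log σ(z)\bigr] = +(1 - σ(z))
$$

Gradient descent therefore raises $r_{w}$ and lowers $r_{l}$ by the same amount, and that amount shrinks toward zero once the pair is already ranked confidently ($σ(z) \to 1$).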

Common ancestor

BT, logistic regression, and cross-entropy are all siblings: children of Bernoulli MLE. The sigmoid appears in all of them as the inverse of the canonical (logit) link, mapping $\mathbb{R} \to (0, 1)$ in the exponential family sense.

RL Fine-Tuning Objective

$$
\max_{π_{θ}} \mathbb{E}_{x \sim D, y \sim π_{θ}(y ∣ x)}[r_{ϕ}(x, y)] - β D_{KL}[π_{θ}(y ∣ x) \,\|\, π_{ref}(y ∣ x)]
$$

In practice the reward is folded in as:

$$
r(x, y) = r_{ϕ}(x, y) - β(\log π_{θ}(y ∣ x) - \log π_{ref}(y ∣ x))
$$
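The folding works because the KL term is itself an expectation under $π_{θ}$ of the log-ratio, so it can be moved inside the expectation and charged per sample. A minimal sketch, assuming sequence log-probabilities are obtained by summing per-token log-probs (all numbers below are made up):

```python
def sequence_logprob(token_logprobs):
    """Log-probability of a whole completion: sum of its per-token log-probs."""
    return sum(token_logprobs)

def kl_shaped_reward(r_phi: float, logp_theta: float, logp_ref: float, beta: float) -> float:
    """Reward-model score minus beta times the sampled log-ratio log(pi_theta / pi_ref)."""
    return r_phi - beta * (logp_theta - logp_ref)

# Hypothetical per-token log-probs for one sampled completion under each model.
logp_theta = sequence_logprob([-1.0, -2.0, -2.0])   # policy:    -5.0
logp_ref = sequence_logprob([-2.0, -2.5, -2.5])     # reference: -7.0

# The policy likes this completion more than the reference does, so it pays a penalty.
shaped = kl_shaped_reward(r_phi=1.0, logp_theta=logp_theta, logp_ref=logp_ref, beta=0.1)

# If the policy and the reference agree, the reward passes through unchanged.
unshaped = kl_shaped_reward(r_phi=1.0, logp_theta=-7.0, logp_ref=-7.0, beta=0.1)
```

Here `shaped` is $1.0 - 0.1 \cdot 2.0 = 0.8$. Some implementations apply the penalty per token rather than per sequence; this sketch uses the sequence-level form to match the formula above.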

Two different KLs.

| | PPO/TRPO KL | This KL |
|---|---|---|
| Between | old policy vs new policy | $π_{θ}$ vs fixed $π_{ref}$ |
| Purpose | Trust region: stable optimization steps | Semantic anchor: prevent mode collapse, stay in-distribution for the reward model |
| Lifetime | Resets each update | Persistent throughout all training |

The entropy bonus in PPO ( $H (π_{θ})$ ) is also related but weaker: it just says “be spread out.” The KL to $π_{ref}$ says “be spread out in the same way the reference model is” — it anchors where the mass goes, not just that it’s distributed.

Is $β$ static?

Yes, it is typically a fixed hyperparameter, and there is good reason to be suspicious of that, because the penalty's influence changes over the course of training:

Early training: policy ≈ ref, KL ≈ 0, $β$ barely matters
Late training: policy has drifted, KL dominates

Some works anneal or tune it adaptively, but static $β$ is the norm. DPO sidesteps this by baking $β$ into the closed-form solution rather than driving a live RL loop.
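For a feel of what "tune it adaptively" can look like, here is a sketch of a simple proportional controller that nudges $β$ toward a target KL. This is one possible scheme, not something described in the DPO paper; the function name, `gain`, `clip`, and the target value are illustrative assumptions.

```python
def update_beta(beta: float, observed_kl: float, target_kl: float,
                gain: float = 0.1, clip: float = 0.2) -> float:
    """Raise beta when the KL overshoots the target, lower it when it undershoots.

    gain and clip are assumed, illustrative constants.
    """
    error = observed_kl / target_kl - 1.0
    error = max(-clip, min(clip, error))  # bound the size of a single adjustment
    return beta * (1.0 + gain * error)

# Policy has drifted too far (KL twice the target): beta grows.
beta_up = update_beta(0.1, observed_kl=12.0, target_kl=6.0)
# Policy is still glued to the reference (KL near 0): beta shrinks.
beta_down = update_beta(0.1, observed_kl=0.0, target_kl=6.0)
# On target: beta is left alone.
beta_same = update_beta(0.1, observed_kl=6.0, target_kl=6.0)
```

The clipping keeps one noisy KL estimate from swinging $β$ by more than a couple of percent per update.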

Why Not Just Backprop? The Discrete Sampling Problem

The objective is $\mathbb{E}_{y \sim π_{θ}}[r(y)]$. The only $θ$-dependence is in $π_{θ}(y)$, but $y$ is a discrete sample: once you commit to a token sequence, no gradient flows back through that choice.

The Reparameterization trick (VAE analogy)

In a VAE, the encoder outputs $μ, σ$ and you need to sample $z \sim \mathcal{N}(μ, σ^{2})$:

❌ Naive: $z = \mathrm{sample}(μ_{θ}, σ_{θ})$ (not differentiable)
✅ Reparameterized: $z = μ_{θ} + σ_{θ} \cdot ϵ$, $ϵ \sim \mathcal{N}(0, 1)$ (differentiable; the randomness is pushed into the parameter-free $ϵ$)
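A tiny numeric check of why the reparameterized form is useful. Take the toy objective $\mathbb{E}[z^{2}]$ (my own choice, not from the note), whose exact gradient with respect to $μ$ is $2μ$. With $z = μ + σϵ$ we can differentiate each sample directly, since $\partial z / \partial μ = 1$:

```python
import random

def pathwise_grad_mu(mu: float, sigma: float, n_samples: int, seed: int = 0) -> float:
    """Monte Carlo estimate of d/dmu E[z^2] using z = mu + sigma * eps."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)  # parameter-free noise
        z = mu + sigma * eps       # differentiable in mu and sigma
        total += 2.0 * z * 1.0     # chain rule: d(z^2)/dz * dz/dmu, with dz/dmu = 1
    return total / n_samples

mu, sigma = 1.5, 0.5
estimate = pathwise_grad_mu(mu, sigma, n_samples=20000)
exact = 2.0 * mu  # since E[z^2] = mu^2 + sigma^2
```

The estimate lands close to `exact` because every sample carries gradient information through $z$, which is exactly what a sampled token cannot do.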

For discrete tokens there’s no exact equivalent: you can’t write “token 42” as a differentiable function of the logits, because a sampled or argmax token index is piecewise constant in them.

The differentiable generator case: LPIPS

In image generation, the analogous problem is solved trivially — the VAE decoder is differentiable, so you can backprop a learned perceptual reward directly into the generator. LPIPS (Zhang et al. 2018) does exactly this: freeze a pretrained VGG, learn only a tiny linear weighting over its feature layers from human perceptual judgments, use it as a loss. No RL needed. The discrete token sampling problem is precisely what makes RLHF complicated where LPIPS is simple. See LDM.

The MoE / Gumbel-Softmax connection

The discrete analogue of the reparameterization trick, Gumbel-Softmax, is the same idea as the original MoE routing trick: replace a hard discrete choice with a soft weighted sum. The forward pass can still be hard (argmax, for efficiency) while the backward pass uses the soft (softmax) version; this is the straight-through estimator.

For LLM token generation you could apply this: instead of sampling one token, take a weighted sum over all token embeddings weighted by $π_{θ}$ . But this breaks down because:

The blended embedding $\tilde{e}_{t} = \sum_{v} π_{θ} (v) \cdot e_{v}$ is out-of-distribution — the model was trained on one-hot inputs
Errors compound across $T$ steps: each blended input shifts the next hidden state, so after only a few steps you’re on a completely different manifold from anything seen in training
The reward model was trained on real text, not blended pseudo-sequences
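The first point can be seen with a toy vocabulary. In this sketch the 3-token embedding table and the logits are invented for illustration; the blended vector is a convex combination of the rows and matches none of them:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def blended_embedding(probs, table):
    """Probability-weighted sum of embedding rows: sum_v pi(v) * e_v."""
    dim = len(table[0])
    return [sum(p * row[d] for p, row in zip(probs, table)) for d in range(dim)]

# Hypothetical 3-token vocabulary with 2-d embeddings.
embedding_table = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
probs = softmax([2.0, 1.0, 0.0])
soft = blended_embedding(probs, embedding_table)

# Does the blended vector coincide with any real token embedding?
is_real_token = any(
    all(abs(a - b) < 1e-9 for a, b in zip(soft, row)) for row in embedding_table
)

# A one-hot distribution recovers a real embedding exactly.
hard = blended_embedding([0.0, 1.0, 0.0], embedding_table)
```

`is_real_token` comes out `False`: the next layer receives an input it never saw during training, and that input is then fed forward autoregressively.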

The key distinction vs images

Mixup works in image classification because it’s a single-step operation — there’s no compounding. And the label is also mixed consistently. For autoregressive generation, what’s the “mixed target” for a mixed embedding? The supervision is only defined on real completed sequences.

Why Policy Gradient works

The log-derivative trick sidesteps differentiating through the sample entirely:

$$
\nabla_{θ} \mathbb{E}_{y \sim π_{θ}}[r(y)] = \mathbb{E}_{y \sim π_{θ}}[r(y) \nabla_{θ} \log π_{θ}(y)]
$$
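The identity follows in a few steps, assuming $r(y)$ itself does not depend on $θ$ and using $\nabla_{θ} π_{θ}(y) = π_{θ}(y) \nabla_{θ} \log π_{θ}(y)$:

$$
\nabla_{θ} \sum_{y} π_{θ}(y)\, r(y) = \sum_{y} r(y)\, \nabla_{θ} π_{θ}(y) = \sum_{y} π_{θ}(y)\, r(y)\, \nabla_{θ} \log π_{θ}(y)
$$

The last sum is again an expectation under $π_{θ}$, so it can be estimated from samples.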

Roll out real tokens → get a real reward → weight the log-prob gradient by that reward. No backprop through sampling is needed. The critic in actor-critic is just variance reduction on this (advantage $= r(y) - V(x)$ instead of raw $r(y)$).
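A toy check that this really recovers the right gradient. The sketch below uses a softmax policy over three possible "completions" with made-up rewards, and compares the score-function estimate against the closed-form gradient. The constant `baseline` argument is a crude stand-in for a critic's $V(x)$.

```python
import math
import random

# Hypothetical rewards for three possible "completions" (assumed, for illustration).
REWARDS = [1.0, 0.0, 3.0]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def exact_gradient(logits):
    """Closed-form d E[r] / d logit_k = p_k * (r_k - E[r]) for a softmax policy."""
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, REWARDS))
    return [p * (r - expected) for p, r in zip(probs, REWARDS)]

def score_function_gradient(logits, n_samples, baseline=0.0, seed=0):
    """Monte Carlo estimate of the same gradient via r(y) * grad log pi(y)."""
    rng = random.Random(seed)
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(n_samples):
        # Discrete sample: nothing is differentiated through this line.
        y = rng.choices(range(len(probs)), weights=probs)[0]
        weight = REWARDS[y] - baseline  # advantage-style weighting
        for k in range(len(logits)):
            # d log pi(y) / d logit_k = 1[k == y] - p_k
            grad[k] += weight * ((1.0 if k == y else 0.0) - probs[k])
    return [g / n_samples for g in grad]

logits = [0.0, 0.5, -0.5]
exact = exact_gradient(logits)
estimate = score_function_gradient(logits, n_samples=20000)
baselined = score_function_gradient(logits, n_samples=20000, baseline=1.0, seed=1)
```

Both estimates agree with `exact` up to sampling noise. Subtracting a constant baseline leaves the expectation unchanged because $\mathbb{E}[\nabla_{θ} \log π_{θ}(y)] = 0$.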
