From my conversation with Claude Sonnet 4.6


GRPO (Group Relative Policy Optimization), introduced in the DeepSeek Math paper, is essentially REINFORCE with a mean baseline. The only twist is that the baseline is estimated on the fly via Monte Carlo from rollouts of the same input, rather than from a separately trained critic.

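For a given input $q$, sample $G$ responses $o_1, \dots, o_G$ from the current policy, get rewards $r_1, \dots, r_G$ from a reward model, then define the advantage for response $o_i$ as:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}$$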

Then optimize with a PPO-style clipped surrogate objective using these advantages.
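The two steps above (group-normalized advantages, then a clipped surrogate) can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation: the function names, the `eps` stabilizer, and `clip_eps=0.2` are assumptions made here, the KL penalty is omitted, and a real trainer would compute this inside an autodiff framework so it can be differentiated.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """One advantage per response: (reward - group mean) / group std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)  # eps avoids 0/0 (assumed stabilizer)

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped objective for one group (a quantity to maximize).

    logp_new, logp_old: lists of 1-D arrays holding the per-token log-probs of
    each response under the current and the sampling policy.
    advantages: one scalar per response, shared by all of its tokens.
    """
    per_response = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = np.exp(np.asarray(lp_new) - np.asarray(lp_old))
        unclipped = ratio * adv
        clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
        # Average over the tokens of this response (original GRPO weighting).
        per_response.append(np.minimum(unclipped, clipped).mean())
    return float(np.mean(per_response))

# Toy group of four responses with different lengths.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_advantages(rewards)
logp_old = [np.array([-1.0, -2.0]), np.array([-0.5]),
            np.array([-1.5, -0.5, -1.0]), np.array([-2.0])]
# On the first update step the current policy equals the sampling policy.
objective = clipped_surrogate(logp_old, logp_old, adv)
print(adv, objective)
```

When the two policies coincide, every ratio is 1 and the objective reduces to the mean advantage, which is zero by construction. The learning signal comes from the gradient, not from the value of the objective.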

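The group mean is an unbiased estimator of $V^{\pi}(q) = \mathbb{E}_{o \sim \pi}[r(q, o)]$. It is a stochastic baseline that changes with every rollout, as opposed to a fixed learned $V_\phi(q)$. Each $\hat{A}_i$ depends on every $r_j$ in the group, including $r_i$ itself through the mean, which introduces a small self-correlation bias that is negligible for large $G$.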
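To make "small" concrete, here is a short derivation for the simplified case of mean subtraction only (no std division, no clipping), assuming the $G$ responses in a group are sampled i.i.d. from $\pi_\theta$. The shorthand $g_i = \nabla_\theta \log \pi_\theta(o_i \mid q)$, $\bar{r} = \frac{1}{G}\sum_j r_j$, and $J(\theta) = \mathbb{E}_{o \sim \pi_\theta}[r(q, o)]$ is notation introduced here for the sketch.

```latex
\begin{aligned}
\mathbb{E}\big[(r_i - \bar{r})\, g_i\big]
  &= \mathbb{E}[r_i\, g_i]
     - \frac{1}{G}\,\mathbb{E}[r_i\, g_i]
     - \frac{1}{G}\sum_{j \neq i} \mathbb{E}[r_j]\,\mathbb{E}[g_i] \\
  &= \Big(1 - \frac{1}{G}\Big)\,\mathbb{E}[r_i\, g_i]
   = \Big(1 - \frac{1}{G}\Big)\,\nabla_\theta J(\theta)
\end{aligned}
```

The cross terms vanish because $\mathbb{E}[g_i] = 0$ and $o_j$ is independent of $o_i$ for $j \neq i$. Under these assumptions, including $r_i$ in its own baseline only shrinks the expected gradient by a factor of $1 - 1/G$ and leaves its direction unchanged. Dividing by the group std breaks this simple argument.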

Why it works for reasoning

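When reward is binary (correct/incorrect), within-group normalization captures relative quality among responses to the same problem. A problem where the model always succeeds or always fails contributes zero gradient: every reward equals the group mean, so all advantages are zero (the std also collapses to zero, which is why implementations typically add a small constant to the denominator). That is the correct behavior, since such a group says nothing about which responses are better.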
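How often does a group carry any signal at all? As a rough sketch, assume each of the $G$ responses to a prompt is correct independently with probability `p` (the policy's success rate on that prompt). A group is informative only if it contains at least one correct and one incorrect response. The function name below is hypothetical.

```python
def informative_group_prob(p: float, G: int) -> float:
    """Probability that G i.i.d. binary rewards are not all equal.

    All-correct happens with probability p**G, all-wrong with (1 - p)**G;
    either case yields zero advantages for the whole group.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if G < 1:
        raise ValueError("G must be at least 1")
    return 1.0 - p**G - (1.0 - p)**G

for p in (0.01, 0.5, 0.99):
    print(p, informative_group_prob(p, 8))
```

Under this assumption, very easy and very hard prompts mostly produce degenerate groups, so most of the gradient comes from prompts the model solves only some of the time.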

In the DeepSeek Math paper, both PPO and GRPO use a learned neural reward model (not a rule-based one), which is the weak link. See Goodhart’s Law and Process Reward Models for the open problems there.

Dr. GRPO: Fixes to the Original Formulation

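The original GRPO loss averages over tokens per response (KL penalty omitted here):

$$J_{\text{GRPO}}(\theta) = \frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\Big(\rho_{i,t}\,\hat{A}_i,\ \operatorname{clip}(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_i\Big), \qquad \rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}$$

Here $G$ is the group size, $|o_i|$ is the token length of response $o_i$ to input $q$, and $\hat{A}_i$ is its group-normalized advantage.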

Dr. GRPO identifies two problems and removes both:

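1. Remove $1/|o_i|$ (division by the response's token count): length normalization makes the per-token gradient depend on response length. For the same positive reward, a shorter response gets a larger per-token gradient, which biases toward shorter responses; for a negative advantage, a longer response is penalized less per token, so incorrect answers tend to grow longer. Bad for chain-of-thought. The normalization is inherited from LM training, where token-averaging makes sense; it has the wrong semantics in the RL setting. Fix: sum the token-level terms over each response instead of averaging them, and normalize at the response level via the group advantage.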

2. Remove std from the advantage: dividing by std distorts relative magnitudes when within-group reward variance is low. If all responses are similarly good or bad, the small std causes instability or over-amplifies noise. Mean subtraction alone is sufficient for variance reduction:

$$\hat{A}_i = r_i - \operatorname{mean}(r_1, \dots, r_G)$$
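The effect of the two removals can be seen by looking at the coefficient that multiplies each token's gradient term. The sketch below ignores clipping and the KL term; the helper name and the toy rewards and lengths are assumptions made for illustration.

```python
import numpy as np

def per_token_coeff(rewards, lengths, length_norm=True, std_norm=True, eps=1e-8):
    """Coefficient applied to every token of each response.

    length_norm=True,  std_norm=True  -> original GRPO weighting
    length_norm=False, std_norm=False -> Dr. GRPO weighting
    """
    r = np.asarray(rewards, dtype=float)
    n = np.asarray(lengths, dtype=float)
    adv = r - r.mean()                 # group-mean baseline
    if std_norm:
        adv = adv / (r.std() + eps)    # eps is an assumed stabilizer
    return adv / n if length_norm else adv

# Two correct and two incorrect responses, each in a short and a long version.
rewards = [1.0, 1.0, 0.0, 0.0]
lengths = [10, 100, 10, 100]

grpo = per_token_coeff(rewards, lengths)
dr_grpo = per_token_coeff(rewards, lengths, length_norm=False, std_norm=False)
print("GRPO    :", grpo)
print("Dr. GRPO:", dr_grpo)
```

With the original weighting, the short correct response is pushed up ten times harder per token than the long correct one, and the long incorrect response is pushed down ten times less than the short incorrect one. With both normalizations removed, tokens of equally rewarded responses get the same coefficient.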

Connection to MCTS and PRMs

GRPO’s “group rollouts from the same state” is structurally the same as MCTS simulation: sample from $\pi_\theta$, backpropagate returns to estimate $V(s)$. A PRM bridges these: it lets you evaluate intermediate reasoning states without rolling to completion, functioning as the value function at MCTS leaf nodes. This enables tree search during training: collect high-quality trajectories via MCTS + PRM, then train the policy on those. The bottleneck is PRM reliability: a poorly fit PRM is a Goodhart problem at every step.
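The contrast between estimating a state's value by rollouts and reading it off a PRM can be sketched with toy stand-ins. Everything here is hypothetical: `toy_rollout` fakes "finish the solution and check the answer", and `toy_prm` fakes a process reward model that scores a partial solution directly.

```python
import random
from typing import Callable

_rng = random.Random(0)  # fixed seed so the toy example is reproducible

def mc_value(state: str, rollout: Callable[[str], float], n: int = 16) -> float:
    """Estimate V(state) as the mean return of n rollouts started from state."""
    return sum(rollout(state) for _ in range(n)) / n

def toy_rollout(state: str) -> float:
    """Complete the solution at random; good partial states succeed more often."""
    p_success = 0.8 if state.endswith("good step") else 0.2
    return 1.0 if _rng.random() < p_success else 0.0

def toy_prm(state: str) -> float:
    """Stand-in PRM: one call, no rollout, returns a value estimate directly."""
    return 0.8 if state.endswith("good step") else 0.2

good = "problem -> good step"
bad = "problem -> bad step"
good_mc = mc_value(good, toy_rollout, n=5000)
bad_mc = mc_value(bad, toy_rollout, n=5000)
print(good_mc, toy_prm(good))
print(bad_mc, toy_prm(bad))
```

The rollout estimate needs many completions per state, while the PRM answers in a single call. The price is that the search is only as good as the PRM's scores.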

General Applicability Beyond LLMs

GRPO is not LLM-specific. The pattern: fix a starting state $s_0$, “teleport” back to it, run $G$ rollouts, compute group-normalized advantages, update policy. LLMs make teleporting trivial (re-feed the prompt). In real environments you need a reset mechanism or world model, at which point this connects naturally to model-based RL.
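As a sanity check that the pattern does not depend on language models, here is a minimal sketch on a one-state toy problem (a three-armed bandit with deterministic rewards). It uses the mean-subtracted advantage without std division or clipping; the function name, hyperparameters, and reward values are assumptions chosen for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def train_group_baseline(arm_rewards, G=8, steps=300, lr=0.5, seed=0):
    """Group-relative policy gradient with a softmax policy over arms.

    Each step returns to the same start state (trivial here: there is only
    one state), samples G actions, and uses reward minus the group mean as
    the advantage of each sampled action.
    """
    rng = np.random.default_rng(seed)
    arm_rewards = np.asarray(arm_rewards, dtype=float)
    logits = np.zeros(len(arm_rewards))
    for _ in range(steps):
        probs = softmax(logits)
        actions = rng.choice(len(probs), size=G, p=probs)  # G rollouts, same state
        rewards = arm_rewards[actions]
        adv = rewards - rewards.mean()                     # group-mean baseline
        grad = np.zeros_like(logits)
        for a, A in zip(actions, adv):
            onehot = np.zeros_like(logits)
            onehot[a] = 1.0
            grad += A * (onehot - probs)                   # A * grad of log softmax
        logits += lr * grad / G                            # gradient ascent step
    return softmax(logits)

probs = train_group_baseline([0.0, 0.2, 1.0])
print(probs)
```

No critic is trained anywhere: the only baseline is the mean reward of the current group. In a multi-step environment the same loop would need a way to reset to the chosen start state before each rollout.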

Why LLMs don't need model-based RL

The “environment” (verifier / code interpreter) is essentially free to query, so the sample efficiency motivation for model-based RL largely disappears. The learned RM is also untrustworthy for imagined rollouts (Goodhart). The model-based analog for LLMs is test-time compute (beam search, MCTS at inference) rather than training-time world models.