DPO asks: what if you could skip the reward model and RL loop entirely? This post covers the mathematical insight that makes it work, how the loss function mirrors reward model training using log probability ratios, and why distribution shift is the price you pay for simplicity.
Motivation
RLHF is complex and expensive:
- Train a separate reward model
- Keep four models in memory simultaneously
- Do expensive autoregressive generation during training
- Run PPO with clipping, mini-epochs, advantage estimation
- Tune many hyperparameters (β, ε, learning rate, mini-epoch count)
DPO asks: what if you could skip all of that and directly fine-tune the language model on human preference data?
The Key Mathematical Insight
The RLHF objective is:
maximize: reward(response) - β × KL(policy || reference)
The DPO paper proved that the optimal policy under this objective has a closed-form relationship with the reward and the reference model. When you substitute this relationship back into the Bradley-Terry preference model, the reward cancels out entirely.
What remains is a loss that only involves the policy model and the reference model — no reward model needed.
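In outline, the derivation goes like this (a sketch following the DPO paper, with π for the policy, π_ref for the reference, σ for the sigmoid, and y_w / y_l for the preferred and rejected responses):

```latex
% Optimal policy for the KL-regularized objective; Z(x) normalizes over all responses
\pi^*(y \mid x) \;=\; \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,
    \exp\!\Big(\tfrac{1}{\beta}\, r(x,y)\Big)

% Solve for the reward
r(x,y) \;=\; \beta \log\frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \;+\; \beta \log Z(x)

% Substitute into Bradley-Terry: the intractable \log Z(x) terms cancel in the difference
P(y_w \succ y_l \mid x) \;=\;
    \sigma\!\Big(\beta \log\frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
               - \beta \log\frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)
```

The intractable normalizer Z(x) cancels in the difference, which is exactly why no explicit reward model is needed.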
The DPO Loss Function
L_DPO = -log sigmoid(β × [(log P_policy(pref) - log P_ref(pref))
- (log P_policy(rej) - log P_ref(rej))])
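A minimal PyTorch-style sketch of this loss (the function and tensor names are placeholders; how the per-response log probabilities are computed is covered in the next sections):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_pref, policy_logp_rej,
             ref_logp_pref, ref_logp_rej, beta=0.1):
    """DPO loss from summed log probabilities of whole responses.

    Each argument is a tensor of shape (batch,) holding log P(response | prompt)
    under the policy or the frozen reference model. Names are illustrative only.
    """
    # Implicit rewards: how much more likely the policy finds each response
    # than the reference does.
    pref_ratio = policy_logp_pref - ref_logp_pref
    rej_ratio = policy_logp_rej - ref_logp_rej

    # -log sigmoid(beta * [ratio_pref - ratio_rej]), averaged over the batch.
    return -F.logsigmoid(beta * (pref_ratio - rej_ratio)).mean()
```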
Structure Parallel with Reward Model Loss
Reward model: -log sigmoid(score_preferred - score_rejected)
DPO: -log sigmoid(β × [log_ratio_pref - log_ratio_rej])
The same Bradley-Terry / sigmoid / -log structure. But instead of learned reward scores, DPO uses the log probability ratio log P_policy - log P_ref as an implicit reward — “how much more likely does the policy find this response compared to the reference?”
What Each P Refers To
P_policy(preferred) is the probability of the entire preferred response — the product of all token probabilities:
log P_policy(pref) = log P(t1) + log P(t2|t1) + log P(t3|t1,t2) + ...
Same for the rejected response and for the reference model. Only response tokens contribute to the loss — prompt tokens are context only (same as SFT).
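A sketch of that sum computed from a single forward pass, assuming a response_mask that marks response tokens (the masking detail is an implementation convention, not something specified above):

```python
import torch

def sequence_log_prob(logits, input_ids, response_mask):
    """Sum of log P(token | previous tokens) over response tokens only.

    logits:        (batch, seq_len, vocab) from one forward pass
    input_ids:     (batch, seq_len) prompt + response token ids
    response_mask: (batch, seq_len) 1 for response tokens, 0 for prompt/padding
                   (assumed convention, analogous to SFT loss masking)
    """
    # Logits at position t predict the token at position t+1, so shift by one.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:]
    token_logps = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # Zero out prompt tokens; only response tokens contribute to the sum.
    return (token_logps * response_mask[:, 1:]).sum(dim=-1)
```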
Training: Four Forward Passes Per Example
- Policy model on preferred response → log P_policy(pref)
- Policy model on rejected response → log P_policy(rej)
- Reference model on preferred response → log P_ref(pref)
- Reference model on rejected response → log P_ref(rej)
Compute loss, backpropagate, update only the policy model weights. The reference model stays frozen.
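Putting the pieces together, one training step might look roughly like this, reusing dpo_loss and sequence_log_prob from the sketches above (the batch layout and the HuggingFace-style .logits attribute are assumptions for illustration):

```python
import torch

def dpo_step(policy_model, ref_model, batch, optimizer, beta=0.1):
    """One DPO update: four forward passes, one backward pass on the policy."""
    # Forward passes 1-2: policy model on preferred and rejected responses.
    policy_logp_pref = sequence_log_prob(
        policy_model(batch["pref_ids"]).logits, batch["pref_ids"], batch["pref_mask"])
    policy_logp_rej = sequence_log_prob(
        policy_model(batch["rej_ids"]).logits, batch["rej_ids"], batch["rej_mask"])

    # Forward passes 3-4: frozen reference model, no gradients needed.
    with torch.no_grad():
        ref_logp_pref = sequence_log_prob(
            ref_model(batch["pref_ids"]).logits, batch["pref_ids"], batch["pref_mask"])
        ref_logp_rej = sequence_log_prob(
            ref_model(batch["rej_ids"]).logits, batch["rej_ids"], batch["rej_mask"])

    loss = dpo_loss(policy_logp_pref, policy_logp_rej,
                    ref_logp_pref, ref_logp_rej, beta)

    # Backpropagate and update only the policy model's weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```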
What the Loss Pushes the Model to Do
- Make the preferred response more likely relative to the reference model
- Make the rejected response less likely relative to the reference model
The reference model ratio prevents the policy from simply increasing probability for all text — it must specifically widen the gap between preferred and rejected responses.
Why DPO Doesn’t Need the Value Network or Reward Model
No Reward Model
The log probability ratio acts as an implicit reward. The mathematical derivation shows this is equivalent to the explicit reward under the RLHF objective.
No Value Network
The value network in RLHF exists for credit assignment — figuring out which tokens in a generated response were responsible for the reward. DPO doesn’t generate responses during training. It uses existing preference pairs from a fixed dataset and treats entire responses as units. No generation means no credit assignment problem. All tokens in the preferred response get pushed up together, all tokens in the rejected response get pushed down together.
No Generation During Training
DPO does forward passes on existing text (like SFT), not autoregressive token-by-token generation (like PPO). This is dramatically cheaper.
Models in Memory
Only two:
- Policy model — being updated
- Reference model — frozen SFT copy
Compared to RLHF’s four models (policy, reference, reward, value).
DPO vs. RLHF Comparison
|  | RLHF + PPO | DPO |
|---|---|---|
| Models in memory | 4 | 2 |
| Generation during training | Yes (expensive) | No |
| Hyperparameters | Many (β, ε, LR, mini-epochs) | Few (β, LR) |
| Training complexity | High | Low |
| On-policy learning | Yes — evaluates model’s own outputs | No — fixed dataset |
| Distribution shift risk | Low | Higher |
DPO’s Limitation: Distribution Shift
The preference data was generated by an earlier model (usually the SFT model). As the policy improves during DPO training, it may assign low probability to both the preferred and rejected responses — it would generate something entirely different and better than either. The training signal becomes “slightly prefer this low-probability response over that other low-probability response” — which isn’t useful.
RLHF doesn’t have this problem because it generates fresh responses from the current policy at every iteration and scores them with the reward model.
Mitigations
- Iterative DPO: After a round of DPO, use the improved policy to generate new responses, collect new preferences, run DPO again (sketched after this list)
- Rejection sampling: Generate many responses from the current policy, use a reward model to pick best/worst, create fresh preference pairs
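A rough sketch of the iterative DPO loop from the first mitigation; generate_responses, collect_preferences, and run_dpo are hypothetical placeholders for the corresponding stages, not real APIs:

```python
def iterative_dpo(policy, ref, prompts, rounds=3):
    """Sketch of iterative DPO: refresh the preference data each round so it
    stays close to the current policy's output distribution."""
    for _ in range(rounds):
        # Generate fresh responses with the *current* policy.
        candidates = generate_responses(policy, prompts)        # hypothetical
        # Label them (human raters, or a reward model picking best/worst).
        preference_pairs = collect_preferences(candidates)      # hypothetical
        # Run a round of DPO on the fresh pairs. (Some variants also reset
        # the reference model to the improved policy between rounds.)
        policy, ref = run_dpo(policy, ref, preference_pairs)    # hypothetical
    return policy
```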
These additions reintroduce some complexity, but DPO remains far simpler than full RLHF. Despite the limitation, DPO has been widely adopted — models like Llama 3 and Zephyr used DPO for alignment.
Key Q&A
Q: The DPO loss looks like the reward model loss. Is it just borrowing that structure and using probability ratios instead of reward scores?
A: Exactly. The structure is the same Bradley-Terry form: a sigmoid of a difference, wrapped in -log. What differs is what gets compared: the reward model compares learned scalar scores, while DPO compares log probability ratios between the policy and the reference model. The log ratio acts as an implicit reward.
Q: Why doesn’t DPO need a value network?
A: The value network solves credit assignment for generated responses — “which tokens mattered?” DPO doesn’t generate anything during training. It takes existing response pairs and treats them as whole units. All tokens in the preferred response get pushed up together, all in the rejected get pushed down. No per-token attribution needed.
Q: What if the policy gets much better and both responses become very low probability?
A: This is the distribution shift problem. The training signal weakens because you’re learning from responses the current model wouldn’t produce. Mitigations include iterative DPO (regenerate preference data with the improved model) and rejection sampling.