The reward model translates human preferences into a scalar signal that RL can optimize. This post covers how it’s built from a pretrained transformer, trained on pairwise comparisons using the Bradley-Terry framework, and why the probabilistic foundation matters.
Purpose
The reward model is a proxy for human judgment. Humans provide preference labels (“I prefer response A over response B”), and the reward model learns to assign scalar scores to responses such that preferred responses score higher.
Architecture
Take a pretrained transformer (typically initialized from the SFT model), remove the final token-prediction head (hidden_dim → vocab_size), and replace it with a single linear layer (hidden_dim → 1).
Language Model head: 4096 → 50,000 (logits over the vocabulary, softmaxed into next-token probabilities)
Reward Model head: 4096 → 1 (scalar score)
The transformer body is identical — same layers, same attention + FFN blocks. Only the final projection changes.
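A minimal PyTorch sketch of the head swap, using the illustrative sizes from the text (hidden_dim = 4096, vocabulary = 50,000); the variable names are mine, not from any particular library.

```python
import torch.nn as nn

hidden_dim, vocab_size = 4096, 50_000  # illustrative sizes from the text

lm_head = nn.Linear(hidden_dim, vocab_size)  # language model head: logits over the vocabulary
reward_head = nn.Linear(hidden_dim, 1)       # reward model head: a single scalar per position
```

Both heads are plain linear projections; everything in front of them stays the same.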
How It Produces a Score
- The reward model does not generate anything
- It receives the full (prompt + response) concatenated as one input
- One forward pass through the transformer produces hidden states at every token position
- Take the hidden state at the last token position only (it has attended to everything via causal attention)
- Project that hidden state to a single scalar → the reward score
- Different response lengths don't matter: each response produces a hidden state at its own last position (see the scoring sketch below)
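A sketch of the scoring pass in PyTorch. Here `backbone` stands for the transformer body and is assumed to return hidden states of shape (batch, seq_len, hidden_dim), with `reward_head` the scalar projection from the sketch above; names and signatures are illustrative, not a specific library API.

```python
import torch

def reward_score(backbone, reward_head, input_ids, attention_mask):
    """Score a batch of (prompt + response) sequences with one forward pass."""
    hidden = backbone(input_ids, attention_mask=attention_mask)   # (batch, seq_len, hidden_dim)
    # Last real (non-padding) token in each sequence; causal attention means
    # this position has already attended to the entire prompt + response.
    last_idx = attention_mask.sum(dim=1) - 1                      # (batch,)
    last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]  # (batch, hidden_dim)
    return reward_head(last_hidden).squeeze(-1)                   # (batch,) scalar scores
```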
Training
Collecting Preference Data
- Take the SFT model (single set of weights)
- Give it a prompt
- Sample two or more different completions (same weights, different random draws during sampling)
- Show both completions to a human labeler
- The human says “I prefer response A over response B”
- Result: a dataset of (prompt, preferred response, rejected response) triples, as sketched below
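A sketch of the collection loop, assuming a Hugging Face-style `generate` API on the SFT model and tokenizer; `ask_labeler` is a hypothetical stand-in for the human annotation step, not a real library call.

```python
def collect_preferences(sft_model, tokenizer, prompts, ask_labeler):
    """Build (prompt, preferred, rejected) triples from pairwise human labels."""
    triples = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        # Same weights, two different random draws during sampling.
        outputs = sft_model.generate(
            **inputs, do_sample=True, temperature=1.0,
            max_new_tokens=256, num_return_sequences=2,
        )
        prompt_len = inputs["input_ids"].shape[1]
        resp_a, resp_b = (
            tokenizer.decode(seq[prompt_len:], skip_special_tokens=True) for seq in outputs
        )
        if ask_labeler(prompt, resp_a, resp_b) == "A":   # human picks the better response
            triples.append((prompt, resp_a, resp_b))     # (prompt, preferred, rejected)
        else:
            triples.append((prompt, resp_b, resp_a))
    return triples
```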
Loss Function
For each training example, two forward passes:
[prompt + preferred response] → score_preferred
[prompt + rejected response] → score_rejected
Loss:
L = -log(sigmoid(score_preferred - score_rejected))
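In PyTorch this is essentially one line; `logsigmoid` is the numerically stable form of `log(sigmoid(x))`, and the two score tensors are assumed to come from forward passes like the scoring sketch above.

```python
import torch
import torch.nn.functional as F

def preference_loss(score_preferred: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(score_preferred - score_rejected), averaged over the batch."""
    return -F.logsigmoid(score_preferred - score_rejected).mean()
```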
How the Loss Works
Sigmoid maps the score difference to (0, 1):
sigmoid(x) = 1 / (1 + e^(-x))
-log converts to a loss:
| Scenario | Score difference | sigmoid | -log(sigmoid) | Loss magnitude |
|---|---|---|---|---|
| Correct ranking (large gap) | +4 | 0.98 | 0.02 | Small |
| Can’t distinguish | 0 | 0.50 | 0.69 | Moderate |
| Wrong ranking | -4 | 0.02 | 4.02 | Large |
The model is penalized much more harshly for confident mistakes than it is rewarded for confident correct rankings: the loss grows roughly linearly as a wrong gap widens, but only shrinks toward zero as a correct gap widens.
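The table values follow directly from the formula; a quick check in plain Python:

```python
import math

def loss(diff):
    """-log(sigmoid(diff)) for a given score difference."""
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

for diff in (4, 0, -4):
    print(diff, round(1 / (1 + math.exp(-diff)), 2), round(loss(diff), 2))
# 4  0.98  0.02
# 0  0.5   0.69
# -4 0.02  4.02
```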
Absolute Scale Doesn’t Matter
Only the difference between scores matters. Scores of (95, 90) produce the same loss as (5, 0) — both have difference = 5. There’s no incentive to push scores apart once the ranking is already correct.
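This shift invariance is easy to verify: adding a constant to both scores leaves the difference, and therefore the loss, unchanged.

```python
import math

def loss(preferred, rejected):
    return -math.log(1.0 / (1.0 + math.exp(-(preferred - rejected))))

print(loss(95, 90), loss(5, 0))  # both ~0.0067: only the difference of 5 matters
```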
Bradley-Terry Foundation
The loss isn’t an arbitrary penalty — it comes from the Bradley-Terry model of pairwise comparison:
P(i > j) = s_i / (s_i + s_j)
Reparameterize the strengths as exponentials of unbounded scores, s_i = e^(r_i):
P(i > j) = e^(r_i) / (e^(r_i) + e^(r_j))
= 1 / (1 + e^(-(r_i - r_j)))
= sigmoid(r_i - r_j)
The reward model loss is the negative log-likelihood of observed human preferences under this model.
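A quick numerical sanity check of the derivation: the Bradley-Terry ratio with exponentiated strengths matches the sigmoid of the score difference.

```python
import math

r_i, r_j = 2.0, 0.0                          # unbounded reward-model scores
s_i, s_j = math.exp(r_i), math.exp(r_j)      # Bradley-Terry strengths

print(s_i / (s_i + s_j))                     # 0.8808  (Bradley-Terry P(i > j))
print(1.0 / (1.0 + math.exp(-(r_i - r_j))))  # 0.8808  (sigmoid(r_i - r_j))
```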
Why the Probabilistic Interpretation Matters
- Handles noisy labelers: If 7/10 labelers prefer A, the model learns P(A > B) ≈ 0.7 — a moderate score gap, not an extreme one. Contradictory labels don’t fight each other.
- Calibration: A score difference of 2 means ~88% preference probability everywhere (sigmoid(2) ≈ 0.88), making reward signals consistent across different prompts (see the quick check after this list).
- Theoretical guarantees: Maximum likelihood estimation is consistent: with enough data, you converge to the true preference ordering.
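To make those numbers concrete: the score gap that matches a 70% preference rate is the log-odds of 0.7, and a gap of 2 maps back to roughly 88% preference.

```python
import math

print(math.log(0.7 / 0.3))            # ~0.85: the moderate gap learned when 7/10 labelers prefer A
print(1.0 / (1.0 + math.exp(-2.0)))   # ~0.88: preference probability implied by a score gap of 2
```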
Key Q&A
Q: What do you mean by “replace the final token prediction head with a single linear layer”? A: A “head” is just a final output layer on top of the transformer body. The language model head is a matrix multiplication from hidden_dim (4096) to vocab_size (50,000). The reward model head replaces this with a matrix multiplication from 4096 to 1 — a single scalar output. Both are linear layers (matrix multiply, no activation function).
Q: What happens if responses have different lengths? A: Doesn’t matter. Both go through one forward pass. Both produce a hidden state at their respective last token position. Both get projected to one number. The last token works as a summary because causal attention lets it attend to everything before it.
Q: Does the response generation and reward model training happen at the same time? A: No, completely separate. First, generate two responses using the SFT model (autoregressive sampling). Then, feed each complete (prompt + response) into the reward model as a single forward pass and get scores. Generation and scoring are independent processes.
Q: How is sigmoid(score_A - score_B) related to the Bradley-Terry model?
A: Bradley-Terry uses positive strength scores combined as a ratio; the reward model outputs unbounded real-valued scores. The bridge is the exponential reparameterization s_i = e^(r_i): substituting it into the Bradley-Terry formula and simplifying yields exactly sigmoid(r_i - r_j).
Q: Why use this probabilistic loss instead of any function that penalizes wrong rankings? A: Many functions could penalize wrong rankings (hinge loss, squared difference). The probabilistic version wins because: (1) it gracefully handles disagreement among labelers, (2) scores are calibrated consistently across examples, (3) maximum likelihood has theoretical convergence guarantees that arbitrary penalty functions lack.