ChatGPT doesn't just predict words - it's been carefully trained to give answers humans prefer. But the standard method (RLHF) requires training a separate scoring AI and running reinforcement learning. DPO achieves the same result with a single classification loss. This post teaches you how, from scratch.
Why can't we just fine-tune on good examples? Why do we need human preferences at all?
RLHF works - but does it have to be this complicated?
What if the language model already knows what's good? And we just need to unlock it?
The Preference Problem
After SFT, the model can follow instructions and give safe answers. But which answer do humans actually prefer?
Both SFT and RLHF are alignment techniques - they both help the model do what humans want. But they handle different layers:
After SFT, the model can produce multiple safe responses to the same prompt. But which one is actually better?
What is Reinforcement Learning?
Before we can understand alignment, we need the learning framework behind it: try things, get feedback, improve.
Imagine an AI agent learning tic-tac-toe. Nobody tells it the rules or strategy. It just plays, gets told whether it won (+1), lost (-1), or drew (0), and slowly figures out what works.
Every RL system follows this cycle, whether it's playing chess, driving a car, or aligning a chatbot:
This cycle repeats thousands or millions of times. Each time, the agent gets a little better.
State
What the agent sees right now.
Chess: the board position
Chatbot: the user's prompt
Action
What the agent decides to do.
Chess: move a piece
Chatbot: generate a response
Reward
How good was that action?
Chess: +1 (win) or -1 (lose)
Chatbot: human says "good" or "bad"
Policy
The strategy: "when I see X, I do Y."
Chess: opening moves, tactics
Chatbot: the model's weights
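The four pieces above can be wired together in a few lines. Here is a toy sketch of the cycle - a two-armed bandit rather than tic-tac-toe, with made-up reward probabilities, but the same loop: act, get a reward, update the policy:

```python
import random

# Toy RL cycle: a two-armed bandit with made-up payoff probabilities.
# The "policy" is a value estimate per action; reward feedback improves it.
random.seed(0)
values = {"safe": 0.0, "risky": 0.0}   # the policy's current beliefs
counts = {"safe": 0, "risky": 0}

def reward(action):
    # Environment (hypothetical): "safe" pays off 80% of the time, "risky" 20%.
    return 1 if random.random() < (0.8 if action == "safe" else 0.2) else -1

for step in range(1000):
    # Action: explore occasionally, otherwise exploit the current policy
    if random.random() < 0.1:
        action = random.choice(["safe", "risky"])
    else:
        action = max(values, key=values.get)
    r = reward(action)                  # Reward: feedback from the environment
    counts[action] += 1                 # Policy update: running average of rewards
    values[action] += (r - values[action]) / counts[action]
```

After a thousand cycles, `values["safe"]` settles near +0.6 and `values["risky"]` near -0.6 - the agent has learned a policy without ever being told the rules.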
How RLHF Works
The Cook and the Critic: a powerful but complex alignment pipeline.
Step 1: Collect Human Preferences
Human annotators see a prompt and two responses. They pick which one is better. This creates preference pairs: (prompt, chosen, rejected). Try it yourself - click the better response:
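In code, each annotator click becomes one record. Using the crypto example from this post:

```python
# One preference pair, as produced by a single annotator comparison
preference_pair = {
    "prompt": "Should I invest all my savings in crypto?",
    "chosen": "Research thoroughly, diversify, only invest what you can afford to lose.",
    "rejected": "YES! Buy Bitcoin NOW! You'll be a millionaire!",
}
```

A preference dataset is just thousands of these triples - no numeric scores, only relative judgments.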
Step 2: Train a Reward Model
We can't have a human rate every single response during training - that's millions of responses. So we train a separate neural network to predict human scores automatically.
The reward model is trained on the preference pairs from Step 1. It sees thousands of examples where humans said "Response A is better than Response B" and learns to assign higher scores to responses humans would prefer.
The reward model is a neural network. Text goes in, a score comes out. Here's what happens inside:
Each connection has a weight (a number). During training, these weights are adjusted so the network outputs high scores for good responses and low scores for bad ones. The blue connections show the strongest signal paths.
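How does the adjustment work for one preference pair? The standard objective (Bradley-Terry) is to maximize the probability that the chosen response outscores the rejected one - equivalently, minimize the negative log of that probability. A sketch with made-up scalar scores:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Hypothetical reward-model scores for one preference pair
r_chosen, r_rejected = 2.1, 0.4

# Bradley-Terry loss: -log P(chosen beats rejected)
loss = -math.log(sigmoid(r_chosen - r_rejected))   # small when chosen scores higher
```

The bigger the margin between the two scores, the smaller the loss - so training pushes chosen and rejected scores apart.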
Now give it a prompt and ANY response - it outputs a score predicting how much humans would like it:
Step 3: PPO - What Actually Happens
PPO (Proximal Policy Optimization) is the RL algorithm that updates the model using the reward scores.
PPO = push up what the reward model likes, but stay anchored to who you used to be.
1. The model sees a prompt: "Should I invest all my savings in crypto?"
2. It writes a response: "Yeah, go all in on meme coins!"
3. A separate scoring AI (the reward model from Step 2) grades that response: 1.3 / 10. Ouch.
4. PPO reads that bad grade and nudges the model's internal settings so that next time, it is slightly less likely to say that.
5. Repeat thousands of times. Gradually, safe answers like "Diversify and only invest what you can afford to lose" climb to the top, and reckless answers sink to the bottom.
KL divergence stands for Kullback-Leibler divergence. It measures how different two probability distributions are - in our case, how far the training model has drifted from the original reference model.
Why it matters: Without the KL penalty, the model might find weird loopholes to get high reward scores - like repeating "I am helpful" 500 times (reward hacking). The KL divergence acts as a leash: "improve your answers, but don't change so much that you become a completely different model."
How it works in practice: The RLHF objective has a term $\beta \cdot D_{\text{KL}}[\pi_\theta \| \pi_{\text{ref}}]$ that penalizes large changes. The hyperparameter $\beta$ controls how tight the leash is - higher $\beta$ means the model stays closer to the original, lower $\beta$ gives it more freedom to change.
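That penalized objective can be sketched with toy numbers. The per-token log-ratio sum below is a common sample-based estimate of the KL term; all values here are made up for illustration:

```python
import math

beta = 0.02                        # leash tightness (hypothetical value)
reward_model_score = 1.3           # score from the reward model (made up)

# Per-token log-probs of the sampled response under the training policy
# and the frozen reference model (made-up values)
pi_logps  = [-1.2, -0.4, -2.1]
ref_logps = [-1.5, -0.5, -1.0]

# Sample-based KL estimate: sum of log-ratios log(pi/ref) per token
kl_estimate = sum(p - r for p, r in zip(pi_logps, ref_logps))

# Reward the policy actually trains on: raw score minus the KL penalty
penalized_reward = reward_model_score - beta * kl_estimate
```

Higher `beta` shrinks the reward whenever the policy drifts from the reference - the leash in code form.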
Five candidate responses to "Should I invest all my savings in crypto?" - watch how PPO reshapes the probability distribution:
DPO - Skip the Critic
What if we skip the reward model entirely and train directly on the preference pairs?
"Research thoroughly, diversify, only invest what you can afford to lose."
"YES! Buy Bitcoin NOW! You'll be a millionaire!"
That's it. No reward model, no critic model, no RL loop. Just the policy model and a frozen reference model - increase the probability of good responses, decrease the probability of bad ones.
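Concretely, the single loss DPO minimizes per preference pair is:

$\mathcal{L}_{\text{DPO}} = -\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)$

where $y_w$ is the chosen response, $y_l$ the rejected one, and $\beta$ controls how strongly drift from the reference model is penalized.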
Walking Through the Math
Follow one training example, step by step, with real numbers - no background knowledge needed.
Chosen: "Research thoroughly, diversify..."
Rejected: "YES! Buy Bitcoin NOW!"
Beta: 0.1
Model says rejected has probability: 0.25 (25%)
Reference says rejected has probability: 0.30 (30%)
Rejected reward = beta × log(0.25 / 0.30) = 0.1 × (-0.182) = -0.018
(Model is only 51.5% sure chosen is better - needs more training)
(High loss = model hasn't learned yet)
Model prob for rejected: 0.25 → 0.25
New reward gap: 0.059
New confidence score: 51.5%
New loss: 0.664
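The walkthrough's numbers can be reproduced in a few lines. The rejected-side probabilities (0.25 vs. 0.30) come from the example above; the chosen-side probabilities (0.45 for the model vs. 0.30 for the reference) are assumed values picked to match the reported 0.059 reward gap:

```python
import math

beta = 0.1

# Rejected side: probabilities from the walkthrough above
r_rejected = beta * math.log(0.25 / 0.30)    # implicit reward, about -0.018

# Chosen side: hypothetical probabilities that reproduce the article's gap
r_chosen = beta * math.log(0.45 / 0.30)      # implicit reward, about 0.041

gap = r_chosen - r_rejected                  # reward gap, about 0.059
confidence = 1 / (1 + math.exp(-gap))        # sigma(gap), about 0.515
loss = -math.log(confidence)                 # DPO loss, about 0.664
```

A confidence barely above 50% and a loss near 0.66 means the model has not yet learned to separate the two responses - exactly the "needs more training" state shown above.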
Adjust the sliders to explore how each parameter affects the DPO loss.
Drift penalty strength. Higher = stay closer to reference model.
$\log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)}$ - positive means the trained model favors the chosen response more than the reference does.
$\log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}$ - negative means the trained model disfavors the rejected response more than the reference does.
The Math Derivation
Optional deep dive - but here's what you need to know first.
The key insight: you don't need a separate reward model. The reward is already hiding inside the language model itself.
The Bradley-Terry model says $P(y_w \succ y_l) = \sigma(r(x, y_w) - r(x, y_l))$. Substituting the implicit reward $r(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$:

$P(y_w \succ y_l) = \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} + \beta \log Z(x) - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} - \beta \log Z(x)\right)$
The $\beta \log Z(x)$ terms cancel because both responses share the same prompt!
DPO in PyTorch
The entire DPO loss fits in under 10 lines of code. Compare this to a full PPO implementation that typically runs hundreds of lines.
```python
import torch.nn.functional as F

def dpo_loss(pi_logps_chosen, pi_logps_rejected,
             ref_logps_chosen, ref_logps_rejected, beta=0.1):
    """Direct Preference Optimization loss."""
    # Implicit rewards from log-ratios
    chosen_rewards = beta * (pi_logps_chosen - ref_logps_chosen)
    rejected_rewards = beta * (pi_logps_rejected - ref_logps_rejected)
    # DPO loss = -log sigmoid(reward_chosen - reward_rejected)
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    return loss
```
Results
DPO matches or exceeds RLHF across sentiment, summarization, and dialogue tasks - at a fraction of the compute.
| Task | Metric | PPO (RLHF) | DPO | Best-of-N |
|---|---|---|---|---|
| Sentiment (IMDb) | Quality vs. Drift from Original | Strong trade-off curve | Pareto-dominates PPO (better quality AND less drift) | N/A |
| Summarization (TL;DR) | GPT-4 Win Rate | 57% | 61% | 52% |
| Dialogue (Anthropic HH) | GPT-4 Win Rate | 54% | 60% | Best-of-128: 58% |
Higher and to the left is better - more quality with less drift from the original model.
When to Use DPO vs RLHF
Each method has its sweet spot. Here's how to choose.
Choose DPO when:
- You have a static preference dataset
- You want simplicity, reproducibility, and an easy implementation
- Compute budget is limited
- You need stable, crash-free training

Choose RLHF (PPO) when:
- You need the model to generate fresh responses during training
- You have complex multi-turn tasks
- You can afford the compute and engineering
- You need the reward model for other purposes
- Data mismatch (when training data doesn't match real use) is a major concern
Where It Fits, What It Proves, What Comes Next
The Problem
RLHF is unstable, expensive, and complex. It requires training a separate reward model, running PPO, and keeping multiple models in memory.
Most teams lack the resources to do it properly.
The Solution
- DPO reformulates alignment as classification on preference pairs
- One loss function, one hyperparameter ($\beta$)
- Same theoretical objective as RLHF
- Matches or exceeds PPO on all benchmarks tested
- A fraction of the compute cost
Open Questions
- Offline only - no generating fresh responses during training
- Untested at largest scales (at time of publication)
- Data mismatch when training data doesn't match real use
- Bias amplification concerns
- Likelihood displacement - DPO can push the chosen response's probability DOWN even while correctly ranking it above the rejected one (both drop, but rejected drops more - like two runners both slowing down, but the bad runner slows more)
"The best alignment technique is the one simple enough that everyone uses it correctly."