Interactive Demo

Direct Preference Optimization

Your Language Model is Secretly a Reward Model

Based on Rafailov et al. (2023) · arXiv:2305.18290 · NeurIPS 2023

Prepared for topic sharing by Jin Jianzuo, Kuibin Lin, Lee Shih Nyien, Shen Yanwen, Tan Sze Vy & Viknesh
All authors contributed equally. Names listed in alphabetical order.
Reviewed by Kuibin Lin

Post-Training Preference Learning Alignment
Scroll to explore the full pipeline

ChatGPT doesn't just predict words - it's been carefully trained to give answers humans prefer. But the standard method (RLHF) requires training a separate scoring AI and running reinforcement learning. DPO achieves the same result with a single classification loss. This post teaches you how, from scratch.

1

Why can't we just fine-tune on good examples? Why do we need human preferences at all?

2

RLHF works - but does it have to be this complicated?

3

What if the language model already knows what's good? And we just need to unlock it?

10-minute read · interactive demos · no ML background required
Acknowledgment: Special thanks to Kuibin Lin (GitHub) for reviewing this post and providing detailed technical feedback on the RLHF/DPO explanations.
Section 1

The Preference Problem

After SFT, the model can follow instructions and give safe answers. But which answer do humans actually prefer?

Two Layers of Alignment

Both SFT and RLHF are alignment techniques - they both help the model do what humans want. But they handle different layers:

SFT (Layer 1)
Teaches instruction following and safety via curated examples. After SFT, the model gives decent, safe answers.
RLHF / DPO (Layer 2)
Teaches nuanced preferences via human comparisons. Of two safe answers, which one do humans actually prefer?

After SFT, the model can produce multiple safe responses to the same prompt. But which one is actually better?

U
User
Should I invest all my savings in crypto?
A
Response A (after SFT)
Cryptocurrency is a volatile asset class. You should consider your risk tolerance and financial situation before making investment decisions.
Safe and correct, but generic and unhelpful.
B
Response B (after SFT)
I'd recommend researching thoroughly before investing. Diversify your portfolio, only invest what you can afford to lose, and consider consulting a financial advisor. Dollar-cost averaging can help manage volatility if you do decide to invest.
Also safe, but much more helpful - gives actionable advice.
Both responses are safe. SFT did its job - neither one gives dangerous advice. But Response B is clearly more helpful. How do we teach the model to consistently produce B-style responses? That's the preference problem. RLHF and DPO are two ways to solve it.
Section 2

What is Reinforcement Learning?

Before we can understand alignment, we need the learning framework behind it: try things, get feedback, improve.

Learning Through Trial and Error: Tic-Tac-Toe

Imagine an AI agent learning tic-tac-toe. Nobody tells it the rules or strategy. It just plays, gets told whether it won (+1), lost (-1), or drew (0), and slowly figures out what works.

State
Current board
Action
Place X in a cell
Reward
Win +1 / Lose -1
Policy
Where to place X
That's reinforcement learning. No one told the agent the rules. It played hundreds of games, got feedback (+1 or -1), and gradually figured out: take the center, block the opponent, set up forks. It learned strategy from outcomes alone.
The RL Loop

Every RL system follows this cycle, whether it's playing chess, driving a car, or aligning a chatbot:

Agent
The learner
-- takes an Action -->
Environment
The world
<-- returns Reward + new State --

This cycle repeats thousands or millions of times. Each time, the agent gets a little better.
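That loop can be sketched in a few lines of Python. Everything here (the ToyEnv class, the 0.01 learning rate, the one-number policy) is an illustrative invention, not from any RL library: an agent learns, from +1/-1 feedback alone, that action 1 pays off.

```python
import random

class ToyEnv:
    """A toy environment: it secretly rewards action 1 and punishes action 0."""
    def step(self, action):
        return 1 if action == 1 else -1  # the reward signal

def learn(steps=1000, lr=0.01):
    """The agent's policy is a single number: the probability of picking action 1."""
    env = ToyEnv()
    p = 0.5  # start with no preference at all
    for _ in range(steps):
        action = 1 if random.random() < p else 0  # act according to the policy
        reward = env.step(action)                 # get feedback from the world
        # Nudge the policy: make rewarded actions more likely next time.
        if action == 1:
            p += lr * reward * (1 - p)
        else:
            p -= lr * reward * p
        p = min(max(p, 0.01), 0.99)
    return p
```

After a few hundred steps `learn()` returns a probability near 0.99: the agent has discovered which action the environment prefers, purely from rewards. Swap "board position" for "prompt" and "move" for "response" and this is the chatbot setting.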

The Four Key Concepts
S

State

What the agent sees right now.

Chess: the board position
Chatbot: the user's prompt

A

Action

What the agent decides to do.

Chess: move a piece
Chatbot: generate a response

R

Reward

How good was that action?

Chess: +1 (win) or -1 (lose)
Chatbot: human says "good" or "bad"

P

Policy

The strategy: "when I see X, I do Y."

Chess: opening moves, tactics
Chatbot: the model's weights

The key challenge for chatbots: In chess, the reward is clear - you win or lose. But for a chatbot, what counts as a "good" response? We need humans to judge. That's where RLHF comes in.
Section 3

How RLHF Works

The Cook and the Critic: a powerful but complex alignment pipeline.

C
Cook
Here's my response to the prompt.
R
Critic
I'd give that a 4.2 out of 10.
C
Cook
...adjusts recipe, tries again...
R
Critic
Better! 7.8 out of 10.
S
Sous Chef
Careful - don't change your cooking style too much from the original.
The drift penalty keeps the policy anchored to its original behavior.

Step 1: Collect Human Preferences

Human annotators see a prompt and two responses. They pick which one is better. This creates preference pairs: (prompt, chosen, rejected).

Step 2: Train a Reward Model

We can't have a human rate every single response during training - that's millions of responses. So we train a separate neural network to predict human scores automatically.

How the Reward Model Learns

The reward model is trained on the preference pairs from Step 1. It sees thousands of examples where humans said "Response A is better than Response B" and learns to assign higher scores to responses humans would prefer.

Training Example 1
Human said: A is better than B
A: "Diversify, research, only invest what you can afford to lose" Winner
B: "Buy Bitcoin NOW! You'll be rich!" Loser
Training Example 2
Human said: B is better than A
A: "Just YOLO your savings" Loser
B: "Consider your risk tolerance and financial goals first" Winner
After seeing thousands of pairs like these, the reward model learns a pattern: responses that are helpful, honest, and safe get high scores. Responses that are reckless, misleading, or harmful get low scores. Now it can score ANY new response automatically.
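In standard RLHF pipelines, "assign higher scores to winners" is implemented with a pairwise Bradley-Terry loss: the reward model is penalized whenever the rejected response outscores the chosen one. A minimal pure-Python sketch (in practice the scores come from the neural network described below):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward_model_loss(score_chosen, score_rejected):
    """Pairwise loss: -log sigmoid(chosen - rejected).
    Near zero when the chosen response already outscores the rejected one;
    large when the ordering is wrong."""
    return -math.log(sigmoid(score_chosen - score_rejected))

# Scores far apart in the right order -> almost nothing left to learn
good_gap = reward_model_loss(8.7, 0.8)
# Wrong ordering (chosen scored below rejected) -> large corrective loss
bad_gap = reward_model_loss(4.2, 6.0)
```

Note the shape of the formula: -log sigmoid(difference of scores). DPO will reuse exactly this shape later, just with a different definition of "score".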
Inside the Reward Model: A Neural Network

The reward model is a neural network. Text goes in, a score comes out. Here's what happens inside:

[Figure: a small feed-forward network. Words enter the input layer, pass through two hidden layers that process patterns and extract meaning, and the output layer emits a single score - e.g. "Research thoroughly..." scores 8.7.]

Each connection has a weight (a number). During training, these weights are adjusted so the network outputs high scores for good responses and low scores for bad ones. The blue connections show the strongest signal paths.

Step 3 Preview: How PPO Adjusts the Language Model

PPO takes the reward model's scores and uses them to adjust the language model's weights. Here's one training cycle:

1. Generate
LM
Language model writes a response to the prompt
2. Score
RM
Reward model scores it
4.2 / 10
3. Adjust
PPO
Nudge weights: make high-scoring patterns more likely
4. Repeat
x1000
Do this thousands of times until the model consistently scores high
Weight changes across training: PPO shifted the probability distribution, raising "Research..." from 20% to 75% and cutting "Buy Bitcoin!" from 20% to 5%.
The Trained Reward Model in Action

Now give it a prompt and ANY response - it outputs a score predicting how much humans would like it:

Prompt
"Should I invest all my savings in crypto?"
"Research thoroughly, diversify, only invest what you can afford to lose." 8.7
"Yeah crypto is pretty cool, you should probably buy some." 4.2
"YES! Sell your house and go all in on Bitcoin! Can't lose!" 0.8
The catch: Training this reward model is itself expensive and error-prone. It needs thousands of human-labelled pairs, it's a whole separate neural network that takes GPU memory, and if it learns the wrong patterns, the entire RLHF pipeline breaks. This is one of the things DPO eliminates.

Step 3: PPO - What Actually Happens

PPO (Proximal Policy Optimization) is the RL algorithm that updates the model using the reward scores.

PPO = push up what the reward model likes, but stay anchored to who you used to be.

What PPO actually does, in 5 steps:

1. The model sees a prompt: "Should I invest all my savings in crypto?"
2. It writes a response: "Yeah, go all in on meme coins!"
3. A separate scoring AI (the reward model from Step 2) grades that response: 1.3 / 10. Ouch.
4. PPO reads that bad grade and nudges the model's internal settings so that next time, it is slightly less likely to say that.
5. Repeat thousands of times. Gradually, safe answers like "Diversify and only invest what you can afford to lose" climb to the top, and reckless answers sink to the bottom.
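The signal PPO pushes on can be sketched as "reward minus drift penalty". This is a deliberately simplified single-response view with illustrative names (real PPO adds clipping, a learned value baseline, and per-token credit assignment):

```python
def ppo_signal(reward, logp_policy, logp_ref, beta=0.1):
    """Simplified per-response RLHF objective: the reward model's score,
    minus a penalty for drifting from the frozen reference model.
    (logp_policy - logp_ref) is a one-sample estimate of the KL drift."""
    drift = logp_policy - logp_ref
    return reward - beta * drift
```

Two responses with the same reward score are no longer tied: the one that required less drift from the reference model wins.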
The reference model (leash): Before PPO training starts, we save a frozen copy of the model. This frozen copy is called the "reference model" - it NEVER changes during training.
KL Divergence - The Leash

KL divergence stands for Kullback-Leibler divergence. It measures how different two probability distributions are - in our case, how far the training model has drifted from the original reference model.

KL = 0.1
Barely changed from original. Safe.
KL = 2.0
Moderate drift. Getting risky.
KL = 10.0
Drifted too far. Model may produce gibberish.

Why it matters: Without the KL penalty, the model might find weird loopholes to get high reward scores - like repeating "I am helpful" 500 times (reward hacking). The KL divergence acts as a leash: "improve your answers, but don't change so much that you become a completely different model."

How it works in practice: The RLHF objective has a term $\beta \cdot D_{\text{KL}}[\pi_\theta \| \pi_{\text{ref}}]$ that penalizes large changes. The hyperparameter $\beta$ controls how tight the leash is - higher $\beta$ means the model stays closer to the original, lower $\beta$ gives it more freedom to change.
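For a discrete distribution, KL divergence is a one-line formula. A minimal sketch, using the five-response probabilities from the before/after PPO demo in this section:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

before = [0.20, 0.22, 0.18, 0.19, 0.21]  # roughly uniform, pre-PPO
after  = [0.52, 0.08, 0.03, 0.32, 0.05]  # reshaped by PPO
drift = kl_divergence(after, before)     # ~0.46: real movement, not gibberish
```

Identical distributions give KL = 0; the further `after` moves from `before`, the larger the penalty $\beta \cdot \text{KL}$ becomes.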

Before vs After PPO: Response Probabilities

Five candidate responses to "Should I invest all my savings in crypto?" - watch how PPO reshapes the probability distribution:

Before PPO
Research + diversify
20%
Buy meme coins
22%
Go all in on Bitcoin
18%
Consult a financial advisor
19%
Max out credit cards
21%
After PPO
Research + diversify
52%
Buy meme coins
8%
Go all in on Bitcoin
3%
Consult a financial advisor
32%
Max out credit cards
5%
Reward model: "this is good" (+8.1). Reference model: "don't change too much."
Why RLHF is expensive: To run PPO, you need up to 4 models loaded simultaneously: the actor (the model being trained), the reference model (frozen copy for the KL leash), the reward model (scores responses), and the critic model (estimates future rewards for PPO). Plus, you need to train the reward model first, then run the entire RL pipeline on top. That's a LOT of GPU memory and compute.
4
Models in Memory
1000s
Responses per Batch
10+
Tuning Knobs
Often
Crashes / Instability
The cost of RLHF: It works, but it requires a separate reward model, a complex RL loop, massive GPU memory, and careful tuning of many knobs. Most teams lack the resources to do it properly. What if we could skip all of that?
Section 4

DPO - Skip the Critic

What if we skip the reward model entirely and train directly on the preference pairs?

E
Engineer
RLHF works but it needs a separate reward model AND reinforcement learning. So much complexity.
M
Mathematician
What if we skip the critic entirely? Just study their past scorecards.
E
Engineer
You mean... train directly on the preference pairs?
M
Mathematician
Exactly. The math proves you get the same result.
Same theoretical guarantees. A fraction of the code.
Pipeline Comparison
RLHF: SFT Model (start) → Reward Model (scores responses) → PPO Loop (RL training) → Aligned Model. 4 stages · up to 4 models in GPU memory · RL required.
DPO: SFT Model (start) → DPO Loss (classification on preference pairs) → Aligned Model. 2 stages · 2 models in GPU memory · no RL.
The Tug-of-War: DPO Training in Action
Prompt
"Should I invest all my savings in crypto?"
Chosen

"Research thoroughly, diversify, only invest what you can afford to lose."

Rejected

"YES! Buy Bitcoin NOW! You'll be a millionaire!"

At the start of training, chosen and rejected each sit at 50%: the model has no preference. Each DPO training step then shifts probability mass away from the rejected response and toward the chosen one.

That's it. No reward model, no critic model, no RL loop. Just the policy model and a frozen reference model - increase the probability of good responses, decrease the probability of bad ones.

Section 5

The "Hidden Reward" Aha Moment

The key DPO insight: the reward was hiding inside the language model all along.

Quick Detour: How Does a Model "Like" a Response?

Every language model assigns a probability to each response -- basically, "what percentage of the time would I write this?" A higher percentage means the model thinks that response is a more natural thing to say.

"Research thoroughly..."
30% likely
"Yeah, crypto is cool..."
12% likely
"Buy Bitcoin NOW!"
5% likely
That's it. Higher percentage = the model is more likely to write that response. In the actual math, we use "log-probabilities" (the logarithm of these percentages) because computers handle multiplication and very small numbers more efficiently that way. But the idea is just: how likely is each response?
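The "computers handle very small numbers" point is easy to demonstrate. A response's probability is the product of many per-token probabilities, and that product underflows fast:

```python
import math

per_token_p = 0.1   # say each token has probability 0.1
n_tokens = 500      # a 500-token response

product = per_token_p ** n_tokens            # 1e-500: underflows to exactly 0.0
log_prob = n_tokens * math.log(per_token_p)  # about -1151: perfectly usable
```

Multiplying probabilities collapses to zero; summing log-probabilities keeps the same information as an ordinary number, which is why every formula that follows works in log space.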

Now the key question: if DPO skips the reward model, where does the reward come from?

Where does the reference model come from? It is simply a frozen copy of the model BEFORE DPO training started (the SFT model). It never changes during training. By comparing the training model against this frozen copy, we can measure how much the model's preferences have shifted -- and that shift IS the implicit reward.
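That shift has a one-line formula: DPO's implicit reward is $\beta$ times the log of how much the training model's probability for a response has moved relative to the frozen reference (up to a constant that cancels later). A sketch, using the same probabilities as the walkthrough in the next section:

```python
import math

def implicit_reward(p_policy, p_ref, beta=0.1):
    """beta * log(p_policy / p_ref): positive if the training model now
    likes the response more than the frozen reference does."""
    return beta * math.log(p_policy / p_ref)

r_up = implicit_reward(0.30, 0.20)    # preference shifted up -> +0.041
r_down = implicit_reward(0.25, 0.30)  # preference shifted down -> -0.018
```

No reward model is ever consulted: both numbers come purely from asking the two language models "how likely is this response?"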
Section 6

Walking Through the Math

Follow one training example, step by step, with real numbers -- no background knowledge needed.

What is "loss"? Loss measures how wrong the model still is. High loss = still learning. Low loss = aligned.
$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$
Step-by-Step: One Training Example
Setup
Prompt: "Should I invest all my savings in crypto?"
Chosen: "Research thoroughly, diversify..."
Rejected: "YES! Buy Bitcoin NOW!"
Beta: 0.1
Why: First, ask the model how likely it would generate each response.
Step 1: Get model probabilities
Model says chosen has probability: 0.30 (30%)
Model says rejected has probability: 0.25 (25%)
Why: Next, ask the ORIGINAL model (before training) the same question. This is our baseline.
Step 2: Get reference model probabilities
Reference says chosen has probability: 0.20 (20%)
Reference says rejected has probability: 0.30 (30%)
Why: Now compare -- has the model become more or less likely to say each response? Divide the new probability by the old one. Above 1 = "likes it more." Below 1 = "likes it less."
Step 3: Measure how much the model's preference shifted
Chosen reward = beta x log(0.30 / 0.20) = 0.1 x 0.405 = +0.041
Rejected reward = beta x log(0.25 / 0.30) = 0.1 x -0.182 = -0.018
Why: Subtract the two scores. A bigger gap means the model has a stronger preference for the good answer over the bad one.
Step 4: How strong is the preference?
Gap = +0.041 - (-0.018) = 0.059
Why: Convert that gap into a confidence percentage -- "how sure is the model that the good answer is actually better?"
Step 5: Turn the gap into a confidence score
Confidence score = sigmoid(0.059) = 51.5% (sigmoid is a function that squishes any number into a 0-to-100% scale)
(Model is only 51.5% sure chosen is better - needs more training)
Why: The loss is the model's "grade" -- high means it still can't tell good from bad; low means it has learned the difference. Training keeps going until this number drops.
Step 6: Grade the model
Loss = -log(0.515) = 0.664
(High loss = model hasn't learned yet)
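The whole six-step walkthrough fits in a dozen lines of pure Python, reproducing every number above:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

beta = 0.1
p_chosen, p_rejected = 0.30, 0.25      # Step 1: training model probabilities
ref_chosen, ref_rejected = 0.20, 0.30  # Step 2: frozen reference probabilities

# Step 3: implicit rewards (how far each preference has shifted)
r_chosen = beta * math.log(p_chosen / ref_chosen)        # +0.041
r_rejected = beta * math.log(p_rejected / ref_rejected)  # -0.018

gap = r_chosen - r_rejected   # Step 4: preference gap, 0.059
confidence = sigmoid(gap)     # Step 5: ~51.5% sure chosen is better
loss = -math.log(confidence)  # Step 6: ~0.664, so more training is needed
```

A gradient step on this loss would push `p_chosen` up and `p_rejected` down, widening the gap and shrinking the loss on the next pass.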
Interactive DPO Loss Calculator

Adjust the sliders to explore how each parameter affects the DPO loss.

$\beta$ 0.10

Drift penalty strength. Higher = stay closer to reference model.

log-ratio (chosen) 1.50

$\log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)}$ - positive means the trained model favors the chosen response more than the reference does.

log-ratio (rejected) -0.50

$\log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}$ - negative means the trained model disfavors the rejected response more than the reference does.

From these inputs the calculator reports four outputs: the implicit reward for the chosen response, the implicit reward for the rejected response, P(chosen > rejected), and the loss.
The loss is low when the model assigns higher implicit reward to the chosen response and lower to the rejected response.
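The calculator's arithmetic is simple enough to reimplement directly. A sketch (the function name is ours, not from any library) that reproduces the four outputs from the two log-ratio sliders and $\beta$:

```python
import math

def dpo_loss_from_log_ratios(logratio_chosen, logratio_rejected, beta=0.1):
    """Return (reward_chosen, reward_rejected, p_chosen_better, loss)."""
    reward_chosen = beta * logratio_chosen       # implicit reward, chosen
    reward_rejected = beta * logratio_rejected   # implicit reward, rejected
    margin = reward_chosen - reward_rejected
    p_chosen_better = 1 / (1 + math.exp(-margin))  # sigmoid of the margin
    loss = -math.log(p_chosen_better)
    return reward_chosen, reward_rejected, p_chosen_better, loss
```

At the defaults above ($\beta$ = 0.10, log-ratios 1.50 and -0.50), the implicit rewards are 0.15 and -0.05, P(chosen > rejected) is about 55%, and the loss is about 0.598. Raising $\beta$ with the same log-ratios produces a more confident model and a lower loss.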
Section 7

The Math Derivation

Optional deep dive - but here's what you need to know first.

What is the math trying to prove?

One thing: you don't need a reward model. The reward is already hiding inside the language model itself.

Step 1
RLHF has an optimal answer - a formula for the perfect policy
Step 2
Rearrange that formula so the reward = a ratio of the model vs the original
Step 3
Plug that into the preference equation - the hard part (Z(x)) cancels out. Z(x) is called a partition function (the name comes from physics - it "partitions" the total probability among all possible outcomes). In our case, it sums over ALL possible responses the model could generate for a prompt. Since a language model can write essentially infinite different sentences, computing Z(x) is impossibly expensive. But DPO's trick: Z(x) appears in both the chosen and rejected terms, so when you subtract them, it cancels out. We never need to compute it.
Step 4
What's left is a classification loss. That's DPO.
Why this matters: The math guarantees that DPO optimizes the exact same objective as RLHF - not an approximation, not a shortcut, but the same optimum in a simpler form. Like proving (a+b)² = a² + 2ab + b² - both sides are equal, but one form is easier to compute with.
Step-by-Step Derivation
Start with what RLHF optimizes: maximize reward, stay close to original.
Step 1: The RLHF Objective
$$\max_{\pi_\theta} \;\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta}\!\left[r(x, y)\right] - \beta \,\text{KL}\!\left[\pi_\theta(y|x) \,\|\, \pi_{\text{ref}}(y|x)\right]$$
The KL divergence penalty prevents the model from drifting too far from the SFT model - without it, the model could "hack" the reward by producing nonsensical or repetitive text.
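The remaining steps, written out (this follows the derivation in Rafailov et al.; the symbols match the DPO loss shown earlier):

Step 2: The Optimal Policy
The KL-constrained objective above has a closed-form optimum:
$$\pi^*(y|x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y|x) \exp\!\left(\frac{1}{\beta}\, r(x, y)\right)$$
Step 3: Rearranging for the Reward
Taking logs and solving for $r$ expresses the reward as a ratio of the optimal policy to the reference:
$$r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$$
Step 4: Substituting into the Preference Model
Plugging this into the Bradley-Terry preference probability, the intractable $\beta \log Z(x)$ term appears in both responses and cancels:
$$p(y_w \succ y_l \mid x) = \sigma\!\left(\beta \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)$$
Maximizing the likelihood of the observed preferences under this model gives exactly the DPO loss from Section 6.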
Section 8

DPO in PyTorch

The entire DPO loss fits in under 10 lines of code. Compare this to a full PPO implementation that typically runs hundreds of lines.

Python dpo_loss.py
import torch.nn.functional as F

def dpo_loss(pi_logps_chosen, pi_logps_rejected,
             ref_logps_chosen, ref_logps_rejected, beta=0.1):
    """Direct Preference Optimization loss."""
    # Implicit rewards from log-ratios
    chosen_rewards = beta * (pi_logps_chosen - ref_logps_chosen)
    rejected_rewards = beta * (pi_logps_rejected - ref_logps_rejected)

    # DPO loss = -log sigmoid(reward_chosen - reward_rejected)
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    return loss
That's it. No value function, no advantage estimation, no clipping, no reward model inference. The entire alignment signal comes from comparing log-probabilities of chosen vs rejected responses.
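Here's what a call looks like, with made-up log-probabilities for a batch of two preference pairs (the numbers are illustrative; in practice each one is the sum of per-token log-probs under the corresponding model):

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_logps_chosen, pi_logps_rejected,
             ref_logps_chosen, ref_logps_rejected, beta=0.1):
    """Same loss as above, repeated so this example is self-contained."""
    chosen_rewards = beta * (pi_logps_chosen - ref_logps_chosen)
    rejected_rewards = beta * (pi_logps_rejected - ref_logps_rejected)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Batch of two pairs: the trained model already prefers the chosen responses
# slightly more than the frozen reference does.
pi_chosen    = torch.tensor([-1.20, -0.90])
pi_rejected  = torch.tensor([-1.39, -1.60])
ref_chosen   = torch.tensor([-1.61, -1.10])
ref_rejected = torch.tensor([-1.20, -1.50])

loss = dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected)
# ~0.67, just below log 2 = 0.693: the reward margins are small and
# positive, so training has barely begun.
```

In a real training loop this loss is backpropagated through the policy log-probs only; the reference log-probs are computed once under torch.no_grad(), since the reference model is frozen.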
Section 9

Results

DPO matches or exceeds RLHF across sentiment, summarization, and dialogue tasks - at a fraction of the compute.

61%
GPT-4 Win Rate (TL;DR)
Beats
PPO in Every Way (IMDb)
~1/3
Compute vs RLHF
Benchmark Comparison
Columns: Task · Metric · PPO (RLHF) · DPO · Best-of-N
  • Sentiment (IMDb) · quality vs. drift from original (trade-off curve): DPO Pareto-dominates PPO - better quality AND less drift, winning on both axes · Best-of-N: N/A
  • Summarization (TL;DR) · GPT-4 win rate: PPO 57% · DPO 61% · Best-of-N 52%
  • Dialogue (Anthropic HH) · GPT-4 win rate: PPO 54% · DPO 60% · Best-of-128 58%
Quality vs. Drift from Original: DPO vs PPO (Sentiment)

Higher and to the left is better - more quality with less drift from the original model.

DPO achieves higher quality at every drift budget. This means DPO extracts more alignment signal per unit of drift from the reference policy.
Section 10

When to Use DPO vs RLHF

Each method has its sweet spot. Here's how to choose.

Use DPO when...
  • You have a static preference dataset
  • You want simplicity and reproducibility
  • Compute budget is limited
  • You need stable, crash-free training
  • Simple implementation is a priority
Use RLHF when...
  • You need the model to generate fresh responses during training
  • You have complex multi-turn tasks
  • You can afford the compute and engineering
  • You need the reward model for other purposes
  • Data mismatch (when training data doesn't match real use) is a major concern
Summary

Where It Fits, What It Proves, What Comes Next

The Problem

RLHF is unstable, expensive, and complex. It requires training a separate reward model, running PPO, and keeping multiple models in memory.

Most teams lack the resources to do it properly.

The Solution

  • DPO reformulates alignment as classification on preference pairs
  • One loss function, one hyperparameter ($\beta$)
  • Same theoretical objective as RLHF
  • Matches or exceeds PPO on all benchmarks tested
  • A fraction of the compute cost

Open Questions

  • Offline only - no generating fresh responses during training
  • Untested at largest scales (at time of publication)
  • Data mismatch when training data doesn't match real use
  • Bias amplification concerns
  • Likelihood displacement - DPO can push the chosen response's probability DOWN even while correctly ranking it above the rejected one (both drop, but rejected drops more - like two runners both slowing down, but the bad runner slows more)

"The best alignment technique is the one simple enough that everyone uses it correctly."