ChatGPT doesn't just predict words - it's been carefully trained to give answers humans prefer. But the standard method (RLHF) requires training a separate scoring AI and running reinforcement learning. DPO achieves the same result with a single classification loss. This post teaches you how, from scratch.
Why can't we just fine-tune on good examples? Why do we need human preferences at all?
RLHF works - but does it have to be this complicated?
What if the language model already knows what's good? And we just need to unlock it?
The Preference Problem
After SFT, the model can follow instructions and give safe answers. But which answer do humans actually prefer?
Both SFT and RLHF are alignment techniques - they both help the model do what humans want. But they handle different layers:
After SFT, the model can produce multiple safe responses to the same prompt. But which one is actually better?
What is Reinforcement Learning?
Before we can understand alignment, we need the learning framework behind it: try things, get feedback, improve.
Imagine an AI agent learning tic-tac-toe. Nobody tells it the rules or strategy. It just plays, gets told whether it won (+1), lost (-1), or drew (0), and slowly figures out what works.
Every RL system follows this cycle, whether it's playing chess, driving a car, or aligning a chatbot:
This cycle repeats thousands or millions of times. Each time, the agent gets a little better.
State
What the agent sees right now.
Chess: the board position
Chatbot: the user's prompt
Action
What the agent decides to do.
Chess: move a piece
Chatbot: generate a response
Reward
How good was that action?
Chess: +1 (win) or -1 (lose)
Chatbot: human says "good" or "bad"
Policy
The strategy: "when I see X, I do Y."
Chess: opening moves, tactics
Chatbot: the model's weights
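The four pieces above can be wired together in a few lines. Here is a toy sketch of the cycle - a two-armed bandit rather than tic-tac-toe, with made-up reward probabilities, but the same loop: act, get a reward, update the policy:

```python
import random

# Toy RL cycle: a two-armed bandit with made-up payoff probabilities.
# The "policy" is a value estimate per action; reward feedback improves it.
random.seed(0)
values = {"safe": 0.0, "risky": 0.0}   # the policy's current beliefs
counts = {"safe": 0, "risky": 0}

def reward(action):
    # Environment (hypothetical): "safe" pays off 80% of the time, "risky" 20%.
    return 1 if random.random() < (0.8 if action == "safe" else 0.2) else -1

for step in range(1000):
    # Action: explore occasionally, otherwise exploit the current policy
    if random.random() < 0.1:
        action = random.choice(["safe", "risky"])
    else:
        action = max(values, key=values.get)
    r = reward(action)                  # Reward: feedback from the environment
    counts[action] += 1                 # Policy update: running average of rewards
    values[action] += (r - values[action]) / counts[action]
```

After a thousand cycles, `values["safe"]` settles near +0.6 and `values["risky"]` near -0.6 - the agent has learned a policy without ever being told the rules.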
How RLHF Works
The Cook and the Critic: a powerful but complex alignment pipeline.
Step 1: Collect Human Preferences
Human annotators see a prompt and two responses. They pick which one is better. This creates preference pairs: (prompt, chosen, rejected). Try it yourself - click the better response:
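In code, each annotator click becomes one record. Using the crypto example from this post:

```python
# One preference pair, as produced by a single annotator comparison
preference_pair = {
    "prompt": "Should I invest all my savings in crypto?",
    "chosen": "Research thoroughly, diversify, only invest what you can afford to lose.",
    "rejected": "YES! Buy Bitcoin NOW! You'll be a millionaire!",
}
```

A preference dataset is just thousands of these triples - no numeric scores, only relative judgments.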
Step 2: Train a Reward Model
We can't have a human rate every single response during training - that's millions of responses. So we train a separate neural network to predict human scores automatically.
The reward model is trained on the preference pairs from Step 1. It sees thousands of examples where humans said "Response A is better than Response B" and learns to assign higher scores to responses humans would prefer.
The reward model is a neural network. Text goes in, a score comes out. Here's what happens inside:
Each connection has a weight (a number). During training, these weights are adjusted so the network outputs high scores for good responses and low scores for bad ones. The blue connections show the strongest signal paths.
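How does the adjustment work for one preference pair? The standard objective (Bradley-Terry) is to maximize the probability that the chosen response outscores the rejected one - equivalently, minimize the negative log of that probability. A sketch with made-up scalar scores:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Hypothetical reward-model scores for one preference pair
r_chosen, r_rejected = 2.1, 0.4

# Bradley-Terry loss: -log P(chosen beats rejected)
loss = -math.log(sigmoid(r_chosen - r_rejected))   # small when chosen scores higher
```

The bigger the margin between the two scores, the smaller the loss - so training pushes chosen and rejected scores apart.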
Now give it a prompt and ANY response - it outputs a score predicting how much humans would like it:
Step 3: PPO - What Actually Happens
PPO (Proximal Policy Optimization) is the RL algorithm that updates the model using the reward scores.
PPO = push up what the reward model likes, but stay anchored to who you used to be.
1. The model sees a prompt: "Should I invest all my savings in crypto?"
2. It writes a response: "Yeah, go all in on meme coins!"
3. A separate scoring AI (the reward model from Step 2) grades that response: 1.3 / 10. Ouch.
4. PPO reads that bad grade and nudges the model's internal settings so that next time, it is slightly less likely to say that.
5. Repeat thousands of times. Gradually, safe answers like "Diversify and only invest what you can afford to lose" climb to the top, and reckless answers sink to the bottom.
KL divergence stands for Kullback-Leibler divergence. It measures how different two probability distributions are - in our case, how far the training model has drifted from the original reference model.
Why it matters: Without the KL penalty, the model might find weird loopholes to get high reward scores - like repeating "I am helpful" 500 times (reward hacking). The KL divergence acts as a leash: "improve your answers, but don't change so much that you become a completely different model."
How it works in practice: The RLHF objective has a term $\beta \cdot D_{\text{KL}}[\pi_\theta \| \pi_{\text{ref}}]$ that penalizes large changes. The hyperparameter $\beta$ controls how tight the leash is - higher $\beta$ means the model stays closer to the original, lower $\beta$ gives it more freedom to change.
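That penalized objective can be sketched with toy numbers. The per-token log-ratio sum below is a common sample-based estimate of the KL term; all values here are made up for illustration:

```python
import math

beta = 0.02                        # leash tightness (hypothetical value)
reward_model_score = 1.3           # score from the reward model (made up)

# Per-token log-probs of the sampled response under the training policy
# and the frozen reference model (made-up values)
pi_logps  = [-1.2, -0.4, -2.1]
ref_logps = [-1.5, -0.5, -1.0]

# Sample-based KL estimate: sum of log-ratios log(pi/ref) per token
kl_estimate = sum(p - r for p, r in zip(pi_logps, ref_logps))

# Reward the policy actually trains on: raw score minus the KL penalty
penalized_reward = reward_model_score - beta * kl_estimate
```

Higher `beta` shrinks the reward whenever the policy drifts from the reference - the leash in code form.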
Five candidate responses to "Should I invest all my savings in crypto?" - watch how PPO reshapes the probability distribution:
DPO - Skip the Critic
What if we skip the reward model entirely and train directly on the preference pairs?
"Research thoroughly, diversify, only invest what you can afford to lose."
"YES! Buy Bitcoin NOW! You'll be a millionaire!"
That's it. No reward model, no critic model, no RL loop. Just the policy model and a frozen reference model - increase the probability of good responses, decrease the probability of bad ones.
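Concretely, the single loss DPO minimizes per preference pair is:

$\mathcal{L}_{\text{DPO}} = -\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)$

where $y_w$ is the chosen response, $y_l$ the rejected one, and $\beta$ controls how strongly drift from the reference model is penalized.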
Walking Through the Math
Follow one training example, step by step, with real numbers - no background knowledge needed.
Chosen: "Research thoroughly, diversify..."
Rejected: "YES! Buy Bitcoin NOW!"
Beta: 0.1
Model says rejected has probability: 0.25 (25%)
Reference says rejected has probability: 0.30 (30%)
Rejected reward = beta × log(0.25 / 0.30) = 0.1 × (-0.182) = -0.018
(Model is only 51.5% sure chosen is better - needs more training)
(High loss = model hasn't learned yet)
Model prob for rejected: 0.25 → 0.25
New reward gap: 0.059
New confidence score: 51.5%
New loss: 0.664
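The walkthrough's numbers can be reproduced in a few lines. The rejected-side probabilities (0.25 vs. 0.30) come from the example above; the chosen-side probabilities (0.45 for the model vs. 0.30 for the reference) are assumed values picked to match the reported 0.059 reward gap:

```python
import math

beta = 0.1

# Rejected side: probabilities from the walkthrough above
r_rejected = beta * math.log(0.25 / 0.30)    # implicit reward, about -0.018

# Chosen side: hypothetical probabilities that reproduce the article's gap
r_chosen = beta * math.log(0.45 / 0.30)      # implicit reward, about 0.041

gap = r_chosen - r_rejected                  # reward gap, about 0.059
confidence = 1 / (1 + math.exp(-gap))        # sigma(gap), about 0.515
loss = -math.log(confidence)                 # DPO loss, about 0.664
```

A confidence barely above 50% and a loss near 0.66 means the model has not yet learned to separate the two responses - exactly the "needs more training" state shown above.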
Adjust the sliders to explore how each parameter affects the DPO loss.
Drift penalty strength. Higher = stay closer to reference model.
$\log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)}$ - positive means the trained model favors the chosen response more than the reference does.
$\log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}$ - negative means the trained model disfavors the rejected response more than the reference does.
The Math Derivation
Optional deep dive - but here's what you need to know first.
The key insight: you don't need a separate reward model. The reward is already hiding inside the language model itself.
The Bradley-Terry model says $P(y_w \succ y_l) = \sigma(r(x, y_w) - r(x, y_l))$. Substituting the implicit reward $r(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$:

$P(y_w \succ y_l) = \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} + \beta \log Z(x) - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} - \beta \log Z(x)\right)$
The $\beta \log Z(x)$ terms cancel because both responses share the same prompt!
DPO in PyTorch
The entire DPO loss fits in under 10 lines of code. Compare this to a full PPO implementation that typically runs hundreds of lines.
```python
import torch.nn.functional as F

def dpo_loss(pi_logps_chosen, pi_logps_rejected,
             ref_logps_chosen, ref_logps_rejected, beta=0.1):
    """Direct Preference Optimization loss."""
    # Implicit rewards from log-ratios
    chosen_rewards = beta * (pi_logps_chosen - ref_logps_chosen)
    rejected_rewards = beta * (pi_logps_rejected - ref_logps_rejected)
    # DPO loss = -log sigmoid(reward_chosen - reward_rejected)
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    return loss
```
Results
DPO matches or exceeds RLHF across sentiment, summarization, and dialogue tasks - at a fraction of the compute.
| Task | Metric | PPO (RLHF) | DPO | Best-of-N |
|---|---|---|---|---|
| Sentiment (IMDb) | Quality vs. Drift from Original | Strong trade-off curve | Pareto-dominates PPO (better quality AND less drift) | N/A |
| Summarization (TL;DR) | GPT-4 Win Rate | 57% | 61% | 52% |
| Dialogue (Anthropic HH) | GPT-4 Win Rate | 54% | 60% | Best-of-128: 58% |
Higher and to the left is better - more quality with less drift from the original model.
When to Use DPO vs RLHF
Each method has its sweet spot. Here's how to choose.
Choose DPO when:
- You have a static preference dataset
- You want simplicity, reproducibility, and an easy implementation
- Compute budget is limited
- You need stable, crash-free training

Choose RLHF (PPO) when:
- You need the model to generate fresh responses during training
- You have complex multi-turn tasks
- You can afford the compute and engineering
- You need the reward model for other purposes
- Data mismatch (when training data doesn't match real use) is a major concern
Where It Fits, What It Proves, What Comes Next
The Problem
RLHF is unstable, expensive, and complex. It requires training a separate reward model, running PPO, and keeping multiple models in memory.
Most teams lack the resources to do it properly.
The Solution
- DPO reformulates alignment as classification on preference pairs
- One loss function, one hyperparameter ($\beta$)
- Same theoretical objective as RLHF
- Matches or exceeds PPO on all benchmarks tested
- A fraction of the compute cost
Open Questions
- Offline only - no generating fresh responses during training
- Untested at largest scales (at time of publication)
- Data mismatch when training data doesn't match real use
- Bias amplification concerns
- Likelihood displacement - DPO can push the chosen response's probability DOWN even while correctly ranking it above the rejected one (both drop, but rejected drops more - like two runners both slowing down, but the bad runner slows more)
"The best alignment technique is the one simple enough that everyone uses it correctly."