Interactive Demo

Weak-Driven
Learning

How Weak Agents Make Strong Agents Stronger

Based on Chen et al. (2026) · arXiv:2602.08222

Post-Training Logit Mixing Gradient Amplification
Scroll to explore the full pipeline
Overview

The Journey of a Language Model

From raw text to saturation, and how WMSS breaks through.

Pre-training
Learn language
SFT
Learn the task
Saturation
Gradients vanish
WMSS
Weak drives strong
The core idea: Instead of learning from a stronger teacher (knowledge distillation), WMSS uses a weaker historical checkpoint to inject structured uncertainty that reactivates vanishing gradients and drives the strong model to keep improving.
Phase 0

Pre-training: Learning Language

The base model learns to predict the next token from a huge text corpus. Its predictions start nearly uniform across the vocabulary.

Next-Token Prediction

Context: "The cat sat on the" → predict next token from vocabulary: ["cat", "dog", "car", "hat", "the"]

Base Model Probabilities (nearly uniform)
The base model assigns roughly equal probability to all tokens. It has learned language patterns but hasn't been fine-tuned for any specific task.
Phase 1

Supervised Fine-Tuning & Saturation

SFT sharpens the model's predictions toward the correct answer. But once the model becomes highly confident, gradients vanish and training stalls.

$$\mathcal{L}_{\text{SFT}} = -\log P_\theta(y \mid x) \qquad \frac{\partial \mathcal{L}}{\partial z_k} = P_\theta(k \mid x) - \mathbb{1}[k = y]$$

The gradient on non-target tokens equals their probability — as $P(k|x) \to 0$, the gradient vanishes.
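The vanishing gradient can be checked numerically. A minimal sketch, assuming a hypothetical 5-token vocabulary and illustrative logit values (not the demo's actual numbers):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())            # shift by max for numerical stability
    return e / e.sum()

vocab = ["cat", "dog", "car", "hat", "the"]
target = 0                             # "cat" is the correct next token

# Near-uniform base model vs. a confident post-SFT model.
for label, logits in [("base", np.zeros(5)),
                      ("after SFT", np.array([6.0, 1.0, 0.0, 1.0, 0.0]))]:
    p = softmax(logits)
    # Total SFT gradient magnitude on non-target logits: sum_{k != y} P(k|x).
    nontarget_grad = 1.0 - p[target]
    print(f"{label}: P(cat)={p[target]:.3f}, non-target grad mass={nontarget_grad:.3f}")
```

As the target probability approaches 1, the non-target gradient mass (exactly the quantity tracked by the "Total ∇ (non-target)" readout below) collapses toward 0, which is the saturation the section describes.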

Interactive SFT Training

Target token: "cat" — click Step to train

Probabilities $P(k|x)$
Gradient Magnitude $|\nabla_{z_k}|$
Epoch 0 / 6
P(cat)
Total ∇ (non-target)
Entropy H
Phase 2

Weak Agent vs Strong Agent

WMSS saves an earlier checkpoint as the Weak Agent and continues training the current model as the Strong Agent. The weak agent retains a softer decision boundary with probability mass on “hard negatives.”

$$M_{\text{weak}} \leftarrow M_0 \quad(\text{base checkpoint}), \qquad M_{\text{strong}} \leftarrow M_1 \quad(\text{after SFT})$$
Side-by-Side Comparison
Weak Agent (checkpoint $M_0$)
Entropy H
Strong Agent (after SFT $M_1$)
Entropy H
Notice: the weak agent assigns significant probability to "dog" and "hat" — these are hard negatives, plausible but incorrect tokens that the strong agent has suppressed. WMSS will use these signals.
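The entropy gap the two panels visualize can be computed directly. The distributions below are illustrative placeholders, not the demo's actual values:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log(p + 1e-12)).sum())   # Shannon entropy in nats

# Illustrative distributions over ["cat", "dog", "car", "hat", "the"].
p_weak   = [0.40, 0.25, 0.05, 0.20, 0.10]    # M_0: mass kept on hard negatives
p_strong = [0.97, 0.01, 0.005, 0.01, 0.005]  # M_1: sharply peaked after SFT

print(f"H(weak)   = {entropy(p_weak):.3f} nats")
print(f"H(strong) = {entropy(p_strong):.3f} nats")
```

The weak checkpoint's higher entropy is precisely the "structured uncertainty" WMSS will reuse: it keeps "dog" and "hat" alive as hard negatives.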
Phase 2A

Curriculum-Enhanced Data Activation

Not all training samples are equally useful. WMSS uses entropy dynamics between weak and strong agents to weight samples by three signals: base difficulty, consolidation, and regression repair.

$$p_i \;\propto\; \alpha \cdot H(M_{\text{weak}};\,x_i) \;+\; \beta \cdot [-\Delta H_i]_+ \;+\; \gamma \cdot [\Delta H_i]_+ \qquad \Delta H_i = H(M_{\text{strong}};\,x_i) - H(M_{\text{weak}};\,x_i)$$
Sample Weighting
$\alpha$ 1.0
$\beta$ 1.0
$\gamma$ 1.0
Consolidation ($\Delta H < 0$) · Regression Repair ($\Delta H > 0$) · Low weight
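A sketch of the weighting rule above; the function name `sample_weights` and all entropy values are hypothetical placeholders, not the paper's API:

```python
import numpy as np

def sample_weights(h_weak, h_strong, alpha=1.0, beta=1.0, gamma=1.0):
    """Curriculum sampling weights p_i from weak/strong entropy dynamics."""
    h_weak = np.asarray(h_weak, dtype=float)
    h_strong = np.asarray(h_strong, dtype=float)
    dh = h_strong - h_weak                        # ΔH_i
    score = (alpha * h_weak                       # base difficulty
             + beta * np.clip(-dh, 0.0, None)     # consolidation (ΔH < 0)
             + gamma * np.clip(dh, 0.0, None))    # regression repair (ΔH > 0)
    return score / score.sum()                    # normalize so p_i sums to 1

# Three hypothetical samples: strong improved on #0, regressed on #1.
w = sample_weights(h_weak=[1.5, 0.8, 1.2], h_strong=[0.3, 1.4, 1.1])
print(w)
```

With all coefficients at 1.0, the sample that is both hard for the weak agent and strongly consolidated receives the largest weight; raising $\gamma$ shifts mass toward regressed samples instead.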
Phase 2B — The Core

Joint Training via Logit Mixing

The heart of WMSS: mix the weak agent's logits with the strong agent's logits to create a joint distribution. This reintroduces probability mass on hard negatives, amplifying gradients that had vanished.

$$z_{\text{mix}} = \lambda \cdot z_{\text{strong}} + (1 - \lambda) \cdot z_{\text{weak}} \qquad \mathcal{L}_{\text{mix}} = -\log P_{\text{mix}}(y \mid x)$$
Interactive Logit Mixing
$\lambda$ 0.70
Strong Logits
Weak Logits
Mixed Logits ($z_{\text{mix}}$)
$P_{\text{strong}}$  vs  $P_{\text{mix}}$ — Probability Comparison
$\nabla_{\text{strong}}$  vs  $\nabla_{\text{mix}}$ — Gradient Magnitudes (non-target tokens)
Total ∇ (strong only)
Total ∇ (mixed)
Amplification
Key insight: As $\lambda$ decreases, the weak agent's softer distribution injects more probability mass onto hard negatives. This directly amplifies gradient magnitudes via $\frac{\partial\mathcal{L}}{\partial z_k} = P_{\text{mix}}(k|x)$, reactivating learning in saturated regions.
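A numerical sketch of this mixing effect, using illustrative logits (not the demo's actual values):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Illustrative logits over ["cat", "dog", "car", "hat", "the"]; target "cat".
z_strong = np.array([6.0, 1.0, 0.0, 1.0, 0.0])   # saturated after SFT
z_weak   = np.array([1.5, 1.0, 0.2, 0.9, 0.4])   # softer earlier checkpoint
target = 0

for lam in (1.0, 0.7, 0.4):
    z_mix = lam * z_strong + (1 - lam) * z_weak
    p_mix = softmax(z_mix)
    grad = 1.0 - p_mix[target]   # total non-target gradient: sum_{k!=y} P_mix(k|x)
    print(f"lambda={lam:.1f}: P_mix(cat)={p_mix[target]:.3f}, non-target grad={grad:.3f}")
```

Lowering $\lambda$ mixes in the weak agent's softer logits, so $P_{\text{mix}}(\text{cat})$ drops and the total non-target gradient grows, which is the amplification factor the panel measures.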
Algorithm 1

The Complete Pipeline

WMSS follows three phases. Watch the algorithm execute step by step.

WMSS Pipeline Animation
Base Model $M_0$
Pre-trained base
SFT → $M_1$
Phase 1: Init
Curriculum
Phase 2A: $\Delta H$
Logit Mix
Phase 2B: $z_{\text{mix}}$
$M_{\text{strong}}^+$
Stronger!
Click Play Pipeline to watch WMSS execute step by step.
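The three phases can be summarized as pseudocode. Every helper here (`sft`, `entropies`, `sample_weights`, `mixed_ce_step`, `logits`) is a hypothetical name sketching the structure, not the paper's actual API:

```python
def wmss(base_model, data, lam=0.7, steps=1000):
    # Phase 1: supervised fine-tuning produces the strong agent M_1.
    strong = sft(base_model, data)
    weak = base_model                    # frozen earlier checkpoint M_0

    for _ in range(steps):
        # Phase 2A: curriculum-enhanced data activation via entropy dynamics.
        h_w, h_s = entropies(weak, data), entropies(strong, data)
        batch = data.sample(weights=sample_weights(h_w, h_s))

        # Phase 2B: logit mixing; only the strong agent receives gradients.
        z_mix = lam * strong.logits(batch) + (1 - lam) * weak.logits(batch)
        mixed_ce_step(strong, z_mix, batch.targets)   # L_mix = -log P_mix(y|x)

    return strong                        # M_strong^+
```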
Results

Breaking the Saturation Ceiling

While standard SFT plateaus, WMSS continues to improve — at zero additional inference cost.

Training Curves: SFT vs WMSS
Standard SFT WMSS (Weak-Driven)
Key result: WMSS delivers consistent gains on mathematical reasoning (GSM8K, MATH) and code generation (HumanEval, MBPP) benchmarks, purely from improved optimization dynamics during training, with no additional inference cost.