The Journey of a Language Model
From raw text to saturation, and how WMSS breaks through.
Pre-training: Learning Language
The base model learns to predict the next token from a huge text corpus. Its predictions start nearly uniform across the vocabulary.
Context: "The cat sat on the" → predict next token from vocabulary:
["cat", "dog", "car", "hat", "the"]
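The "nearly uniform" claim can be illustrated numerically: with the tiny logits a freshly initialized model produces, softmax spreads its mass almost evenly and the entropy sits near its maximum, log |V|. A minimal sketch (the vocabulary and logit values are invented for illustration):

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

vocab = ["cat", "dog", "car", "hat", "the"]
logits = [0.03, -0.02, 0.01, 0.00, -0.01]  # tiny values, as at initialization

probs = softmax(logits)
entropy = -sum(p * math.log(p) for p in probs)

print([round(p, 3) for p in probs])   # every token close to 1/5
print(round(entropy, 4))              # close to log(5), the maximum
```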
Supervised Fine-Tuning & Saturation
SFT sharpens the model's predictions toward the correct answer. But once the model becomes highly confident, gradients vanish and training stalls.
For softmax cross-entropy, the gradient of the loss with respect to a non-target logit $k$ equals its probability: as $P(k|x) \to 0$, the gradient vanishes.
Target token: "cat".
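The vanishing-gradient claim is easy to verify numerically: for softmax cross-entropy, $\partial L / \partial z_k = P(k|x) - \mathbb{1}[k = \text{target}]$, so each non-target gradient is exactly that token's probability. A minimal sketch (the logit values are invented):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def ce_grad(logits, target):
    # d(loss)/d(logit_k) = P(k) - 1[k == target] for cross-entropy on softmax.
    probs = softmax(logits)
    return [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]

vocab = ["cat", "dog", "car", "hat", "the"]
target = 0  # "cat"

early = [0.1, 0.0, 0.0, 0.0, 0.0]   # near-uniform, early in training
late = [10.0, 0.0, 0.0, 0.0, 0.0]   # highly confident after SFT

for name, logits in [("early", early), ("late", late)]:
    print(name, [round(g, 5) for g in ce_grad(logits, target)])
```

In the "late" case the non-target probabilities, and therefore the non-target gradients, collapse to roughly 1e-5: this is the saturation the section describes.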
Weak Agent vs Strong Agent
WMSS saves an earlier checkpoint as the Weak Agent and continues training the current model as the Strong Agent. The weak agent retains a softer decision boundary with probability mass on “hard negatives.”
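A toy comparison of the two checkpoints' output distributions makes "softer decision boundary" concrete (the logit values are invented): the earlier checkpoint keeps visible mass on a plausible confusion like "dog", while the later one is nearly one-hot.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

vocab = ["cat", "dog", "car", "hat", "the"]
weak_logits = [2.0, 1.0, 0.0, 0.5, -0.5]    # earlier checkpoint: softer
strong_logits = [9.0, 1.0, 0.0, 0.5, -0.5]  # later checkpoint: saturated

p_weak, p_strong = softmax(weak_logits), softmax(strong_logits)
print("weak  ", [round(p, 3) for p in p_weak])    # mass left on hard negatives
print("strong", [round(p, 3) for p in p_strong])  # nearly all mass on "cat"
print(round(entropy(p_weak), 3), round(entropy(p_strong), 3))
```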
Curriculum-Enhanced Data Activation
Not all training samples are equally useful. WMSS uses entropy dynamics between weak and strong agents to weight samples by three signals: base difficulty, consolidation, and regression repair.
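The section names the three signals but not their formulas, so the scoring rule below is a hypothetical sketch: it maps each signal onto the weak and strong agents' per-sample entropies, and the functional forms and coefficients are assumptions, not the paper's actual algorithm.

```python
import math

def entropy(probs):
    # Shannon entropy of a token distribution, in nats.
    return -sum(p * math.log(p) for p in probs if p > 0)

def sample_weight(h_weak, h_strong, a=1.0, b=1.0, c=2.0):
    """Hypothetical curriculum weight from weak/strong entropies.

    - difficulty: the strong agent is still uncertain on this sample
    - consolidation: entropy dropped since the weak checkpoint (lock in progress)
    - regression: entropy rose since the weak checkpoint (forgetting to repair)
    All forms and coefficients here are assumptions for illustration.
    """
    difficulty = h_strong
    consolidation = max(0.0, h_weak - h_strong)
    regression = max(0.0, h_strong - h_weak)
    return a * difficulty + b * consolidation + c * regression

# Three illustrative samples (entropies in nats, invented):
mastered = sample_weight(h_weak=1.2, h_strong=0.1)   # solved and stable
hard = sample_weight(h_weak=1.5, h_strong=1.4)       # still difficult
regressed = sample_weight(h_weak=0.3, h_strong=1.0)  # got worse: repair it

print(round(mastered, 2), round(hard, 2), round(regressed, 2))
```

Under this sketch, regressed samples outrank still-hard ones, which in turn outrank mastered ones, matching the intuition that training effort should follow the entropy dynamics.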
Joint Training via Logit Mixing
The heart of WMSS: mix the weak agent's logits with the strong agent's logits to create a joint distribution. This reintroduces probability mass on hard negatives, amplifying gradients that had vanished.
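A minimal numerical sketch of the mixing step (the convex-combination form and the λ value are assumptions): blending the weak agent's softer logits into the strong agent's saturated ones restores non-target probability, and since the cross-entropy gradient on a non-target logit equals its probability, that directly amplifies the vanished gradient.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def mix_logits(weak, strong, lam=0.3):
    # Assumed joint distribution: convex combination of the agents' logits.
    return [lam * w + (1 - lam) * s for w, s in zip(weak, strong)]

# Target token is index 0 ("cat"); logit values invented for illustration.
weak = [2.0, 1.0, 0.0, 0.5, -0.5]    # softer weak-agent logits
strong = [10.0, 0.0, 0.0, 0.0, 0.0]  # saturated strong-agent logits

# Gradient on the non-target logit "dog" equals its probability.
g_plain = softmax(strong)[1]
g_joint = softmax(mix_logits(weak, strong))[1]

print(g_plain, g_joint)  # the joint gradient is over 10x the plain one here
```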
The Complete Pipeline
WMSS proceeds in three phases: checkpoint a weak agent, weight training samples via curriculum-enhanced data activation, and jointly train the strong agent through logit mixing.
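The phases can be strung together as a toy training loop. Everything below is a sketch under stated assumptions: the "model" is a single logit vector for one example, the data-activation phase is omitted (one sample, so there is nothing to weight), and the mixing gradient is applied directly to the strong logits as a simplification.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def ce_grad(logits, target):
    # d(loss)/d(logit_k) = P(k) - 1[k == target]
    return [p - (1.0 if k == target else 0.0)
            for k, p in enumerate(softmax(logits))]

target, steps, checkpoint_at, lam = 0, 60, 20, 0.3
logits = [0.0] * 5  # toy "model": one logit vector over a 5-token vocabulary
weak = None

for t in range(steps):
    if t == checkpoint_at:
        weak = list(logits)                # Phase 1: freeze the weak agent
    if weak is None:
        grad = ce_grad(logits, target)     # plain SFT before the checkpoint
    else:
        joint = [lam * w + (1 - lam) * z for w, z in zip(weak, logits)]
        grad = ce_grad(joint, target)      # Phase 3: gradient on joint logits
    logits = [z - g for z, g in zip(logits, grad)]  # SGD step, lr = 1

print(round(softmax(logits)[target], 4))   # strong agent ends highly confident
```

Because the weak agent only supplies logits during training, nothing extra runs at inference time, which is the "zero additional inference cost" property claimed below.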
Breaking the Saturation Ceiling
While standard SFT plateaus, WMSS continues to improve — at zero additional inference cost.