The Journey of a Language Model
From raw text to saturation, and how WMSS breaks through.
Pre-training: Learning Language
The base model learns to predict the next token from a huge text corpus. Its predictions start nearly uniform across the vocabulary.
Context: "The cat sat on the" → predict next token from vocabulary:
["cat", "dog", "car", "hat", "the"]
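The "nearly uniform" claim can be illustrated numerically: with the tiny logits a freshly initialized model produces, softmax spreads its mass almost evenly and the entropy sits near its maximum, log |V|. A minimal sketch (the vocabulary and logit values are invented for illustration):

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

vocab = ["cat", "dog", "car", "hat", "the"]
logits = [0.03, -0.02, 0.01, 0.00, -0.01]  # tiny values, as at initialization

probs = softmax(logits)
entropy = -sum(p * math.log(p) for p in probs)

print([round(p, 3) for p in probs])   # every token close to 1/5
print(round(entropy, 4))              # close to log(5), the maximum
```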
Supervised Fine-Tuning & Saturation
SFT sharpens the model's predictions toward the correct answer. But once the model becomes highly confident, gradients vanish and training stalls.
For softmax cross-entropy, the gradient of the loss with respect to a non-target logit $k$ equals its probability: as $P(k|x) \to 0$, the gradient vanishes.
Target token: "cat".
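The vanishing-gradient claim is easy to verify numerically: for softmax cross-entropy, $\partial L / \partial z_k = P(k|x) - \mathbb{1}[k = \text{target}]$, so each non-target gradient is exactly that token's probability. A minimal sketch (the logit values are invented):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def ce_grad(logits, target):
    # d(loss)/d(logit_k) = P(k) - 1[k == target] for cross-entropy on softmax.
    probs = softmax(logits)
    return [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]

vocab = ["cat", "dog", "car", "hat", "the"]
target = 0  # "cat"

early = [0.1, 0.0, 0.0, 0.0, 0.0]   # near-uniform, early in training
late = [10.0, 0.0, 0.0, 0.0, 0.0]   # highly confident after SFT

for name, logits in [("early", early), ("late", late)]:
    print(name, [round(g, 5) for g in ce_grad(logits, target)])
```

In the "late" case the non-target probabilities, and therefore the non-target gradients, collapse to roughly 1e-5: this is the saturation the section describes.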
Weak Agent vs Strong Agent
WMSS saves an earlier checkpoint as the Weak Agent and continues training the current model as the Strong Agent. The weak agent retains a softer decision boundary with probability mass on “hard negatives.”
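A toy comparison of the two checkpoints' output distributions makes "softer decision boundary" concrete (the logit values are invented): the earlier checkpoint keeps visible mass on a plausible confusion like "dog", while the later one is nearly one-hot.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

vocab = ["cat", "dog", "car", "hat", "the"]
weak_logits = [2.0, 1.0, 0.0, 0.5, -0.5]    # earlier checkpoint: softer
strong_logits = [9.0, 1.0, 0.0, 0.5, -0.5]  # later checkpoint: saturated

p_weak, p_strong = softmax(weak_logits), softmax(strong_logits)
print("weak  ", [round(p, 3) for p in p_weak])    # mass left on hard negatives
print("strong", [round(p, 3) for p in p_strong])  # nearly all mass on "cat"
print(round(entropy(p_weak), 3), round(entropy(p_strong), 3))
```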
Curriculum-Enhanced Data Activation
Not all training samples are equally useful. WMSS uses entropy dynamics between weak and strong agents to weight samples by three signals: base difficulty, consolidation, and regression repair.
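The section names the three signals but not their formulas, so the scoring rule below is a hypothetical sketch: it maps each signal onto the weak and strong agents' per-sample entropies, and the functional forms and coefficients are assumptions, not the paper's actual algorithm.

```python
import math

def entropy(probs):
    # Shannon entropy of a token distribution, in nats.
    return -sum(p * math.log(p) for p in probs if p > 0)

def sample_weight(h_weak, h_strong, a=1.0, b=1.0, c=2.0):
    """Hypothetical curriculum weight from weak/strong entropies.

    - difficulty: the strong agent is still uncertain on this sample
    - consolidation: entropy dropped since the weak checkpoint (lock in progress)
    - regression: entropy rose since the weak checkpoint (forgetting to repair)
    All forms and coefficients here are assumptions for illustration.
    """
    difficulty = h_strong
    consolidation = max(0.0, h_weak - h_strong)
    regression = max(0.0, h_strong - h_weak)
    return a * difficulty + b * consolidation + c * regression

# Three illustrative samples (entropies in nats, invented):
mastered = sample_weight(h_weak=1.2, h_strong=0.1)   # solved and stable
hard = sample_weight(h_weak=1.5, h_strong=1.4)       # still difficult
regressed = sample_weight(h_weak=0.3, h_strong=1.0)  # got worse: repair it

print(round(mastered, 2), round(hard, 2), round(regressed, 2))
```

Under this sketch, regressed samples outrank still-hard ones, which in turn outrank mastered ones, matching the intuition that training effort should follow the entropy dynamics.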
Joint Training via Logit Mixing
The heart of WMSS: mix the weak agent's logits with the strong agent's logits to create a joint distribution. This reintroduces probability mass on hard negatives, amplifying gradients that had vanished.
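A minimal numerical sketch of the mixing step (the convex-combination form and the λ value are assumptions): blending the weak agent's softer logits into the strong agent's saturated ones restores non-target probability, and since the cross-entropy gradient on a non-target logit equals its probability, that directly amplifies the vanished gradient.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def mix_logits(weak, strong, lam=0.3):
    # Assumed joint distribution: convex combination of the agents' logits.
    return [lam * w + (1 - lam) * s for w, s in zip(weak, strong)]

# Target token is index 0 ("cat"); logit values invented for illustration.
weak = [2.0, 1.0, 0.0, 0.5, -0.5]    # softer weak-agent logits
strong = [10.0, 0.0, 0.0, 0.0, 0.0]  # saturated strong-agent logits

# Gradient on the non-target logit "dog" equals its probability.
g_plain = softmax(strong)[1]
g_joint = softmax(mix_logits(weak, strong))[1]

print(g_plain, g_joint)  # the joint gradient is over 10x the plain one here
```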
The Complete Pipeline
WMSS proceeds in three phases: checkpoint a weak agent, weight training samples via curriculum-enhanced data activation, and jointly train the strong agent through logit mixing.
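The phases can be strung together as a toy training loop. Everything below is a sketch under stated assumptions: the "model" is a single logit vector for one example, the data-activation phase is omitted (one sample, so there is nothing to weight), and the mixing gradient is applied directly to the strong logits as a simplification.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def ce_grad(logits, target):
    # d(loss)/d(logit_k) = P(k) - 1[k == target]
    return [p - (1.0 if k == target else 0.0)
            for k, p in enumerate(softmax(logits))]

target, steps, checkpoint_at, lam = 0, 60, 20, 0.3
logits = [0.0] * 5  # toy "model": one logit vector over a 5-token vocabulary
weak = None

for t in range(steps):
    if t == checkpoint_at:
        weak = list(logits)                # Phase 1: freeze the weak agent
    if weak is None:
        grad = ce_grad(logits, target)     # plain SFT before the checkpoint
    else:
        joint = [lam * w + (1 - lam) * z for w, z in zip(weak, logits)]
        grad = ce_grad(joint, target)      # Phase 3: gradient on joint logits
    logits = [z - g for z, g in zip(logits, grad)]  # SGD step, lr = 1

print(round(softmax(logits)[target], 4))   # strong agent ends highly confident
```

Because the weak agent only supplies logits during training, nothing extra runs at inference time, which is the "zero additional inference cost" property claimed below.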
Breaking the Saturation Ceiling
While standard SFT plateaus, WMSS continues to improve — at zero additional inference cost.