You've probably heard that bigger models and more data make AI better. But what if the training itself is the bottleneck?
This post explains a simple but powerful idea from recent research: a model's own weaker past self can make it smarter - no bigger model, no extra data, no added inference cost. The technique is called WMSS (Weak-Driven Learning), and it improves math and code benchmarks by roughly +6% over standard SFT.
Before you scroll, ask yourself:
Why does training stall even when a model hasn't fully learned the material? (Hint: it's not overfitting.)
If a student aces practice exams by memorising answers, will they pass the real test? What's missing from their learning?
Can a weaker model teach a stronger one something useful? Sounds backwards - but it works. This post shows you exactly how and why.
How Are LLMs Trained?
Three steps from raw text to a working AI assistant.
Pre-Training
Learn language from billions of pages of text. Output: a base model that can predict words.
Post-Training
Supervised Fine-Tuning (SFT): Train on expert Q&A examples.
RLHF: Optimise with human preference feedback.
Deployed Model
Ready to chat, answer questions, and assist. This is ChatGPT, Claude, Gemini.
What is Supervised Fine-Tuning?
Teaching a model to follow instructions using curated examples.
Experts Create Data
Human experts write high-quality question-answer pairs. These become the training examples.
Train the Model
Model learns to mimic expert answers.
A: Sunlight scatters off air molecules. Blue light scatters more because of its shorter wavelength.
Model Improves
Now follows instructions and gives helpful answers. Ready for the next step: RLHF alignment.
The Overconfident Student
To understand why SFT hits a ceiling, think about exam prep...
The Journey of a Language Model
From raw text to saturation, and how WMSS breaks through.
Pre-training: Learning Language
The base model learns to predict the next token from a huge text corpus. Its predictions start nearly uniform across the vocabulary.
Context: "The derivative of x² is" → predict next token from vocabulary:
["2x", "x", "2", "x²", "dx"]
Supervised Fine-Tuning & Saturation
SFT sharpens the model's predictions toward the correct answer. But once the model becomes highly confident, gradients vanish and training stalls.
The cross-entropy gradient on each non-target token $k$ equals its probability - as $P(k|x) \to 0$, the gradient vanishes.
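This is easy to verify numerically. The sketch below (illustrative logits, not from the paper) computes the cross-entropy gradient with respect to the logits, which for softmax is simply `p - onehot(target)`: every non-target entry is exactly $P(k|x)$, so a saturated model produces near-zero gradients.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

vocab = ["2x", "x", "2", "x²", "dx"]
target = 0  # ground truth: "2x"

early = np.array([1.0, 0.8, 0.6, 0.4, 0.2])   # shortly after pre-training
late = np.array([9.0, 2.0, 1.0, 0.5, 0.0])    # after heavy SFT: saturated

for name, logits in [("early", early), ("late", late)]:
    p = softmax(logits)
    grad = p.copy()
    grad[target] -= 1.0  # d(cross-entropy)/d(logits) = p - onehot
    # Non-target gradient magnitudes equal the non-target probabilities
    print(name, np.round(np.abs(grad[1:]), 5))
```

Running this shows the non-target gradients shrinking from roughly 0.1-0.2 early in training to under 0.001 once the model is confident.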
Target token: "2x"
Learning from Mistakes
What if the student had a study partner who keeps challenging them with tricky wrong answers?
Three Paradigms for Improving Models
Where does the learning signal come from? Three paradigms - each using a different source and direction of supervision.
Knowledge Distillation
A larger, more capable model teaches a smaller one. Knowledge flows down.
Requires a stronger model. When the student catches up, the signal vanishes.
Self-Distillation
An earlier checkpoint of the same model provides uncertainty signals. No external model needed.
Uses the model's own history as teacher.
Weak-Driven Learning
An older, weaker checkpoint injects uncertainty upward into the strong model's training signal.
No extra model. No extra inference cost. Just one old checkpoint.
Weak Agent vs Strong Agent
WMSS saves an earlier checkpoint as the Weak Agent and continues training the current model as the Strong Agent. The weak agent retains a softer decision boundary with probability mass on "hard negatives."
Curriculum-Enhanced Data Activation
Not all training samples are equally useful. WMSS uses entropy dynamics between weak and strong agents to weight samples by three signals.
Entropy measures how uncertain a model is about its prediction. A flat distribution = high uncertainty. A peaked distribution = high confidence.
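Shannon entropy makes this precise: a flat distribution over a 5-token vocabulary has the maximum entropy $\ln 5 \approx 1.61$ nats, while a peaked one is close to zero. A minimal sketch:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats; 0 * log(0) is treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

flat = [0.2] * 5                          # maximal uncertainty over 5 tokens
peaked = [0.96, 0.01, 0.01, 0.01, 0.01]   # high confidence

print(entropy(flat))    # ln(5) ≈ 1.609
print(entropy(peaked))  # ≈ 0.22
```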
$M_{\text{strong}}$ is more confident - it learned this concept well.
$M_{\text{strong}}$ is less certain - it forgot something it previously knew.
Each training sample gets scored by three signals. The model then focuses training on the most informative samples.
How confused was the weak model? Higher entropy = harder problem = more weight.
Did the strong model improve? If yes (ΔH < 0), revisit to stabilize what was learned.
Did the strong model get worse? If yes (ΔH > 0), up-weight to recover lost ground.
Strong model: confident (H=0.2)
ΔH = -0.6 (learned it)
Strong model: MORE uncertain (H=1.3)
ΔH = +0.5 (got worse!)
Strong model: still uncertain (H=1.6)
ΔH = -0.2 (barely improved)
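One way to combine these three signals is sketched below. The weighting function is hypothetical (the paper's exact formula may differ, and the weak-agent entropies here are made-up numbers): it rewards high weak-agent entropy and a large entropy change in either direction, matching the intuitions above.

```python
def sample_weight(h_weak, h_strong_prev, h_strong_now):
    """Hypothetical curriculum score from the three signals:
      1. weak-agent entropy (h_weak): harder problems weigh more
      2. ΔH < 0 (improved): revisit to stabilise what was learned
      3. ΔH > 0 (regressed): up-weight to recover lost ground
    """
    delta_h = h_strong_now - h_strong_prev
    return h_weak + abs(delta_h)

# The three worked examples from the text (h_weak values are illustrative):
print(sample_weight(h_weak=1.5, h_strong_prev=0.8, h_strong_now=0.2))  # learned it
print(sample_weight(h_weak=1.5, h_strong_prev=0.8, h_strong_now=1.3))  # got worse
print(sample_weight(h_weak=1.8, h_strong_prev=1.8, h_strong_now=1.6))  # barely improved
```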
Joint Training via Logit Mixing
The heart of WMSS: mix the weak agent's logits with the strong agent's logits to create a joint distribution. This reintroduces probability mass on hard negatives, amplifying gradients that had vanished.
Predict next token after "The derivative of x² is" - ground truth = 2x
Strong agent (saturated): x: 1.5% · 2: 0.8% · x²: 0.5% · dx: 0.2%
Weak agent: x: 25% · 2: 22% · x²: 15% · dx: 8%
Joint (mixed): x: 13% · 2: 11% · x²: 8% · dx: 4%
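The mixing step itself is a one-liner: interpolate the two logit vectors before the softmax. The sketch below uses illustrative logits (not the paper's numbers) and $\lambda = 0.5$; the joint distribution puts visibly more mass on the distractors than the saturated strong agent does, so their gradients come back to life.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Illustrative logits for ["2x", "x", "2", "x²", "dx"]; target = "2x"
strong = np.array([9.0, 4.5, 4.0, 3.5, 2.5])   # saturated: distractors near 0
weak = np.array([2.0, 1.4, 1.3, 0.9, 0.2])     # softer decision boundary

lam = 0.5  # mixing coefficient λ
joint = lam * weak + (1 - lam) * strong

for name, z in [("strong", strong), ("weak", weak), ("joint", joint)]:
    print(name, np.round(softmax(z), 3))
```

Note the joint distribution still ranks "2x" first - mixing revives the distractors without changing the winner.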
SFT vs WMSS: Epoch by Epoch
Token probability distribution across training epochs - 5-token vocabulary, one correct answer. Watch how SFT saturates while WMSS keeps distractors alive for learning.
Standard SFT
WMSS - gradient kept alive
The Complete Pipeline
WMSS follows three phases. Watch the algorithm execute step by step.
Mechanistic Insights
WMSS self-regulates through three inherent mechanisms - no hyperparameter tuning required.
The softmax function creates an S-curve. At the extremes (very high or very low logits), the curve is flat - gradients vanish. WMSS shifts tokens back to the steep middle region.
z very positive: P approaches 100%
At extremes, curve is flat = tiny gradients
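The two-class slice of softmax is the sigmoid, whose slope $\sigma(z)(1-\sigma(z))$ makes the flat-extremes point concrete - a minimal sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)  # slope of the S-curve

# Slope peaks at z=0 (0.25) and collapses at the extremes
for z in [-8, -2, 0, 2, 8]:
    print(f"z={z:+d}  P={sigmoid(z):.4f}  slope={sigmoid_grad(z):.4f}")
```

At $z=0$ the slope is 0.25; at $z=\pm 8$ it is about 0.0003 - three orders of magnitude smaller, which is exactly the vanished-gradient regime WMSS pulls tokens out of.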
The weak model's influence naturally fades as the strong model improves. No manual tuning needed.
Shifting all scores by the same amount doesn't change which token wins. Only relative gaps matter.
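This shift invariance of softmax is easy to check numerically - adding the same constant to every logit leaves the output distribution unchanged:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # the max-subtraction itself relies on shift invariance
    return e / e.sum()

z = np.array([3.0, 1.0, 0.5])
shifted = z + 7.0  # add the same constant to every logit

print(np.allclose(softmax(z), softmax(shifted)))  # True: only relative gaps matter
```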
What Each Method Actually Does
How it works - and whether it solves the gradient problem.
Standard SFT
NEFTune
UNDIAL
WMSS
Breaking the Saturation Ceiling
Concrete results on Qwen3-8B-Base · 2 epochs · averaged over 3 runs.
| Method | Math Avg | Code Avg | vs SFT (Math) | vs SFT (Code) |
|---|---|---|---|---|
| SFT baseline | 66.7% | 71.2% | — | — |
| UNDIAL | 67.7% | 70.4% | +1.0% | −0.8% |
| NEFTune | 68.5% | 72.4% | +1.8% | +1.2% |
| WMSS | 72.9% | 77.6% | +6.2% | +6.4% |
Each bar is an independent experiment vs SFT baseline - not cumulative.
What WMSS Hasn't Proven Yet
Scope boundaries stated directly in the paper. No speculation added.
Model Scale
Only Qwen3-4B and 8B tested. Behaviour on 70B+ models is completely unknown.
Task Domains
Math reasoning and code generation only. No NLU, translation, or dialogue results reported.
Architecture
All experiments use the Qwen3 family. Generalisation to Llama, Mistral, or others not demonstrated.
Weak Agent Selection
Always uses $M_0$, the pre-SFT checkpoint. What makes an optimal weak agent remains unexplored.
Over-Optimisation
AMC2023 regresses after Epoch 3; GSM8K shows volatility. Epoch 4 risks catastrophic forgetting.
Mixing Coefficient $\lambda$
Performance is sensitive to $\lambda$. Peak average at $\lambda$=0.42; drops meaningfully at extremes.
Where It Fits, What It Proves, What Comes Next
The Problem
SFT saturates once models grow confident. Gradients vanish - more training stops helping.
The only known fixes: bigger models or more data.
The Contribution
- Mix weak + strong logits ($\lambda$=0.5)
- Inflates distractor probabilities, keeping gradients alive past saturation
- Zero extra inference cost
- +6.2% Math avg, +6.4% Code avg vs SFT
- Distractor logits suppressed by 56.9%
Open Questions
- 70B+ models untested
- Math + code only
- One architecture (Qwen3)
- No RLHF comparison
- Optimal weak checkpoint selection unexplored