More data and bigger models get most of the attention. The training process itself can be the bottleneck.
This post explains one idea from a 2026 paper: a model's own weaker past self can make it smarter, with no bigger model, no extra data, and no added inference cost. The technique is called WMSS (Weak-Driven Learning), and it improves math and code benchmarks by roughly six percentage points over standard training.
Three questions worth sitting with before you read:
Why does training stall even when a model hasn't fully learned the material? (Hint: it's not overfitting.)
If a student aces practice exams by memorising answers, will they pass the real test? What's missing from their learning?
Can a weaker model teach a stronger one something useful? Sounds backwards — but it works. Here's exactly how and why.
How Are LLMs Trained?
Three steps from raw text to a working AI assistant.
Pre-Training
Learn language from billions of pages of text. Output: a base model that can predict words.
Post-Training
Supervised Fine-Tuning (SFT): Train on expert Q&A examples.
RLHF: Optimise with human preference feedback.
Deployed Model
Ready to chat, answer questions, and assist. This is ChatGPT, Claude, Gemini.
What is Supervised Fine-Tuning?
Teaching a model to follow instructions using curated examples.
Experts Create Data
Human experts write high-quality question-answer pairs. These become the training examples.
Train the Model
Model learns to mimic expert answers.
Q: Why is the sky blue?
A: Sunlight scatters off air molecules. Blue light scatters more because of its shorter wavelength.
Model Improves
Now follows instructions and gives helpful answers. Ready for the next step: RLHF alignment.
The Overconfident Student
SFT hits a ceiling for the same reason a student can ace practice tests and still fail the final.
The Journey of a Language Model
From raw text to saturation, and how WMSS breaks through.
Pre-training: Learning Language
The base model learns to predict the next token from a huge text corpus. Its predictions start nearly uniform across the vocabulary.
Context: "The derivative of x² is" → predict next token from vocabulary:
["2x", "x", "2", "x²", "dx"]
Supervised Fine-Tuning & Saturation
SFT sharpens the model's predictions toward the correct answer. But once the model becomes highly confident, gradients vanish and training stalls.
Under cross-entropy loss, the gradient on each non-target logit equals that token's probability: as $P(k|x) \to 0$, the gradient vanishes.
Target token: "2x"
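A few lines of Python make the vanishing gradient concrete. The logits below are illustrative, not taken from any real model; the formula $p_k - \mathbb{1}[k = \text{target}]$ is the standard cross-entropy gradient with respect to logits.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def ce_grads(logits, target):
    # Cross-entropy gradient w.r.t. each logit: p_k - 1[k == target].
    p = softmax(logits)
    return [p[k] - (1.0 if k == target else 0.0) for k in range(len(p))]

# Vocabulary slice for the running example; target index 0 is "2x".
vocab = ["2x", "x", "2", "x²", "dx"]

early = ce_grads([2.0, 1.5, 1.0, 0.5, 0.0], target=0)       # mid-SFT
saturated = ce_grads([12.0, 2.0, 1.0, 0.5, 0.0], target=0)  # after saturation

print("early    ", [round(g, 4) for g in early])
print("saturated", [round(g, 6) for g in saturated])
```

Early in training, non-target gradients are on the order of 0.1; once the target logit dominates, they shrink by several orders of magnitude, and further training barely moves the weights.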
Learning from Mistakes
What if the student had a study partner who keeps challenging them with tricky wrong answers?
Three Paradigms for Improving Models
Three paradigms, each drawing from a different source of supervision.
Knowledge Distillation
A larger, more capable model teaches a smaller one. Knowledge flows down.
Requires a stronger model. When the student catches up, the signal vanishes.
Self-Distillation
An earlier checkpoint of the same model provides uncertainty signals. No external model needed.
Uses the model's own history as teacher.
Weak-Driven Learning
An older, weaker checkpoint injects uncertainty upward into the strong model's training signal.
Just one old checkpoint — no extra model, no added inference cost.
Weak Agent vs Strong Agent
WMSS saves an earlier checkpoint as the Weak Agent and continues training the current model as the Strong Agent. The weak agent retains a softer decision boundary with probability mass on "hard negatives."
Curriculum-Enhanced Data Activation
WMSS weights training samples using entropy dynamics between the weak and strong agents — three signals that together determine how much each sample is worth.
Entropy measures how uncertain a model is about its prediction. A flat distribution = high uncertainty. A peaked distribution = high confidence.
$M_{\text{strong}}$ is more confident: it has learned this concept well.
$M_{\text{strong}}$ is less certain: it has forgotten what it once knew.
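The entropy signal itself is easy to compute. A minimal sketch, with illustrative distributions over the 5-token vocabulary from the running example:

```python
import math

def entropy(probs):
    # Shannon entropy in nats: high for flat distributions, low for peaked ones.
    return -sum(p * math.log(p) for p in probs if p > 0)

flat = [0.2] * 5                         # maximally uncertain over 5 tokens
peaked = [0.96, 0.01, 0.01, 0.01, 0.01]  # highly confident

h_flat = entropy(flat)    # log(5) ≈ 1.609, the maximum for 5 outcomes
h_peaked = entropy(peaked)
print(round(h_flat, 3), round(h_peaked, 3))
```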
Each training sample gets scored by three signals. The model then focuses training on the most informative samples.
How confused was the weak model? Higher entropy = harder problem = more weight.
Did the strong model improve? If yes (ΔH < 0), revisit to stabilize what was learned.
Did the strong model get worse? If yes (ΔH > 0), up-weight to recover lost ground.
Strong model: confident (H=0.2)
ΔH = -0.6 (learned it)
Strong model: MORE uncertain (H=1.3)
ΔH = +0.5 (got worse!)
Strong model: still uncertain (H=1.6)
ΔH = -0.2 (barely improved)
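The three signals can be combined into a per-sample weight. The scoring function below is a hypothetical sketch, not the paper's formula: it rewards weak-agent difficulty and any large entropy swing in the strong agent, in either direction, matching the three cases above.

```python
def sample_weight(h_weak, h_strong_prev, h_strong_now, alpha=1.0, beta=1.0):
    # HYPOTHETICAL scoring function; the paper's exact formula is not
    # reproduced here. Combines (1) the weak agent's entropy as a
    # difficulty signal and (2) the magnitude of the strong agent's
    # entropy change, so both consolidation (ΔH < 0) and regression
    # (ΔH > 0) raise a sample's weight.
    delta_h = h_strong_now - h_strong_prev
    return alpha * h_weak + beta * abs(delta_h)

# Numbers mirror the three cases above (weak-agent entropies illustrative).
w_learned = sample_weight(h_weak=1.0, h_strong_prev=0.8, h_strong_now=0.2)    # ΔH = -0.6
w_regressed = sample_weight(h_weak=1.0, h_strong_prev=0.8, h_strong_now=1.3)  # ΔH = +0.5
w_flat = sample_weight(h_weak=1.0, h_strong_prev=1.8, h_strong_now=1.6)       # ΔH = -0.2

print(w_learned, w_regressed, w_flat)
```

Under this sketch, the "learned it" and "got worse" samples both outrank the "barely improved" one, which is the ordering the three signals are meant to produce.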
Joint Training via Logit Mixing
Mix the weak agent's logits with the strong agent's to create a joint distribution. This puts probability mass back on hard negatives, amplifying the gradients that had vanished.
Predict next token after "The derivative of x² is" - ground truth = 2x
| Distractor probabilities | x | 2 | x² | dx |
|---|---|---|---|---|
| Strong agent (saturated) | 1.5% | 0.8% | 0.5% | 0.2% |
| Weak agent | 25% | 22% | 15% | 8% |
| Mixed (λ=0.5) | 13% | 11% | 8% | 4% |
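A sketch of the mixing step, assuming linear interpolation in logit space with coefficient λ. The logits are illustrative, not taken from any real model:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Illustrative logits; vocab order is ["2x", "x", "2", "x²", "dx"],
# ground truth "2x" at index 0.
strong = [10.0, 2.0, 1.4, 0.9, 0.0]  # saturated: distractors near zero
weak = [1.2, 0.9, 0.8, 0.4, -0.2]    # softer boundary: mass on hard negatives

lam = 0.5  # mixing coefficient λ
mixed = [lam * w + (1 - lam) * s for w, s in zip(weak, strong)]

p_strong = softmax(strong)
p_mixed = softmax(mixed)
print("strong distractor mass:", round(1 - p_strong[0], 6))
print("mixed distractor mass: ", round(1 - p_mixed[0], 6))
```

The mixed distribution still ranks the ground-truth token first, but the distractors regain enough probability mass for their gradients to matter again.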
SFT vs WMSS: Epoch by Epoch
Token probability distribution across training epochs - 5-token vocabulary, one correct answer. Watch how SFT saturates while WMSS keeps distractors alive for learning.
Standard SFT
WMSS
The Complete Pipeline
WMSS runs in three phases.
Mechanistic Insights
WMSS self-regulates — three properties that fall out of the math without any hyperparameter tuning.
The softmax function creates an S-curve. At the extremes (very high or very low logits), the curve is flat - gradients vanish. WMSS shifts tokens back to the steep middle region.
When z is very positive, P approaches 100%; when z is very negative, P approaches 0%. At both extremes the curve is flat, so gradients are tiny.
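The flatness is easy to check numerically. In the two-token case, softmax reduces to the logistic sigmoid, whose derivative σ(z)(1 − σ(z)) peaks at z = 0 and collapses at the extremes:

```python
import math

def sigmoid(z):
    # Two-token softmax reduces to the logistic sigmoid.
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative σ'(z) = σ(z)(1 - σ(z)): steepest at z = 0, flat at extremes.
    s = sigmoid(z)
    return s * (1.0 - s)

for z in [-10, -2, 0, 2, 10]:
    print(z, round(sigmoid_grad(z), 6))
```

The gradient at z = 0 is 0.25; at z = ±10 it is below 0.0001, a difference of more than three orders of magnitude.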
The weak model's influence naturally fades as the strong model improves. No manual tuning needed.
Shifting all scores by the same amount doesn't change which token wins. Only relative gaps matter.
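Shift invariance is a one-liner to verify: adding the same constant to every logit leaves the softmax output unchanged.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

z = [3.0, 1.0, 0.5]
shifted = [v + 7.0 for v in z]  # add the same constant to every logit

same = all(abs(a - b) < 1e-12 for a, b in zip(softmax(z), softmax(shifted)))
print(same)  # only the relative gaps between logits matter
```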
What Each Method Actually Does
How it works - and whether it solves the gradient problem.
Standard SFT
NEFTune
UNDIAL
WMSS
Breaking the Saturation Ceiling
Concrete results on Qwen3-8B-Base · 2 epochs · averaged over 3 runs.
| Method | Math Avg | Code Avg | vs SFT (Math) | vs SFT (Code) |
|---|---|---|---|---|
| SFT baseline | 66.7% | 71.2% | — | — |
| UNDIAL | 67.7% | 70.4% | +1.0% | −0.8% |
| NEFTune | 68.5% | 72.4% | +1.8% | +1.2% |
| WMSS | 72.9% | 77.6% | +6.2% | +6.4% |
Each result is an independent experiment against the SFT baseline, not cumulative.
What WMSS Hasn't Proven Yet
Taken directly from the paper — no extrapolation.
Model Scale
Only Qwen3-4B and 8B tested. Behaviour on 70B+ models is completely unknown.
Task Domains
Math reasoning and code generation only. No NLU, translation, or dialogue results reported.
Architecture
All experiments use the Qwen3 family. Generalisation to Llama, Mistral, or others not demonstrated.
Weak Agent Selection
Always uses $M_0$, the pre-SFT checkpoint. What makes an optimal weak agent remains unexplored.
Over-Optimisation
AMC2023 regresses after Epoch 3; GSM8K shows volatility. Epoch 4 risks catastrophic forgetting.
Mixing Coefficient $\lambda$
Performance is sensitive to $\lambda$. Peak average at $\lambda$=0.42; drops meaningfully at extremes.
Where It Fits, What It Proves, What Comes Next
The Problem
SFT saturates once models grow confident. Gradients vanish - more training stops helping.
The standard fixes are a bigger model or more data.
The Contribution
- Mix weak + strong logits ($\lambda$=0.5)
- Inflates distractor probabilities, keeping gradients alive past saturation
- Zero extra inference cost
- +6.2% Math avg, +6.4% Code avg vs SFT
- Distractor logits suppressed by 56.9%
Open Questions
- 70B+ models untested
- Math + code only
- One architecture (Qwen3)
- No RLHF comparison
- Optimal weak checkpoint selection unexplored