DeepSeek mHC: The Bomb
Part 3 of the DeepSeek mHC reproduction. Training at scale reveals the real behavior: with unconstrained Hyper-Connections (HC), signal amplification in transformers exceeds 10,000×, while manifold-constrained Hyper-Connections (mHC) hold it at 1.0. See Part 1: Setup and Part 2: Infrastructure.
Jan 15: Training Begins
First SSH into the Lambda box (Central Texas). W&B dashboards start lighting up.
HC behavior:
- Amax starts creeping by step ~500
- Hits 1,000× by step ~2000
- Keeps climbing
mHC behavior:
- Flat
- Amax stays at 1.0
- Every seed
- Every depth
No crashes. No NaNs. Just quiet violence.
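Why mHC can sit at exactly 1.0 is easier to see in a toy. The sketch below is not the training code. It assumes Amax is the largest absolute row sum of the composite mixing matrix (the product of the per-layer mixing matrices), and that the mHC constraint amounts to pushing each mixing matrix toward a doubly stochastic one with Sinkhorn-style normalization. The names `composite_gain`, `sinkhorn`, and `amax`, the 4 streams, and the 0.3 noise scale are all made up for illustration.

```python
import math
import random

def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def sinkhorn(logits, iters=100):
    """Alternate column and row normalization of exp(logits) (Sinkhorn-Knopp)."""
    m = [[math.exp(x) for x in row] for row in logits]
    n = len(m)
    for _ in range(iters):
        col = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col[j] for j in range(n)] for i in range(n)]  # columns sum to 1
        rows = []
        for row in m:
            s = sum(row)
            rows.append([x / s for x in row])  # rows sum to 1
        m = rows
    return m

def amax(m):
    """Largest absolute row sum (assumed definition of Amax for this sketch)."""
    return max(sum(abs(x) for x in row) for row in m)

def composite_gain(depth, n_streams=4, constrained=False, seed=0):
    """Amax of the product of `depth` random per-layer mixing matrices."""
    rng = random.Random(seed)
    total = [[float(i == j) for j in range(n_streams)] for i in range(n_streams)]
    for _ in range(depth):
        # Hypothetical layer: identity plus Gaussian noise (std 0.3 is arbitrary).
        layer = [[float(i == j) + rng.gauss(0.0, 0.3) for j in range(n_streams)]
                 for i in range(n_streams)]
        if constrained:
            layer = sinkhorn(layer)
        total = matmul(layer, total)
    return amax(total)

for depth in (8, 32, 64):
    free = composite_gain(depth)
    fixed = composite_gain(depth, constrained=True)
    print(f"depth {depth}: unconstrained {free:.2f}, constrained {fixed:.2f}")
```

A product of non-negative matrices whose rows each sum to 1 also has rows summing to 1, so the constrained gain stays at 1.0 at any depth and for any seed, while the unconstrained product is free to drift. The toy says nothing about the real magnitudes. It only shows why a flat line is the expected shape.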
Jan 15: The Bomb Realization
HC amplification reached:
- 10,924× (d32)
- 3,721× (d48)
- 14,765× in stress tests
And nothing crashed.
Hypothesis: Gradient clipping (max norm = 1.0) is doing heroic, invisible work, capping every update so the amplification never surfaces as a crash.
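To make the hypothesis concrete, here is a minimal sketch of global-norm clipping in plain Python. It is an illustration, not the run's actual training loop: `global_norm` and `clip_by_global_norm` are hypothetical helpers, gradients are modeled as lists of floats, and the example values are invented. The scaling rule (multiply by `max_norm / norm` whenever the norm exceeds `max_norm`) is the standard one.

```python
import math

def global_norm(grads):
    """L2 norm over all gradient values, treated as one flat vector."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale gradients so their global norm is at most `max_norm`.

    Returns the clipped gradients and the pre-clip norm.
    """
    norm = global_norm(grads)
    scale = min(1.0, max_norm / (norm + 1e-6))  # small epsilon avoids division by zero
    return [[g * scale for g in tensor] for tensor in grads], norm

# Invented example values: one well-behaved step, one blown-up step.
healthy = [[0.3, -0.2], [0.1]]
exploding = [[3000.0, -2000.0], [1000.0]]
for grads in (healthy, exploding):
    clipped, raw = clip_by_global_norm(grads)
    print(f"raw norm {raw:.3g} -> applied norm {global_norm(clipped):.3g}")
```

Whatever the raw norm is, the applied norm never exceeds 1.0, so the optimizer sees a well-behaved step either way. The only place the blow-up is visible is the pre-clip norm, which is why it is worth logging next to the loss.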
Key discovery: The instability starts at Layer 0, right at the input embedding. Not deep in the network. Not a late-stage effect. The first mixing matrix hits raw embeddings before any LayerNorm has rescaled them.
That was not in the original paper.
Visualization & Writing
Generated:
- Main comparison plots
- Scaling projections
- Seed variance plots
- Layer-wise heatmaps
- Stress test curves
Built animations:
- Amax counter
- Layer heatmap (Layer 0 “canary”)
- Signal flow amplification
Final State
What worked:
- HC instability survives scale
- mHC fixes it cleanly
- No performance tradeoff
- Zero variance in mHC Amax across seeds
What surprised me:
- Instability starts at Layer 0
- Gradient clipping masks a lot
- 1.7B amplifies more than 27B in raw terms (scaling is non-monotonic)
What’s next:
- Find compute sponsor
- Push to 10B
- See if the bomb finally detonates