DeepSeek mHC: The Bomb

January 15, 2026

Part 3 of the DeepSeek mHC reproduction. Training at scale reveals the real behavior: with unconstrained Hyper-Connections (HC), signal amplification in transformers exceeds 10,000×, while the manifold-constrained variant (mHC) stays flat. See Part 1: Setup and Part 2: Infrastructure.

Jan 15: Training Begins

First SSH into the Lambda box (Central Texas). W&B dashboards start lighting up.

HC behavior:

Amax (peak signal gain) starts creeping up by step ~500
Hits 1,000× by step ~2,000
Keeps climbing

mHC behavior:

Flat
1.0
Every seed
Every depth

No crashes. No NaNs. Just quiet violence.
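
Why the gap between the two curves is plausible can be sketched in a few lines. This is a toy, not the training code. It assumes Amax is measured as the largest absolute row sum (forward) or column sum (backward) of the composite mixing matrix, and that the mHC constraint makes each mixing matrix approximately doubly stochastic through Sinkhorn-style normalization. The weights, the 4 streams, and the helper names are all made up for illustration.

```python
import random

def amax_gain(matrix):
    """Return (forward, backward) gain: the largest absolute row sum
    and the largest absolute column sum of a square matrix."""
    n = len(matrix)
    forward = max(sum(abs(v) for v in row) for row in matrix)
    backward = max(sum(abs(matrix[i][j]) for i in range(n)) for j in range(n))
    return forward, backward

def sinkhorn(matrix, iters=200):
    """Alternately normalize rows and columns of a strictly positive
    square matrix so it approaches a doubly stochastic matrix."""
    n = len(matrix)
    m = [row[:] for row in matrix]
    for _ in range(iters):
        m = [[v / sum(row) for v in row] for row in m]
        col = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col[j] for j in range(n)] for i in range(n)]
    return m

def matmul(a, b):
    """Plain square matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def random_mixing(rng, n):
    """Hypothetical per-layer mixing matrix: identity plus small positive noise."""
    return [[(1.0 if i == j else 0.0) + rng.uniform(0.01, 0.1) for j in range(n)]
            for i in range(n)]

rng = random.Random(0)      # synthetic weights, not values from the run
n_streams, depth = 4, 32    # illustrative sizes

identity = [[1.0 if i == j else 0.0 for j in range(n_streams)]
            for i in range(n_streams)]
free_product, constrained_product = identity, identity
for _ in range(depth):
    layer = random_mixing(rng, n_streams)
    free_product = matmul(layer, free_product)                      # HC-style
    constrained_product = matmul(sinkhorn(layer), constrained_product)  # mHC-style

print("unconstrained gain:", amax_gain(free_product))
print("constrained gain:  ", amax_gain(constrained_product))
```

Under these assumptions the constrained gain stays at 1.0 at any depth, because a product of doubly stochastic matrices is itself doubly stochastic. The unconstrained product has nothing holding it there, so small per-layer excesses compound.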

Jan 15: The Bomb Realization

HC amplification reached:

10,924× (d32)
3,721× (d48)
14,765× in stress tests

And nothing crashed.
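
For a sense of scale, here is a back-of-envelope calculation of the average per-layer gain that would compound to those totals. It assumes d32 and d48 mean 32 and 48 layers, and it is only a geometric mean: it does not claim the gain is spread evenly across layers. The function name is made up for this sketch.

```python
def per_layer_gain(total_gain, depth):
    """Geometric-mean gain per layer that compounds to total_gain
    over `depth` layers (assumes purely multiplicative compounding)."""
    return total_gain ** (1.0 / depth)

# Totals are the measured figures above; the depths are an assumption
# about what "d32" and "d48" mean.
for label, total, depth in [("d32", 10924, 32), ("d48", 3721, 48)]:
    g = per_layer_gain(total, depth)
    print(f"{label}: {g:.3f}x per layer -> {g ** depth:,.0f}x overall")
```

That works out to roughly 1.34× per layer at d32 and 1.19× at d48. A gain that looks harmless on any single layer is enough to reach four or five digits by the end of the stack.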

Hypothesis: Gradient clipping (norm = 1.0) is doing heroic, invisible work.
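
A minimal sketch of what that hypothesis means mechanically, assuming clipping is by global norm (the style used by PyTorch's `clip_grad_norm_`). The gradient values are invented for the example and the function is a stand-in, not the run's actual training loop.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale a flat list of gradient values so their L2 norm is at
    most max_norm. Returns (clipped_grads, norm_before_clipping)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# Made-up gradients with a large norm, as if amplified upstream.
grads = [3000.0, -4000.0]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))

print("norm before clipping:", norm_before)
print("norm after clipping: ", norm_after)
print("direction preserved: ", clipped[0] / clipped[1] == grads[0] / grads[1])
```

The optimizer only ever sees the clipped norm, so a blown-up gradient takes the same size step as a healthy one. That would explain why the loss curve stays calm. Logging the pre-clip norm is one way to check whether the clip is firing on every step.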

Key discovery: The instability starts at Layer 0, the input embedding. Not deep in the network. Not a late-stage effect. The first mixing matrix hits raw embeddings before LayerNorm.

That was not in the original paper.
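
One plausible reading of why "before LayerNorm" matters, shown as a toy: LayerNorm is close to scale-invariant, so anything downstream of it sees the same input no matter how large the stream has grown, while the stream itself carries the full gain. The embedding values and the 1.3× gain are invented, and the LayerNorm here has no learned scale or bias.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learned affine parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def peak(x):
    """Largest absolute activation in a vector."""
    return max(abs(v) for v in x)

embedding = [0.5, -1.2, 3.0, 0.1]   # made-up raw embedding
gain = 1.3                          # hypothetical gain from the first mixing step
amplified = [gain * v for v in embedding]

raw_ratio = peak(amplified) / peak(embedding)
normed_ratio = peak(layer_norm(amplified)) / peak(layer_norm(embedding))

print("peak ratio on the raw stream:", raw_ratio)
print("peak ratio after LayerNorm:  ", normed_ratio)
```

On the raw stream the peak grows by the full gain. After normalization the ratio is essentially 1, so a measurement taken after LayerNorm would not show the amplification at all.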

Visualization & Writing

Generated:

Main comparison plots
Scaling projections
Seed variance plots
Layer-wise heatmaps
Stress test curves

Built animations:

Amax counter
Layer heatmap (Layer 0 “canary”)
Signal flow amplification

Final State

What worked:

HC instability survives scale
mHC fixes it cleanly
No performance tradeoff
Zero variance in mHC Amax across seeds

What surprised me:

Instability starts at Layer 0
Gradient clipping masks a lot
Raw amplification at 1.7B exceeds that at 27B (non-monotonic scaling)

What’s next:

Find a compute sponsor
Push to 10B
See if the bomb finally detonates