DeepSeek mHC: The Bomb

January 15, 2026

Part 3 of the DeepSeek mHC reproduction. Training at scale reveals the real behavior: with unconstrained Hyper-Connections (HC), signal amplification in transformers exceeds 10,000×. See Part 1: Setup and Part 2: Infrastructure.


Jan 15: Training Begins

First SSH into the Lambda box (Central Texas). W&B dashboards start lighting up.

HC behavior:

  • Amax (the amplification factor) starts creeping up by step ~500
  • Hits 1,000× by step ~2,000
  • Keeps climbing

mHC behavior:

  • Flat
  • 1.0
  • Every seed
  • Every depth
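
The contrast between those two lists can be reproduced in a toy setting. This is a minimal sketch, not the training code. It assumes Amax is the larger of the maximum absolute row sum and column sum of the composed mixing matrices, and that mHC keeps each mixing matrix (near) doubly stochastic via Sinkhorn-style normalization. The helper names and the random toy matrices are hypothetical stand-ins, not values from these runs.

```python
import math
import random

def matmul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def amax_gain(m):
    """Assumed Amax: larger of the max absolute row sum and column sum."""
    rows = max(sum(abs(x) for x in row) for row in m)
    cols = max(sum(abs(row[j]) for row in m) for j in range(len(m)))
    return max(rows, cols)

def sinkhorn(logits, iters=50):
    """Exponentiate, then alternately normalize rows and columns."""
    m = [[math.exp(x) for x in row] for row in logits]
    n = len(m)
    for _ in range(iters):
        m = [[x / sum(row) for x in row] for row in m]
        col_sums = [sum(row[j] for row in m) for j in range(n)]
        m = [[row[j] / col_sums[j] for j in range(n)] for row in m]
    return m

def composed_gain(layers):
    """Amax of the product of all per-layer mixing matrices."""
    n = len(layers[0])
    total = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for layer in layers:
        total = matmul(layer, total)
    return amax_gain(total)

def toy_layers(depth=32, streams=4, seed=0):
    """Hypothetical stand-ins for trained matrices, not data from these runs."""
    rng = random.Random(seed)
    hc, mhc = [], []
    for _ in range(depth):
        # HC toy: identity plus small positive noise, with no constraint applied.
        hc.append([[(1.0 if i == j else 0.0) + rng.uniform(0.0, 0.2)
                    for j in range(streams)] for i in range(streams)])
        # mHC toy: random logits pushed to (near) doubly stochastic form.
        mhc.append(sinkhorn([[rng.uniform(-1.0, 1.0) for _ in range(streams)]
                             for _ in range(streams)]))
    return hc, mhc

hc_layers, mhc_layers = toy_layers()
hc_gain = composed_gain(hc_layers)    # grows with depth
mhc_gain = composed_gain(mhc_layers)  # stays at ~1.0
```

The reason the constrained case stays flat: a product of doubly stochastic matrices is itself doubly stochastic, so every row and column sum of the composed map stays at 1 regardless of depth. The unconstrained toy has nothing holding its row sums down, so they multiply layer after layer.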

No crashes. No NaNs. Just quiet violence.


Jan 15: The Bomb Realization

HC amplification reached:

  • 10,924× (d32)
  • 3,721× (d48)
  • 14,765× in stress tests

And nothing crashed.

Hypothesis: Gradient clipping (norm = 1.0) is doing heroic, invisible work.
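
To make the hypothesis concrete, here is a minimal sketch of global-norm clipping at 1.0. It assumes the common formulation (scale every gradient by max_norm / total_norm when the total exceeds max_norm). The function name and the toy gradient values are mine, not taken from the training code.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale all gradients together so their joint L2 norm is at most max_norm.

    Returns the clipped gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm

# Hypothetical gradients: one blown up by amplification, one well-behaved.
exploded, norm_before = clip_by_global_norm([[3000.0, 4000.0]])
calm, calm_norm = clip_by_global_norm([[0.3, 0.4]])
```

The exploded gradient goes in with a norm of 5,000 and comes out with a norm of roughly 1.0, while the calm one passes through untouched. If clipping is what keeps HC alive, the pre-clip norm is the number worth logging, because the clipped update looks healthy either way.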

Key discovery: The instability starts at Layer 0, the input embedding. Not deep in the network. Not a late-stage effect. The first mixing matrix hits raw embeddings before LayerNorm.
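
One way to see why Layer 0 would be exposed: a mixing matrix applied to raw embeddings inherits whatever scale those embeddings have, while one applied after LayerNorm does not. The sketch below is a toy illustration under stated assumptions (LayerNorm without learned scale or shift, a hypothetical 2-stream mixing matrix H, made-up embedding values). It shows the exposure, not the full mechanism.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learned scale or shift (a simplification)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def mix(h, streams):
    """Mix residual streams: output i is sum_j h[i][j] * streams[j]."""
    dim = len(streams[0])
    return [[sum(h[i][j] * streams[j][d] for j in range(len(streams)))
             for d in range(dim)] for i in range(len(h))]

def peak(streams):
    """Largest absolute activation across all streams."""
    return max(abs(v) for s in streams for v in s)

# Hypothetical values, chosen only for illustration.
H = [[1.2, 0.3], [0.4, 1.1]]
embeddings = [[0.5, -1.0, 2.0, 0.25], [1.5, 0.0, -0.5, 1.0]]
scaled = [[100.0 * v for v in s] for s in embeddings]

# Mixing raw embeddings: the output inherits whatever scale the input has.
raw_ratio = peak(mix(H, scaled)) / peak(mix(H, embeddings))

# Mixing after LayerNorm: the input scale is normalized away first.
normed_ratio = (peak(mix(H, [layer_norm(s) for s in scaled]))
                / peak(mix(H, [layer_norm(s) for s in embeddings])))
```

Scaling the embeddings by 100 scales the raw-mixed output by 100, while the LayerNorm path barely moves. Any drift in embedding magnitude during training reaches the first mixing matrix unfiltered.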

That was not in the original paper.


Visualization & Writing

Generated:

  • Main comparison plots
  • Scaling projections
  • Seed variance plots
  • Layer-wise heatmaps
  • Stress test curves

Built animations:

  • Amax counter
  • Layer heatmap (Layer 0 “canary”)
  • Signal flow amplification


Final State

What worked:

  • HC instability survives scale
  • mHC fixes it cleanly
  • No performance tradeoff
  • Zero seed-to-seed variance with mHC

What surprised me:

  • Instability starts at Layer 0
  • Gradient clipping masks a lot
  • Raw amplification was higher at 1.7B than at 27B (non-monotonic scaling)

What’s next:

  • Find compute sponsor
  • Push to 10B
  • See if the bomb finally detonates