mHC: Depth Sweep Validation

January 9, 2026

V4 validated the theory. Ran three experiment sets:

Depth sweep (HC only, iso-param ~11M):

| Depth | Dim | Val Loss | Max Amax |
|-------|-----|----------|----------|
| 6     | 384 | 1.157    | 4.67     |
| 12    | 256 | 1.070    | 6.59     |
| 20    | 224 | 0.854    | 9.22     | ← best loss
| 24    | 192 | 0.930    | 6.10     |
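
For context, a quick sanity check that these (depth, dim) pairs really are iso-param, assuming the standard pre-LN transformer estimate of ~12 · depth · dim² non-embedding parameters (4·dim² attention + 8·dim² MLP per block); the 12× factor is my assumption, not something stated above:

```python
# Approximate non-embedding param count: ~12 * depth * dim^2
# (4*dim^2 for attention, 8*dim^2 for the MLP; embeddings ignored).
for depth, dim in [(6, 384), (12, 256), (20, 224), (24, 192)]:
    params = 12 * depth * dim**2
    print(f"depth={depth:2d} dim={dim}: ~{params / 1e6:.1f}M params")
# depth= 6 dim=384: ~10.6M params
# depth=12 dim=256: ~9.4M params
# depth=20 dim=224: ~12.0M params
# depth=24 dim=192: ~10.6M params
```

All four land within roughly ±15% of the ~11M target, close enough for the sweep to isolate depth.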

Sweet spot at depth 20. Depth 24 regresses, possibly due to optimization difficulty at this scale. Note that the best loss also comes with the worst Amax (9.22), which foreshadows the tradeoff below.

Seed variation (depth 24, dim 192):

| Model | Val Loss (μ ± σ) | Max Amax (μ ± σ) |
|-------|------------------|------------------|
| HC    | 0.884 ± 0.033    | 6.77 ± 0.60      |
| mHC   | 1.116 ± 0.012    | 1.00 ± 0.00      |

HC wins on absolute loss, but with nearly 3x the run-to-run standard deviation (0.033 vs 0.012). mHC's Sinkhorn constraint eliminates Amax variance entirely, pinning it at exactly 1.00: guaranteed stability at the cost of ~0.23 nats.
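
The mechanism, for reference: a minimal PyTorch sketch of Sinkhorn-style normalization (alternating row/column normalization of a positive matrix), the standard way to push a matrix toward the doubly stochastic set. This is generic Sinkhorn-Knopp, not the paper's exact parameterization; the `softplus` and `n_iters=20` choices are my assumptions.

```python
import torch
import torch.nn.functional as F

def sinkhorn(logits: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Map an (n, n) real matrix to an approximately doubly stochastic one."""
    m = F.softplus(logits)  # make all entries strictly positive
    for _ in range(n_iters):
        m = m / m.sum(dim=1, keepdim=True)  # rows sum to 1
        m = m / m.sum(dim=0, keepdim=True)  # columns sum to 1
    return m

H = sinkhorn(torch.randn(4, 4))
# Every row sums to ~1, so the mixing matrix's max absolute row sum
# (an infinity-norm bound on residual amplification) is pinned at ~1,
# consistent with mHC's Amax = 1.00 ± 0.00 above.
print(H.sum(dim=0), H.sum(dim=1))
```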

The tradeoff: at small scale (11M params, 5K steps), unconstrained HC learns faster. The paper's claim that mHC enables stable training at 27B likely holds, though. Amplification compounds multiplicatively, so carry that 6.77 Amax across L = 60+ layers and you get numerical overflow. The doubly stochastic projection isn't about better optimization; it's about not exploding.
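
Back-of-envelope, under the pessimistic assumption that the worst-case per-layer gain compounds at every layer:

```python
import math

gain = 6.77                       # HC's mean max amplification above
fp16_max, bf16_max = 65504.0, 3.4e38
print(f"6.77^60 ~= 1e{60 * math.log10(gain):.0f}")                               # ~1e50
print(f"fp16 overflows after ~{math.log(fp16_max) / math.log(gain):.0f} layers")  # ~6
print(f"bf16 overflows after ~{math.log(bf16_max) / math.log(gain):.0f} layers")  # ~46
```

Even bf16's generous dynamic range is exhausted before layer 50; fp16 is gone almost immediately.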

Total wall time: 17 hours sequential on an M3 Max. Generated publication figures with error bars. Part 2 scales to 1B params on A100s this weekend.