mHC: Depth Sweep Validation
V4 validated the theory. Two experiment sets:
Depth sweep (HC only, iso-param ~11M; rough parameter check sketched after the table):
| Depth | Dim | Val Loss | Max Amax |
|---|---|---|---|
| 6 | 384 | 1.157 | 4.67 |
| 12 | 256 | 1.070 | 6.59 |
| 20 | 224 | 0.854 ← best loss | 9.22 |
| 24 | 192 | 0.930 | 6.10 |
Sweet spot at depth 20. Depth 24 regresses, possibly an optimization difficulty for the deepest, narrowest configuration (dim 192) at this scale.
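As a sanity check on the iso-param pairing, here's a rough count using the standard ~12·L·d² estimate for non-embedding transformer parameters. This assumes a vanilla pre-LN block and ignores embeddings and the HC mixing weights, so treat it as ballpark only:

```python
# Rough non-embedding parameter count for a standard transformer stack:
# attention (QKV + output projection) ~= 4*d^2, MLP with 4x expansion ~= 8*d^2,
# so ~12 * depth * dim^2 total. Embeddings and HC mixing weights excluded.
def approx_params(depth: int, dim: int) -> int:
    return 12 * depth * dim ** 2

for depth, dim in [(6, 384), (12, 256), (20, 224), (24, 192)]:
    print(f"depth={depth:2d} dim={dim:3d} -> ~{approx_params(depth, dim) / 1e6:.1f}M")
# -> 10.6M, 9.4M, 12.0M, 10.6M: all in the same ballpark as the ~11M target.
```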
Seed variation (depth 24, dim 192):
| Model | Val Loss (μ±σ) | Max Amax (μ±σ) |
|---|---|---|
| HC | 0.884 ± 0.033 | 6.77 ± 0.60 |
| mHC | 1.116 ± 0.012 | 1.00 ± 0.00 |
HC wins on mean validation loss but with nearly 3x the seed-to-seed spread (σ 0.033 vs 0.012). mHC's Sinkhorn constraint eliminates Amax variance entirely (pinned at 1.00): guaranteed stability at a cost of ~0.23 nats.
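For context, a minimal sketch of the Sinkhorn-style projection behind that constraint. Assumptions: this is the textbook Sinkhorn-Knopp iteration rather than the paper's exact parameterization, and the 4-stream size is purely illustrative. With every row and column summing to 1, the mixing step's largest achievable gain is exactly 1, consistent with the Max Amax = 1.00 ± 0.00 row above if Amax is read as that gain.

```python
import torch

def sinkhorn_project(logits: torch.Tensor, n_iters: int = 10) -> torch.Tensor:
    """Map an unconstrained square matrix onto (approximately) the doubly
    stochastic set: exponentiate to get positive entries, then alternately
    normalize rows and columns (Sinkhorn-Knopp)."""
    m = torch.exp(logits)                   # strictly positive matrix
    for _ in range(n_iters):
        m = m / m.sum(dim=1, keepdim=True)  # rows sum to 1
        m = m / m.sum(dim=0, keepdim=True)  # columns sum to 1
    return m

torch.manual_seed(0)
mix = sinkhorn_project(torch.randn(4, 4))   # 4 residual streams, illustrative
print(mix.sum(dim=0), mix.sum(dim=1))       # both ~[1, 1, 1, 1]
# No row can sum past 1, so the mixing step cannot amplify the residual stream.
```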
The tradeoff: at small scale (11M params, 5K steps), unconstrained HC learns faster. The paper's claim that mHC enables stable training at 27B likely still holds: in the worst case, a per-layer gain of 6.77 compounds multiplicatively across 60+ layers, which blows past float range long before the final block. The doubly stochastic projection isn't about better optimization; it's about not exploding.
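Back-of-the-envelope for that overflow argument, under the worst-case assumption that the per-layer gain compounds multiplicatively through the whole stack (real activations won't hit the worst case at every layer, but the constraint removes the failure mode by construction):

```python
gain_hc, gain_mhc, layers = 6.77, 1.00, 60

print(gain_hc ** layers)   # ~7e49: far beyond fp32/bf16 max (~3.4e38), let alone fp16
print(gain_mhc ** layers)  # 1.0: a doubly stochastic mix has nothing to compound
```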
Total wall time: 17 hours sequential on M3 Max. Generated publication figures with error bars. Part 2 scales to 1B params on A100s this weekend.