mHC: Stream Persistence Fix
Training loss curves looked wrong. HC was performing identically to the baseline residual connection: no Amax growth, no depth-scaling effects. Ran implementation audits with Claude and Gemini; both verified the Sinkhorn math and the matrix operations. “Looks correct.”
They were answering the wrong question. The one to ask: “What shape of tensor flows between transformer blocks?”
Found it. Each TransformerBlock.forward() was calling expand() on entry and collapse() on exit, so the n-stream representation never persisted between blocks. That gave me n independent residual connections inside each block, not HyperConnections.
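For the record, a minimal sketch of the broken pattern (the expand/collapse API, the hc module, and the attn/mlp internals are placeholders, not the actual mHC code):

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Buggy version: streams are created and destroyed inside every block."""
    def __init__(self, attn, mlp, hc):
        super().__init__()
        self.attn, self.mlp, self.hc = attn, mlp, hc

    def forward(self, x):               # x: (B, T, D) -- single stream in
        h = self.hc.expand(x)           # (B, T, n, D) created here...
        h = h + self.attn(h)
        h = h + self.mlp(h)
        return self.hc.collapse(h)      # ...and collapsed here, so nothing
                                        # stream-shaped ever reaches the next block
```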
Fix: move expand()/collapse() to the model level. Blocks now receive (B, T, n, D) tensors and maintain the stream dimension throughout. H_res now actually composes across the full depth L, giving a composite gain of Amax^L.
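Roughly what the refactored flow looks like (Model, hc, and the block signature are illustrative; the real model.py differs in the details):

```python
import torch.nn as nn

class Model(nn.Module):
    """Fixed version: expand once, keep (B, T, n, D) through every block, collapse once."""
    def __init__(self, blocks, hc):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.hc = hc

    def forward(self, x):               # x: (B, T, D)
        h = self.hc.expand(x)           # (B, T, n, D), created once at the input
        for block in self.blocks:
            h = block(h)                # each block reads and writes all n streams
        return self.hc.collapse(h)      # (B, T, D), collapsed once at the output
```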
Refactored model.py. Added per-layer gradient norm logging to train.py for gradient flow analysis. V4 running overnight.
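The gradient-norm logging is along these lines (the function name, model.blocks, and the TensorBoard writer are assumptions about how train.py is structured):

```python
import torch

def log_grad_norms(model, step, writer=None):
    """Log the L2 norm of each block's gradients; call right after loss.backward()."""
    for i, block in enumerate(model.blocks):
        grads = [p.grad.detach().flatten() for p in block.parameters() if p.grad is not None]
        if not grads:
            continue
        norm = torch.cat(grads).norm().item()
        if writer is not None:
            writer.add_scalar(f"grad_norm/block_{i}", norm, step)
        else:
            print(f"step {step} block {i:02d} grad_norm {norm:.4e}")
```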