mHC: Sinkhorn-Knopp Implementation

January 6, 2026

Read the DeepSeek mHC paper. The idea: replace the single residual stream with n parallel streams (expansion_rate=4) mixed by learnable H_in/H_res/H_out matrices. The composite gain Amax = max(||H_res||_∞, ||H_res||_1), where ||·||_∞ is the max absolute row sum and ||·||_1 the max absolute column sum, determines signal amplification through depth.
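A minimal pure-Python sketch of the gain computation, to make the two norms concrete. The function name amax_gain and the example matrices are hypothetical, not taken from the paper or from the actual implementation.

```python
def amax_gain(H):
    """Return max(||H||_inf, ||H||_1) for a matrix given as a list of rows."""
    # ||H||_inf: largest absolute row sum.
    row_sums = [sum(abs(x) for x in row) for row in H]
    # ||H||_1: largest absolute column sum.
    n_cols = len(H[0])
    col_sums = [sum(abs(row[j]) for row in H) for j in range(n_cols)]
    return max(max(row_sums), max(col_sums))


# Identity mixing (plain residual connection): gain is exactly 1.
H_identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

# Hypothetical unconstrained 2x2 matrix: rows sum to 1, but one column sums to 1.25.
H_unconstrained = [[0.5, 0.5],
                   [0.25, 0.75]]

gain_identity = amax_gain(H_identity)
gain_unconstrained = amax_gain(H_unconstrained)
```

Both norms are sub-multiplicative, so the gain of a product of L such matrices is at most the product of the per-layer gains. A per-layer gain slightly above 1 can therefore compound into a large factor at depth.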

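Initial implementation: HyperConnection class with unconstrained H_res, ManifoldHyperConnection with Sinkhorn-Knopp projection (10 iterations, ε=1e-3) enforcing the doubly stochastic constraint. A doubly stochastic matrix has non-negative entries with every row and column summing to 1, so both norms equal 1 and Amax=1.0, up to the residual error left after a finite number of iterations.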
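A rough sketch of the Sinkhorn-Knopp projection in pure Python, not the actual ManifoldHyperConnection code. Assumptions: the unconstrained parameters are treated as logits and exponentiated to get strictly positive entries, and ε is left out because this note does not say how it is used. The name sinkhorn_knopp and the example logits are hypothetical.

```python
import math


def sinkhorn_knopp(logits, n_iters=10):
    """Approximately project a square matrix of logits onto doubly stochastic matrices."""
    n = len(logits)
    # Assumption: exponentiate (shifted by the max for stability) so all entries are > 0.
    m = max(max(row) for row in logits)
    P = [[math.exp(x - m) for x in row] for row in logits]
    for _ in range(n_iters):
        # Normalize each row to sum to 1.
        P = [[x / sum(row) for x in row] for row in P]
        # Normalize each column to sum to 1.
        col_sums = [sum(P[i][j] for i in range(n)) for j in range(n)]
        P = [[P[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return P


# Hypothetical 4x4 logits (expansion_rate=4).
logits = [[0.0, 0.5, -0.5, 0.2],
          [0.3, -0.2, 0.1, 0.0],
          [-0.4, 0.1, 0.6, -0.1],
          [0.2, 0.0, -0.3, 0.4]]

P = sinkhorn_knopp(logits)
row_sums = [sum(row) for row in P]
col_sums = [sum(P[i][j] for i in range(4)) for j in range(4)]
```

Because the loop ends on a column normalization, column sums are 1 to floating-point precision, while row sums are only approximately 1. That gap is the residual that a fixed iteration count leaves behind.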

Architecture: GPT-2 style decoder, ~10M params baseline. Weight tying between tok_emb and lm_head. Character-level TinyShakespeare for fast iteration.

First training run overnight. 5000 steps, cosine LR decay from 3e-4.
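For reference, a sketch of a cosine decay schedule matching those numbers. Assumptions: no warmup and decay to lr_min=0.0, neither of which is stated above. The function name cosine_lr is hypothetical.

```python
import math


def cosine_lr(step, max_steps=5000, lr_max=3e-4, lr_min=0.0):
    """Cosine decay from lr_max at step 0 to lr_min at max_steps."""
    # Clamp so steps outside [0, max_steps] stay at the endpoints.
    progress = min(max(step, 0), max_steps) / max_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


lr_start = cosine_lr(0)
lr_mid = cosine_lr(2500)
lr_end = cosine_lr(5000)
```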