DeepSeek mHC: Infrastructure Hell
Part 2 of the DeepSeek mHC reproduction. The question: does transformer residual instability survive at scale, or is it a small-model artifact? This covers the infrastructure work to find out. See Part 1 and Part 3: Results.
Jan 12: Scoping Part 2
Decision point:
- Safe option: ~300M parameters
- Promised option: 1B+ parameters
Decision: Honor the promise. If the instability doesn’t survive large-scale transformer training, the whole thesis dies.
Jan 12–13: Infrastructure Prep
Major changes:
- Refactored training for multi-GPU (DDP/FSDP)
- Switched dataset: TinyShakespeare (1MB) → C4 (300GB+)
- Added W&B logging, gradient checkpointing, Flash Attention 2, bf16
Built Lambda-specific scripts:
- train_c4.py (~1,000 LOC)
- run_experiments.sh
- setup.sh
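For orientation, here's a minimal sketch of how those pieces fit together in the training loop. It is not the actual train_c4.py: build_model and build_c4_loader are placeholders, the HF-style gradient_checkpointing_enable() / .loss calls are assumptions, and Flash Attention 2 is assumed to be baked into the model's attention implementation rather than shown explicitly.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank: int, world_size: int, steps: int = 10_000):
    torch.cuda.set_device(rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    model = build_model().cuda(rank)            # placeholder: returns an nn.Module
    model.gradient_checkpointing_enable()       # HF-style API; raw PyTorch would use torch.utils.checkpoint
    model = DDP(model, device_ids=[rank])

    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    loader = build_c4_loader(rank, world_size)  # placeholder: streams C4 shards per rank

    for step, batch in zip(range(steps), loader):
        with torch.autocast("cuda", dtype=torch.bfloat16):   # bf16 mixed precision
            loss = model(**batch).loss                        # assumes an HF-style output object
        loss.backward()
        opt.step()
        opt.zero_grad(set_to_none=True)
```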
Early mistake: Estimated “quick mode” (10k steps, 4 runs) at ~4 hours. Reality: single GPU quick mode took ~14 hours. Full experiment matrix: 60+ hours. Colab Pro was dead on arrival.
Enter Lambda Labs. Started cheap with a single H100 PCIe (80GB).
Immediate problem: OOMs everywhere.
Reality check: At ~1.7B parameters with stream expansion, weights + AdamW state + activations ≈ 78GB.
- Batch size 32 → dead
- Batch size 16 → dead
- Batch size 8 → barely fits
Lesson learned the hard way: Memory math matters.
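To make the memory math concrete, a back-of-the-envelope sketch (not from the repo). It assumes bf16 weights and gradients with fp32 AdamW state; the activation term is the big unknown and is just a guessed number here.

```python
# Rough memory budget for mixed-precision AdamW training.
def train_memory_gb(n_params: float, activation_gb: float) -> float:
    GB = 1024 ** 3
    weights   = 2 * n_params / GB    # bf16 weights (2 bytes/param)
    grads     = 2 * n_params / GB    # bf16 gradients
    optimizer = 12 * n_params / GB   # fp32 master copy + exp_avg + exp_avg_sq (4 bytes each)
    return weights + grads + optimizer + activation_gb

# ~1.7B params: fixed state alone is ~25GB before a single activation is stored;
# batch-8 activations (guessed at ~50GB here) push the total toward the 80GB ceiling.
print(f"{train_memory_gb(1.7e9, activation_gb=50):.0f} GB")
```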
Jan 14: The 8× H100 Decision
Did the math:
- Single GPU, full run: ~63 hours
- Cost savings weren’t worth the time loss
- Error bars required multiple seeds
Decision: Fuck it. Upgrade to 8× H100 SXM5.
- ~$20/hr
- Run experiments in parallel
- Finish in one workday instead of three
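Spelled out (the single-GPU hourly rate is an assumed placeholder; only the ~63h and ~$20/hr figures come from the notes above):

```python
# Back-of-the-envelope cost/time comparison behind the 8x upgrade.
single_gpu_hours = 63
single_gpu_rate  = 2.5                      # $/hr for one H100 PCIe -- assumed, not quoted
eight_gpu_rate   = 20.0                     # $/hr for 8x H100 SXM5 (from above)
eight_gpu_hours  = single_gpu_hours / 8     # optimistic: runs are embarrassingly parallel

print(f"1x: ~{single_gpu_hours}h, ~${single_gpu_hours * single_gpu_rate:.0f}")
print(f"8x: ~{eight_gpu_hours:.0f}h, ~${eight_gpu_hours * eight_gpu_rate:.0f}")
# Dollars come out roughly the same; the difference is ~3 days of wall clock vs. one workday.
```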
Parallelization issues:
- Thundering herd on C4 downloads
- CUDA init conflicts
- W&B API rate limiting
Fix: Stagger launches by 15 seconds.
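What that looks like in practice, roughly (the real launcher is run_experiments.sh; this Python equivalent uses illustrative run names and a hypothetical --run-name flag):

```python
import os
import subprocess
import time

runs = ["baseline_s42", "hc_d24_s42", "hc_d48_s42", "hc_d48_s123"]  # illustrative run names

procs = []
for gpu, name in enumerate(runs):
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)}  # pin each run to its own GPU
    cmd = ["python", "train_c4.py", "--run-name", name]     # flag name is hypothetical
    procs.append(subprocess.Popen(cmd, env=env))
    time.sleep(15)  # the 15s stagger: no simultaneous C4 downloads, CUDA init, or W&B auth bursts

for p in procs:
    p.wait()
```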
Jan 14: Hardware Failure (GPU 3)
One run (hc_d48_s123) kept dying with:
CUDA error: unrecognized error code
Spent ~1 hour chasing race conditions, tensor corruption, and numerical instability.
Reality: GPU 3 was just physically dead.
Evidence:
- Same run worked fine on other GPUs
- Other d48 runs succeeded elsewhere
Fix: Abandon GPU 3. Redistribute jobs.
Lesson: Cloud GPUs fail. Assume it will happen.
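One cheap way to assume it: smoke-test every GPU before dispatching jobs, so a dead card fails loudly up front instead of masquerading as a training bug. A plain-PyTorch sketch (not from the repo):

```python
import torch

def healthy_gpus() -> list[int]:
    """Return indices of GPUs that survive a trivial kernel launch + readback."""
    good = []
    for i in range(torch.cuda.device_count()):
        try:
            x = torch.randn(1024, 1024, device=f"cuda:{i}")
            (x @ x).sum().item()        # forces real compute and a device->host copy
            torch.cuda.synchronize(i)
            good.append(i)
        except RuntimeError as err:
            print(f"GPU {i} failed smoke test: {err}")
    return good

print("usable GPUs:", healthy_gpus())
```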
Jan 14–15: Batch Size Reality Check
Depth-48 models (~2.5B params) OOM'd instantly at batch size 8.
Fix: Drop to batch size 4.
Accepted tradeoff:
- Fewer tokens per step
- Slightly imperfect cross-depth comparison
- Still valid within each depth
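For the record, the tokens-per-step gap this creates (sequence length and the non-d48 batch size are assumed values; only d48 → 4 comes from above):

```python
seq_len = 2048                     # assumed context length
batch_size = {24: 8, 48: 4}        # depth -> per-GPU batch size; d48 value from above, d24 assumed

for depth, bs in batch_size.items():
    print(f"depth {depth}: {bs * seq_len:,} tokens/step")
# Within a depth, every run sees the same tokens/step, so within-depth comparisons hold;
# across depths, the d48 runs simply see half as many tokens per optimizer step.
```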
Next: The Bomb