DeepSeek mHC: Infrastructure Hell

January 14, 2026

Part 2 of the DeepSeek mHC reproduction. The question: does transformer residual instability survive at scale, or is it a small-model artifact? This covers the infrastructure work to find out. See Part 1 and Part 3: Results.


Jan 12: Scoping Part 2

Decision point:

  • Safe option: ~300M parameters
  • Promised option: 1B+ parameters

Decision: Honor the promise. If the instability doesn’t survive large-scale transformer training, the whole thesis dies.


Jan 12–13: Infrastructure Prep

Major changes:

  • Refactored training for multi-GPU (DDP/FSDP)
  • Switched dataset: TinyShakespeare (1MB) → C4 (300GB+)
  • Added W&B logging, gradient checkpointing, Flash Attention 2, bf16
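
For orientation, a minimal sketch of what the refactored loop looks like with DDP, bf16 autocast, and gradient checkpointing (Flash Attention and W&B left out). It assumes a Hugging Face-style causal LM and a torchrun launch; the learning rate, the .loss attribute, and the helper names are illustrative, not lifted from train_c4.py.

```python
# Sketch of the multi-GPU training loop: DDP + bf16 autocast + gradient
# checkpointing. Assumes a HF-style causal LM and a torchrun launch.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(model, loader, steps):
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for us
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = model.to(local_rank)
    # non-reentrant checkpointing plays nicer with DDP (HF-style API)
    model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

    # loader is assumed to already shard data per rank (DistributedSampler)
    for step, batch in zip(range(steps), loader):
        batch = {k: v.to(local_rank) for k, v in batch.items()}
        # bf16 autocast: no GradScaler needed, unlike fp16
        with torch.autocast("cuda", dtype=torch.bfloat16):
            loss = model(**batch).loss
        loss.backward()
        opt.step()
        opt.zero_grad(set_to_none=True)
        if step % 100 == 0 and dist.get_rank() == 0:
            print(f"step {step}: loss {loss.item():.3f}")

    dist.destroy_process_group()
```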

Built Lambda-specific scripts:

  • train_c4.py (~1,000 LOC)
  • run_experiments.sh
  • setup.sh

Early mistake: Estimated “quick mode” (10k steps, 4 runs) at ~4 hours. Reality: on a single GPU, quick mode took ~14 hours, and the full experiment matrix came to 60+ hours. Colab Pro was dead on arrival.

Lambda Labs. Started cheap with a single H100 PCIe (80GB).

Immediate problem: OOMs everywhere.

Reality check: At ~1.7B parameters with stream expansion, weights + AdamW state + activations come to roughly 78GB.

  • Batch size 32 → dead
  • Batch size 16 → dead
  • Batch size 8 → barely fits

Lesson learned the hard way: Memory math matters.
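
The rough version of that memory math, to show where a number like 78GB comes from. The bytes-per-parameter split below is an assumption (bf16 compute copy plus fp32 master weights and AdamW moments); the exact breakdown depends on the mixed-precision setup.

```python
# Back-of-envelope memory for a ~1.7B-parameter model under AdamW.
# Per-parameter byte counts are assumptions, not measurements.
params = 1.7e9

weights_bf16 = 2 * params   # bf16 copy used in forward/backward
master_fp32  = 4 * params   # fp32 master weights
grads        = 2 * params   # bf16 gradients
adam_moments = 8 * params   # two fp32 moments (exp_avg, exp_avg_sq)

fixed_gb = (weights_bf16 + master_fp32 + grads + adam_moments) / 1e9
print(f"weights + grads + optimizer state: ~{fixed_gb:.0f} GB")   # ~27 GB

# The rest is activations, which scale with batch size, sequence length,
# depth, and hidden size. At batch size 8 the total already crowds the
# 80GB card; at 16 or 32 it blows straight past it.
```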


Jan 14: The 8× H100 Decision

Did the math:

  • Single GPU, full run: ~63 hours
  • Error bars required multiple seeds, multiplying the run count
  • The single-GPU cost savings weren’t worth the time loss

Decision: Fuck it. Upgrade to 8× H100 SXM5.

  • ~$20/hr
  • Run experiments in parallel
  • Finish in one workday instead of three
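
The arithmetic behind that call, with the single-GPU hourly rate and the parallel wall-clock time filled in as assumed ballpark figures (only the ~63 hours and ~$20/hr come from above):

```python
# Rough cost/time comparison. single_gpu_rate and node_hours are assumed
# ballpark values; 63 h and $20/hr are the figures quoted above.
single_gpu_hours = 63
single_gpu_rate  = 2.5     # $/hr, assumed price of one H100 PCIe
node_rate        = 20.0    # $/hr for the 8x H100 SXM5 node
node_hours       = 10      # assumed: "one workday" with runs in parallel

print(f"single GPU: ~${single_gpu_hours * single_gpu_rate:.0f} over {single_gpu_hours} h")
print(f"8x node:    ~${node_hours * node_rate:.0f} over {node_hours} h")
# Roughly the same spend, about a sixth of the wall-clock time.
```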

Parallelization issues:

  • Thundering herd on C4 downloads
  • CUDA init conflicts
  • W&B API rate limiting

Fix: Stagger launches by 15 seconds.
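
The staggering lives in run_experiments.sh; here is the same idea as a Python sketch, with placeholder run names and a placeholder --config flag rather than the real train_c4.py interface.

```python
# Launch one run per GPU, 15 s apart, so the processes don't all hit the
# C4 download, CUDA init, and the W&B API at the same instant.
import os
import subprocess
import time

runs = ["hc_d24_s42", "hc_d24_s123", "hc_d48_s42", "hc_d48_s123"]  # placeholder names

procs = []
for gpu, run in enumerate(runs):
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)}   # pin each run to one GPU
    procs.append(subprocess.Popen(["python", "train_c4.py", "--config", run], env=env))
    time.sleep(15)   # the stagger

for p in procs:
    p.wait()
```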


Jan 14: Hardware Failure (GPU 3)

One run (hc_d48_s123) kept dying with:

CUDA error: unrecognized error code

Spent ~1 hour suspecting race conditions, tensor corruption, and numerical instability.

Reality: GPU 3 was just physically dead.

Evidence:

  • Same run worked fine on other GPUs
  • Other d48 runs succeeded elsewhere

Fix: Abandon GPU 3. Redistribute jobs.

Lesson: Cloud GPUs fail. Assume it will happen.
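
One thing worth automating next time: a cheap health check before handing out jobs. A sketch, not what was actually run; a truly dead GPU can also hang instead of raising, so wrapping this in a timeout would be the next step.

```python
# Probe each visible GPU with a small matmul and keep only the ones that
# survive. Catches the "GPU 3 is just dead" case before jobs get assigned.
import torch

def healthy_gpus():
    good = []
    for i in range(torch.cuda.device_count()):
        try:
            x = torch.randn(1024, 1024, device=f"cuda:{i}")
            (x @ x).sum().item()       # forces a kernel launch and a copy back
            torch.cuda.synchronize(i)
            good.append(i)
        except RuntimeError as err:    # CUDA errors surface as RuntimeError
            print(f"GPU {i} failed health check: {err}")
    return good

if __name__ == "__main__":
    print("usable devices:", healthy_gpus())
```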


Jan 14–15: Batch Size Reality Check

Depth-48 models (~2.5B params) OOMed instantly at batch size 8.

Fix: Drop to batch size 4.

Accepted tradeoff:

  • Fewer tokens per step
  • Cross-depth comparisons are slightly imperfect (different batch sizes)
  • Comparisons within each depth are still valid
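
The 32 → 16 → 8 → 4 hunt was done by hand; if it comes up again, something like this could automate it. build_model and run_step are placeholders for whatever train_c4.py actually does, not its real API.

```python
# Try batch sizes largest-first and return the first one that survives a
# full forward+backward step. build_model / run_step are placeholders.
import torch

def find_batch_size(build_model, run_step, candidates=(32, 16, 8, 4)):
    for bs in candidates:
        model = build_model()
        try:
            run_step(model, batch_size=bs)   # one training step at this size
            return bs
        except torch.cuda.OutOfMemoryError:
            del model
            torch.cuda.empty_cache()         # release fragments before retrying
    raise RuntimeError("no candidate batch size fits on this GPU")
```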

Next: The Bomb