DeepSeek mHC: Infrastructure Hell
Part 2 of the DeepSeek mHC reproduction. The question: does transformer residual instability survive at scale, or is it a small-model artifact? This covers the infrastructure work to find out. See Part 1 and Part 3: Results.
Jan 12: Scoping Part 2
Decision point:
- Safe option: ~300M parameters
- Promised option: 1B+ parameters
Decision: Honor the promise. If the instability doesn’t survive large-scale transformer training, the whole thesis dies.
Jan 12–13: Infrastructure Prep
Major changes:
- Refactored training for multi-GPU (DDP/FSDP)
- Switched dataset: TinyShakespeare (1MB) → C4 (300GB+)
- Added W&B logging, gradient checkpointing, Flash Attention 2, bf16
Built Lambda-specific scripts:
- train_c4.py (~1,000 LOC)
- run_experiments.sh
- setup.sh
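For orientation, here's a minimal sketch of how those pieces fit together in the training loop. It is not the actual train_c4.py: build_model and build_c4_loader are placeholders, the HF-style gradient_checkpointing_enable() / .loss calls are assumptions, and Flash Attention 2 is assumed to be baked into the model's attention implementation rather than shown explicitly.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank: int, world_size: int, steps: int = 10_000):
    torch.cuda.set_device(rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    model = build_model().cuda(rank)            # placeholder: returns an nn.Module
    model.gradient_checkpointing_enable()       # HF-style API; raw PyTorch would use torch.utils.checkpoint
    model = DDP(model, device_ids=[rank])

    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    loader = build_c4_loader(rank, world_size)  # placeholder: streams C4 shards per rank

    for step, batch in zip(range(steps), loader):
        with torch.autocast("cuda", dtype=torch.bfloat16):   # bf16 mixed precision
            loss = model(**batch).loss                        # assumes an HF-style output object
        loss.backward()
        opt.step()
        opt.zero_grad(set_to_none=True)
```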
Early mistake: Estimated “quick mode” (10k steps, 4 runs) at ~4 hours. Reality: single GPU quick mode took ~14 hours. Full experiment matrix: 60+ hours. Colab Pro was dead on arrival.
Enter Lambda Labs. Started cheap with a single H100 PCIe (80GB).
Immediate problem: OOMs everywhere.
Reality check: At ~1.7B parameters with stream expansion, weights + AdamW state + activations ≈ 78GB.
- Batch size 32 → dead
- Batch size 16 → dead
- Batch size 8 → barely fits
Lesson learned the hard way: Memory math matters.
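To make the memory math concrete, a back-of-the-envelope sketch (not from the repo). It assumes bf16 weights and gradients with fp32 AdamW state; the activation term is the big unknown and is just a guessed number here.

```python
# Rough memory budget for mixed-precision AdamW training.
def train_memory_gb(n_params: float, activation_gb: float) -> float:
    GB = 1024 ** 3
    weights   = 2 * n_params / GB    # bf16 weights (2 bytes/param)
    grads     = 2 * n_params / GB    # bf16 gradients
    optimizer = 12 * n_params / GB   # fp32 master copy + exp_avg + exp_avg_sq (4 bytes each)
    return weights + grads + optimizer + activation_gb

# ~1.7B params: fixed state alone is ~25GB before a single activation is stored;
# batch-8 activations (guessed at ~50GB here) push the total toward the 80GB ceiling.
print(f"{train_memory_gb(1.7e9, activation_gb=50):.0f} GB")
```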
Jan 14: The 8× H100 Decision
Did the math:
- Single GPU, full run: ~63 hours
- Cost savings weren’t worth the time loss
- Error bars required multiple seeds
Decision: Fuck it. Upgrade to 8× H100 SXM5.
- ~$20/hr
- Run experiments in parallel
- Finish in one workday instead of three
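Spelled out (the single-GPU hourly rate is an assumed placeholder; only the ~63h and ~$20/hr figures come from the notes above):

```python
# Back-of-the-envelope cost/time comparison behind the 8x upgrade.
single_gpu_hours = 63
single_gpu_rate  = 2.5                      # $/hr for one H100 PCIe -- assumed, not quoted
eight_gpu_rate   = 20.0                     # $/hr for 8x H100 SXM5 (from above)
eight_gpu_hours  = single_gpu_hours / 8     # optimistic: runs are embarrassingly parallel

print(f"1x: ~{single_gpu_hours}h, ~${single_gpu_hours * single_gpu_rate:.0f}")
print(f"8x: ~{eight_gpu_hours:.0f}h, ~${eight_gpu_hours * eight_gpu_rate:.0f}")
# Dollars come out roughly the same; the difference is ~3 days of wall clock vs. one workday.
```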
Parallelization issues:
- Thundering herd on C4 downloads
- CUDA init conflicts
- W&B API rate limiting
Fix: Stagger launches by 15 seconds.
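What that looks like in practice, roughly (the real launcher is run_experiments.sh; this Python equivalent uses illustrative run names and a hypothetical --run-name flag):

```python
import os
import subprocess
import time

runs = ["baseline_s42", "hc_d24_s42", "hc_d48_s42", "hc_d48_s123"]  # illustrative run names

procs = []
for gpu, name in enumerate(runs):
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)}  # pin each run to its own GPU
    cmd = ["python", "train_c4.py", "--run-name", name]     # flag name is hypothetical
    procs.append(subprocess.Popen(cmd, env=env))
    time.sleep(15)  # the 15s stagger: no simultaneous C4 downloads, CUDA init, or W&B auth bursts

for p in procs:
    p.wait()
```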
Jan 14: Hardware Failure (GPU 3)
One run (hc_d48_s123) kept dying with:
CUDA error: unrecognized error code
Spent ~1 hour chasing race conditions, tensor corruption, and numerical instability.
Reality: GPU 3 was just physically dead.
Evidence:
- Same run worked fine on other GPUs
- Other d48 runs succeeded elsewhere
Fix: Abandon GPU 3. Redistribute jobs.
Lesson: Cloud GPUs fail. Assume it will happen.
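One cheap way to assume it: smoke-test every GPU before dispatching jobs, so a dead card fails loudly up front instead of masquerading as a training bug. A plain-PyTorch sketch (not from the repo):

```python
import torch

def healthy_gpus() -> list[int]:
    """Return indices of GPUs that survive a trivial kernel launch + readback."""
    good = []
    for i in range(torch.cuda.device_count()):
        try:
            x = torch.randn(1024, 1024, device=f"cuda:{i}")
            (x @ x).sum().item()        # forces real compute and a device->host copy
            torch.cuda.synchronize(i)
            good.append(i)
        except RuntimeError as err:
            print(f"GPU {i} failed smoke test: {err}")
    return good

print("usable GPUs:", healthy_gpus())
```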
Jan 14–15: Batch Size Reality Check
Depth-48 models (~2.5B params) OOM'd instantly at batch size 8.
Fix: Drop to batch size 4.
Accepted tradeoff:
- Fewer tokens per step
- Slightly imperfect cross-depth comparison
- Still valid within each depth
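For the record, the tokens-per-step gap this creates (sequence length and the non-d48 batch size are assumed values; only d48 → 4 comes from above):

```python
seq_len = 2048                     # assumed context length
batch_size = {24: 8, 48: 4}        # depth -> per-GPU batch size; d48 value from above, d24 assumed

for depth, bs in batch_size.items():
    print(f"depth {depth}: {bs * seq_len:,} tokens/step")
# Within a depth, every run sees the same tokens/step, so within-depth comparisons hold;
# across depths, the d48 runs simply see half as many tokens per optimizer step.
```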
Next: The Bomb