Argus: Crack Detection Pipeline
Pipeline Setup
Built a full crack detection pipeline from scratch. Downloaded two datasets, wrote auto-labeling scripts, generated synthetic data, and started YOLOv8m fine-tuning.
Datasets:
- SDNET2018: bridge decks, walls, pavements (cracked/uncracked)
- DeepCrack: segmentation masks for crack boundaries
Final dataset after merging: 20,616 images (9,021 positive, 11,595 negative)
Split: 16,492 train / 2,060 val / 2,064 test
Data Preparation
Built perception/data_prep.py to auto-generate YOLO bbox labels:
SDNET2018 approach:
- Otsu thresholding on cracked images
- Connected components to extract crack regions
- Generated bounding boxes from components
DeepCrack approach:
- Mask to contour extraction
- Contour to bounding box conversion
Stratified 80/10/10 split by source. Wrote data/crack_detection.yaml for YOLO training.
Synthetic Augmentation
Built perception/synth_augment.py to generate synthetic cracks using OpenCV:
- Perlin-style noise textures
- Random walk cracks with momentum
- Headless rendering (no GUI)
Generated 2,000 synthetic images (1,600 train / 400 val). Merged into main dataset. Wrote data/crack_detection_augmented.yaml.
The narrative: test OpenCV synthetics first, measure delta mAP, use that to justify UE5 pipeline investment.
Training Infrastructure
Built perception/train_yolo.py:
- YOLOv8m fine-tuning
- MPS device acceleration (M4 Max)
- AdamW optimizer
Built perception/infer.py:
- Smoke test
detect_frame()API for downstream integration
Built tests/test_perception.py:
- Label format validation
- Synthetic generator tests
- API contract tests
MPS Bugs
Hit three PyTorch MPS bugs during training.
Bug 1: AMP Tensor Size Mismatch
Error: RuntimeError: size of tensor a must match tensor b
Fix: Disabled AMP on MPS (amp=False)
Bug 2: Advanced Indexing in TAL
Error: torch.AcceleratorError: index out of bounds in ultralytics/utils/tal.py:195
Fix: Monkey-patched TaskAlignedAssigner.get_box_metrics to route through CPU, move outputs back to MPS. PYTORCH_ENABLE_MPS_FALLBACK=1 did not fix this. The bug is in a supported op, not a fallback case.
Bug 3: OOM Kill During NMS
Error: Process exit 137 during validation
Fix: Added conf=0.25 and max_det=100 to training calls to cap prediction counts before NMS.
Training Status
Run run_20260401_185126: 6/10 epochs complete before process died.
Best mAP50 so far: 0.315 at epoch 6 (epoch 5: 0.310, trending up)
Checkpoint saved at last.pt, resumable.
Resume command:
python perception/train_yolo.py --quick --resume && python perception/train_yolo.py --epochs 50
The 50-epoch baseline run will start fresh from yolov8m.pt pretrained weights, not continue from epoch 10.
Key Decisions
- Used full dataset (18,090 train images) instead of
fraction=0.1quick mode due to resume code path bug. Treated as a happy accident since 10 epochs on full data is more valuable than a smoke test. - MPS monkey-patch preferred over CPU-only fallback. Keeps ~1.2 it/s throughput. Only the broken function routes to CPU.
- Resume bug identified:
--resumecode path didn’t passfractionparameter, so--quickflag had no effect when resuming.