Argus: Crack Detection Pipeline

April 2, 2026

Pipeline Setup

Built a full crack detection pipeline from scratch. Downloaded two datasets, wrote auto-labeling scripts, generated synthetic data, and started YOLOv8m fine-tuning.

Datasets:

SDNET2018: bridge decks, walls, pavements (cracked/uncracked)
DeepCrack: segmentation masks for crack boundaries

Final dataset after merging: 20,616 images (9,021 positive, 11,595 negative)

Split: 16,492 train / 2,060 val / 2,064 test

Data Preparation

Built perception/data_prep.py to auto-generate YOLO bbox labels:

SDNET2018 approach:

Otsu thresholding on cracked images
Connected components to extract crack regions
Generated bounding boxes from components

DeepCrack approach:

Mask to contour extraction
Contour to bounding box conversion

Stratified 80/10/10 split by source. Wrote data/crack_detection.yaml for YOLO training.

Synthetic Augmentation

Built perception/synth_augment.py to generate synthetic cracks using OpenCV:

Perlin-style noise textures
Random walk cracks with momentum
Headless rendering (no GUI)

Generated 2,000 synthetic images (1,600 train / 400 val). Merged into main dataset. Wrote data/crack_detection_augmented.yaml.

The narrative: test OpenCV synthetics first, measure delta mAP, use that to justify UE5 pipeline investment.

Training Infrastructure

Built perception/train_yolo.py:

YOLOv8m fine-tuning
MPS device acceleration (M4 Max)
AdamW optimizer

Built perception/infer.py:

Smoke test
detect_frame() API for downstream integration

Built tests/test_perception.py:

Label format validation
Synthetic generator tests
API contract tests

MPS Bugs

Hit three PyTorch MPS bugs during training.

Bug 1: AMP Tensor Size Mismatch

Error: RuntimeError: size of tensor a must match tensor b

Fix: Disabled AMP on MPS (amp=False)

Bug 2: Advanced Indexing in TAL

Error: torch.AcceleratorError: index out of bounds in ultralytics/utils/tal.py:195

Fix: Monkey-patched TaskAlignedAssigner.get_box_metrics to route through CPU, move outputs back to MPS. PYTORCH_ENABLE_MPS_FALLBACK=1 did not fix this. The bug is in a supported op, not a fallback case.

Bug 3: OOM Kill During NMS

Error: Process exit 137 during validation

Fix: Added conf=0.25 and max_det=100 to training calls to cap prediction counts before NMS.

Training Status

Run run_20260401_185126: 6/10 epochs complete before process died.

Best mAP50 so far: 0.315 at epoch 6 (epoch 5: 0.310, trending up)

Checkpoint saved at last.pt, resumable.

Resume command:

python perception/train_yolo.py --quick --resume && python perception/train_yolo.py --epochs 50

The 50-epoch baseline run will start fresh from yolov8m.pt pretrained weights, not continue from epoch 10.

Key Decisions

Used full dataset (18,090 train images) instead of fraction=0.1 quick mode due to resume code path bug. Treated as a happy accident since 10 epochs on full data is more valuable than a smoke test.
MPS monkey-patch preferred over CPU-only fallback. Keeps ~1.2 it/s throughput. Only the broken function routes to CPU.
Resume bug identified: --resume code path didn’t pass fraction parameter, so --quick flag had no effect when resuming.