Argus: Bridge Inspection RL Training
Phase 0 Results
Trained a single-drone PPO policy to cover a bridge surface using downward-facing inspection. The task: maximize coverage before battery dies, don’t crash.
Final metrics (500k steps, ~128s on CPU):
| Metric | Value |
|---|---|
| Mean coverage | 36.9% |
| Episode length | 574 steps |
| Collision rate | 0% |
| Terminal cause | Battery (100%) |
Coverage is low because one drone can't cover a 100m bridge in 620 steps; that's expected. The planned MARL fleet divides the bridge into sections, with each drone covering its assigned section, so 36.9% single-drone coverage scales to ~100% with N=3 drones.
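The planned partitioning is simple arithmetic. A minimal sketch, assuming equal contiguous sections along the bridge's x-axis (the function name is illustrative, not from the codebase):

```python
def bridge_sections(length_m: float = 100.0, n_drones: int = 3):
    """Split the bridge's x-axis into equal contiguous sections, one per drone."""
    return [(i * length_m / n_drones, (i + 1) * length_m / n_drones)
            for i in range(n_drones)]

# N=3 drones on a 100m bridge: each drone owns a ~33.3m span
sections = bridge_sections()
```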
The policy learned a lawnmower sweep pattern: serpentine back-and-forth across the y-axis while advancing in x. No path-planning code, just the reward signal.
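For reference, the emergent behavior resembles a hand-coded serpentine sweep. A sketch of that reference pattern, assuming the 100m x 30m bridge dimensions from this report (the generator is illustrative; it is not the policy):

```python
def lawnmower_waypoints(x_max=100.0, y_max=30.0, lane_spacing=5.0):
    """Serpentine sweep: traverse the full y extent, advance one lane in x, reverse."""
    waypoints = []
    x, direction = 0.0, 1
    while x <= x_max:
        # alternate sweep direction each lane
        y_pair = (0.0, y_max) if direction > 0 else (y_max, 0.0)
        waypoints.append((x, y_pair[0]))
        waypoints.append((x, y_pair[1]))
        x += lane_spacing
        direction = -direction
    return waypoints
```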
Reward Hacking
Found and fixed two exploits during training.
Exploit 1: Boundary Exit
Symptom: Every episode ended in a collision. Mean reward was +268 over ~167 steps, but the drone was deliberately flying out of bounds to terminate early after covering only 11% of the bridge.
Fix: Raised the collision penalty from 10 to 50. Added soft boundary shaping: -0.5 * (10 - dist_to_edge) per step when within 10m of any edge. Added a survival bonus of +0.05/step (later reduced; see Exploit 2).
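A minimal sketch of the boundary shaping term, using the weights above (the helper name, `dist_to_edge` computation, and bounds arguments are assumptions about the environment):

```python
def boundary_penalty(pos, bounds_min, bounds_max, margin=10.0, weight=0.5):
    """Soft boundary shaping: penalize proximity within `margin` of any edge."""
    # distance to the nearest edge in the x-y plane
    dist_to_edge = min(
        pos[0] - bounds_min[0], bounds_max[0] - pos[0],
        pos[1] - bounds_min[1], bounds_max[1] - pos[1],
    )
    if dist_to_edge >= margin:
        return 0.0
    return -weight * (margin - dist_to_edge)  # -0.5 * (10 - dist) per step
```

The penalty ramps linearly from 0 at 10m out to -5.0/step at the edge itself, so the gradient pushes the drone inward well before a hard termination.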
Exploit 2: Altitude Hovering
Symptom: After fixing exploit 1, the policy hovered at z=10m (above the 5m coverage ceiling) collecting survival bonus while covering zero cells.
Fix: Added an altitude penalty of -0.5 * max(0, z - 5.0) per step. At z=10m this costs -2.5/step, roughly -1550 over a full 620-step episode, which dwarfs the ~+30 total survival bonus. Cut the survival weight from 0.05 to 0.001. Lowered the reset altitude from 10m to 4.5m so the drone starts inside the coverage zone.
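The altitude term as stated, in a minimal sketch (the function name is illustrative):

```python
def altitude_penalty(z, ceiling=5.0, weight=0.5):
    """Penalize flying above the coverage ceiling; zero inside the coverage zone."""
    return -weight * max(0.0, z - ceiling)

altitude_penalty(10.0)  # -2.5 per step while hovering at z=10m
altitude_penalty(4.5)   # 0.0 at the new reset altitude
```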
Policy Behavior
Generated 6-panel trajectory visualization from deterministic eval:
- All 6 episodes are identical. Policy is fully deterministic.
- Coverage forms a contiguous block starting from center-left
- Drone runs out of battery at x=80m, leaving right 20m uncovered
- Boundary shaping creates visible gaps at y=0 and y=30 edges
Battery Scaling Experiment
Tested the trained policy with different battery capacities:
| Battery | Coverage | Notes |
|---|---|---|
| 1x (trained) | 36.9% | baseline |
| 2x | 50.9% | clean transfer; lawnmower extrapolates correctly |
| 4x | 7.6% | collapse |
The 4x case fails because battery observation goes out of distribution. Policy was trained on battery decaying from 1.0 to 0 over 620 steps. At 4x capacity, battery is still at 0.75 at step 620. Policy treats this as “early episode” and repeats start-of-episode behavior indefinitely.
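The out-of-distribution effect is just this arithmetic, assuming the battery observation is normalized remaining charge (the function name is illustrative):

```python
def battery_obs(step, capacity_steps):
    """Normalized battery observation: 1.0 at reset, 0.0 when depleted."""
    return max(0.0, 1.0 - step / capacity_steps)

# Training distribution: capacity = 620 steps, obs sweeps 1.0 -> 0.0
battery_obs(620, 620)      # 0.0 -- end of a trained episode
# 4x capacity: at the step count where training episodes ended,
# the observation still reads like the start of an episode
battery_obs(620, 4 * 620)  # 0.75
```

Because the trained policy only ever paired values near 0.75 with "fresh start" states, it repeats its opening behavior instead of continuing the sweep.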
The 2x result is operationally interesting: if the hardware allows a larger battery, it buys ~51% coverage with zero policy changes.
Key Decisions
- Phase 0 scope is correct. Single-drone coverage is sufficient for MARL fleet planning.
- Lowered curriculum expansion threshold from 0.25 to 0.20 (policy already averaging 0.184 in eval).
- Survival bonus stays at 0.001. Low enough not to incentivize hovering, high enough to discourage early termination.
- Did not implement progressive milestone bonuses. Coverage reward signal is dense enough after altitude exploit fix.
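The curriculum threshold change above amounts to a one-line gate. A sketch under the assumption that expansion is triggered by mean eval coverage (names are illustrative):

```python
CURRICULUM_EXPAND_THRESHOLD = 0.20  # lowered from 0.25

def should_expand(mean_eval_coverage: float) -> bool:
    """Expand the curriculum once eval coverage clears the threshold."""
    return mean_eval_coverage >= CURRICULUM_EXPAND_THRESHOLD

should_expand(0.184)  # False -- the current eval average sits just below the new gate
```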