Argus: Bridge Inspection RL Training

April 1, 2026

Phase 0 Results

Trained a single-drone PPO policy to cover a bridge surface using downward-facing inspection. The task: maximize coverage before battery dies, don’t crash.

Final metrics (500k steps, ~128s on CPU):

Metric           Value
Mean coverage    36.9%
Episode length   574 steps
Collision rate   0%
Terminal cause   Battery (100%)

Coverage is low because one drone can’t cover a 100m bridge on a 620-step battery budget. That’s expected. The planned MARL fleet divides the bridge into sections, with each drone covering its assigned section, so 36.9% single-drone coverage scales to ~100% with N=3 drones.
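As a minimal sketch of the sectioning scheme described above (the function name and equal-split assignment are assumptions; the actual MARL planner isn’t implemented yet):

```python
def assign_sections(bridge_length_m: float, n_drones: int) -> list[tuple[float, float]]:
    """Split the bridge span along x into equal sections, one per drone.

    Returns (x_start, x_end) ranges covering [0, bridge_length_m].
    """
    width = bridge_length_m / n_drones
    return [(i * width, (i + 1) * width) for i in range(n_drones)]

# For the 100m bridge with N=3, each drone gets a ~33m section,
# well within what a single 620-step battery budget can cover.
sections = assign_sections(100.0, 3)
```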

The policy learned a lawnmower sweep pattern: serpentine back-and-forth across the y-axis while advancing in x. No path-planning code, just the reward signal.


Reward Hacking

Found and fixed two exploits during training.

Exploit 1: Boundary Exit

Symptom: Every episode ended via collision. Mean reward was +268 over ~167 steps, but the drone deliberately exited bounds to terminate early after covering 11% of the bridge.

Fix: Raised collision penalty from 10 to 50. Added soft boundary shaping: -0.5 * (10 - dist_to_edge) per step when within 10m of any edge. Added a survival bonus of +0.05/step (later cut to 0.001; see Exploit 2).
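The boundary-shaping term above can be sketched as a single function (name and defaults are mine; weights are the values from the fix):

```python
def boundary_penalty(dist_to_edge: float, margin: float = 10.0, weight: float = 0.5) -> float:
    """Soft boundary shaping: -weight * (margin - d) per step when the drone
    is within `margin` metres of any edge, zero otherwise."""
    return -weight * max(0.0, margin - dist_to_edge)

# Ramps linearly from 0 at 10m out to -5.0 per step at the edge itself,
# so loitering near the boundary is costly but never a hard wall.
```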

Exploit 2: Altitude Hovering

Symptom: After fixing exploit 1, the policy hovered at z=10m (above the 5m coverage ceiling) collecting survival bonus while covering zero cells.

Fix: Added altitude penalty -0.5 * max(0, z - 5.0) per step. At z=10m this costs 2.5 per step, far outweighing the survival bonus. Cut survival weight from 0.05 to 0.001. Lowered reset altitude from 10m to 4.5m so the drone starts inside the coverage zone.
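The altitude term is similarly small. A sketch, using the weights from the fix (function name is mine; the real reward code isn’t shown here):

```python
SURVIVAL_BONUS = 0.001  # per-step, after the cut from 0.05

def altitude_penalty(z: float, ceiling: float = 5.0, weight: float = 0.5) -> float:
    """Penalize flying above the 5m coverage ceiling; free below it."""
    return -weight * max(0.0, z - ceiling)

# Hovering at z=10m costs -2.5/step vs +0.001/step survival bonus,
# so the hover exploit is strictly unprofitable.
```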


Policy Behavior

Generated 6-panel trajectory visualization from deterministic eval:

  • All 6 episodes are identical; the policy is fully deterministic.
  • Coverage forms a contiguous block starting from center-left.
  • The drone runs out of battery at x=80m, leaving the right 20m uncovered.
  • Boundary shaping creates visible gaps at the y=0 and y=30 edges.

Battery Scaling Experiment

Tested the trained policy with different battery capacities:

  • 1x battery (trained): 36.9% coverage
  • 2x battery: 50.9% coverage (clean transfer; the lawnmower extrapolates correctly)
  • 4x battery: 7.6% coverage (collapse)

The 4x case fails because battery observation goes out of distribution. Policy was trained on battery decaying from 1.0 to 0 over 620 steps. At 4x capacity, battery is still at 0.75 at step 620. Policy treats this as “early episode” and repeats start-of-episode behavior indefinitely.
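The out-of-distribution failure is just arithmetic on the battery observation. A sketch, assuming battery is exposed to the policy as a linear fraction of capacity (function name is mine):

```python
def battery_obs(step: int, capacity_steps: int) -> float:
    """Normalized battery level as the policy observes it: 1.0 full, 0.0 empty."""
    return max(0.0, 1.0 - step / capacity_steps)

# Trained regime: battery sweeps the full [1.0, 0.0] range over 620 steps.
# At 4x capacity, step 620 still reads 0.75 -- a level the policy only ever
# saw near the start of training episodes, so it loops start-of-episode behavior.
```

Normalizing by the true capacity at reset (or conditioning on remaining steps instead of a fraction) would likely remove this failure mode, but that’s untested here.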

The 2x result is operationally interesting: if the hardware allows it, ~51% coverage with zero policy changes.


Key Decisions

  • Phase 0 scope is correct. Single-drone coverage is sufficient for MARL fleet planning.
  • Lowered curriculum expansion threshold from 0.25 to 0.20 (policy already averaging 0.184 in eval).
  • Survival bonus stays at 0.001. Low enough not to incentivize hovering, high enough to discourage early termination.
  • Did not implement progressive milestone bonuses. Coverage reward signal is dense enough after altitude exploit fix.