Argus: Bridge Inspection RL Training
Phase 0 Results
Trained a single-drone PPO policy to cover a bridge surface using downward-facing inspection. The task: maximize coverage before battery dies, don’t crash.
Final metrics (500k steps, ~128s on CPU):
| Metric | Value |
|---|---|
| Mean coverage | 36.9% |
| Episode length | 574 steps |
| Collision rate | 0% |
| Terminal cause | Battery (100%) |
Coverage is low because one drone can't cover a 100m bridge in 620 steps; that's expected. The planned MARL fleet divides the bridge into sections, with each drone covering its assigned section, so 36.9% single-drone coverage scales to ~100% with N=3 drones.
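The planned partitioning is simple arithmetic. A minimal sketch, assuming equal contiguous sections along the bridge's x-axis (the function name is illustrative, not from the codebase):

```python
def bridge_sections(length_m: float = 100.0, n_drones: int = 3):
    """Split the bridge's x-axis into equal contiguous sections, one per drone."""
    return [(i * length_m / n_drones, (i + 1) * length_m / n_drones)
            for i in range(n_drones)]

# N=3 drones on a 100m bridge: each drone owns a ~33.3m span
sections = bridge_sections()
```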
The policy learned a lawnmower sweep pattern: serpentine back-and-forth across the y-axis while advancing in x. No path-planning code, just the reward signal.
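For reference, the emergent behavior resembles a hand-coded serpentine sweep. A sketch of that reference pattern, assuming the 100m x 30m bridge dimensions from this report (the generator is illustrative; it is not the policy):

```python
def lawnmower_waypoints(x_max=100.0, y_max=30.0, lane_spacing=5.0):
    """Serpentine sweep: traverse the full y extent, advance one lane in x, reverse."""
    waypoints = []
    x, direction = 0.0, 1
    while x <= x_max:
        # alternate sweep direction each lane
        y_pair = (0.0, y_max) if direction > 0 else (y_max, 0.0)
        waypoints.append((x, y_pair[0]))
        waypoints.append((x, y_pair[1]))
        x += lane_spacing
        direction = -direction
    return waypoints
```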
Reward Hacking
Found and fixed two exploits during training.
Exploit 1: Boundary Exit
Symptom: Every episode ended in a collision. Mean reward was +268 over ~167 steps, but the drone was deliberately flying out of bounds to terminate early after covering only 11% of the bridge.
Fix: Raised the collision penalty from 10 to 50. Added soft boundary shaping: -0.5 * (10 - dist_to_edge) per step when within 10m of any edge. Added a survival bonus of +0.05/step (later reduced; see Exploit 2).
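A minimal sketch of the boundary shaping term, using the weights above (the helper name, `dist_to_edge` computation, and bounds arguments are assumptions about the environment):

```python
def boundary_penalty(pos, bounds_min, bounds_max, margin=10.0, weight=0.5):
    """Soft boundary shaping: penalize proximity within `margin` of any edge."""
    # distance to the nearest edge in the x-y plane
    dist_to_edge = min(
        pos[0] - bounds_min[0], bounds_max[0] - pos[0],
        pos[1] - bounds_min[1], bounds_max[1] - pos[1],
    )
    if dist_to_edge >= margin:
        return 0.0
    return -weight * (margin - dist_to_edge)  # -0.5 * (10 - dist) per step
```

The penalty ramps linearly from 0 at 10m out to -5.0/step at the edge itself, so the gradient pushes the drone inward well before a hard termination.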
Exploit 2: Altitude Hovering
Symptom: After fixing exploit 1, the policy hovered at z=10m (above the 5m coverage ceiling) collecting survival bonus while covering zero cells.
Fix: Added an altitude penalty of -0.5 * max(0, z - 5.0) per step. At z=10m this costs -2.5/step, roughly -1550 over a full 620-step episode, which dwarfs the ~+30 total survival bonus. Cut the survival weight from 0.05 to 0.001. Lowered the reset altitude from 10m to 4.5m so the drone starts inside the coverage zone.
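The altitude term as stated, in a minimal sketch (the function name is illustrative):

```python
def altitude_penalty(z, ceiling=5.0, weight=0.5):
    """Penalize flying above the coverage ceiling; zero inside the coverage zone."""
    return -weight * max(0.0, z - ceiling)

altitude_penalty(10.0)  # -2.5 per step while hovering at z=10m
altitude_penalty(4.5)   # 0.0 at the new reset altitude
```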
Policy Behavior
Generated 6-panel trajectory visualization from deterministic eval:
- All 6 episodes are identical. Policy is fully deterministic.
- Coverage forms a contiguous block starting from center-left
- Drone runs out of battery at x=80m, leaving right 20m uncovered
- Boundary shaping creates visible gaps at y=0 and y=30 edges
Battery Scaling Experiment
Tested the trained policy with different battery capacities:
| Battery | Coverage | Notes |
|---|---|---|
| 1x (trained) | 36.9% | baseline |
| 2x | 50.9% | clean transfer; lawnmower extrapolates correctly |
| 4x | 7.6% | collapse |
The 4x case fails because battery observation goes out of distribution. Policy was trained on battery decaying from 1.0 to 0 over 620 steps. At 4x capacity, battery is still at 0.75 at step 620. Policy treats this as “early episode” and repeats start-of-episode behavior indefinitely.
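The out-of-distribution effect is just this arithmetic, assuming the battery observation is normalized remaining charge (the function name is illustrative):

```python
def battery_obs(step, capacity_steps):
    """Normalized battery observation: 1.0 at reset, 0.0 when depleted."""
    return max(0.0, 1.0 - step / capacity_steps)

# Training distribution: capacity = 620 steps, obs sweeps 1.0 -> 0.0
battery_obs(620, 620)      # 0.0 -- end of a trained episode
# 4x capacity: at the step count where training episodes ended,
# the observation still reads like the start of an episode
battery_obs(620, 4 * 620)  # 0.75
```

Because the trained policy only ever paired values near 0.75 with "fresh start" states, it repeats its opening behavior instead of continuing the sweep.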
The 2x result is operationally interesting: if the hardware allows a larger battery, it buys ~51% coverage with zero policy changes.
Key Decisions
- Phase 0 scope is correct. Single-drone coverage is sufficient for MARL fleet planning.
- Lowered curriculum expansion threshold from 0.25 to 0.20 (policy already averaging 0.184 in eval).
- Survival bonus stays at 0.001. Low enough not to incentivize hovering, high enough to discourage early termination.
- Did not implement progressive milestone bonuses. Coverage reward signal is dense enough after altitude exploit fix.
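The curriculum threshold change above amounts to a one-line gate. A sketch under the assumption that expansion is triggered by mean eval coverage (names are illustrative):

```python
CURRICULUM_EXPAND_THRESHOLD = 0.20  # lowered from 0.25

def should_expand(mean_eval_coverage: float) -> bool:
    """Expand the curriculum once eval coverage clears the threshold."""
    return mean_eval_coverage >= CURRICULUM_EXPAND_THRESHOLD

should_expand(0.184)  # False -- the current eval average sits just below the new gate
```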