DeepSky ATC: Reward Function Breakthrough
The Critical Bugs
After an external PhD review, I identified three bugs in the reward function that were making multi-agent coordination impossible.
Bug 1: The 5km/3km Dead Zone
Problem: The terminal exception (penalty-free zone) began at 3km, but agents despawned at 5km. A mathematical impossibility: agents were removed before they could ever reach the penalty-free merge zone.
Stage 3 (8 agents) showed -5,790 average reward. Catastrophic.
Fix:
- Terminal exception: moved from 3km out to 5km
- Despawn: moved from 5km in to 2km (Stages 2-4)
- Creates 3km penalty-free buffer for zipper merge
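The fixed geometry can be sketched as follows. This is a minimal sketch: the constant and function names are assumptions for illustration, not the project's actual code.

```python
# Assumed names; illustrates the fixed Stage 2-4 geometry described above.
TERMINAL_EXCEPTION_KM = 5.0  # penalties stop applying inside this radius (was 3km)
DESPAWN_KM = 2.0             # arrival/despawn radius (was 5km)

def separation_penalty_active(dist_to_goal_km: float) -> bool:
    """Separation penalties apply only outside the terminal exception radius."""
    return dist_to_goal_km > TERMINAL_EXCEPTION_KM

# The 5km -> 2km band is a 3km penalty-free buffer for the zipper merge.
# Under the old config (exception 3km, despawn 5km), agents despawned
# before ever entering the exception zone, so the buffer was unreachable.
```

With the old values swapped in, `separation_penalty_active` would still return True at every distance an agent could actually occupy, which is the dead zone in one line.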
Bug 2: Flat Penalty Provides No Gradient
Problem: A flat -20 penalty gave PPO no gradient to optimize along. The agent couldn't learn how bad different separation distances were, only that it had failed.
Fix: Distance-based penalty.
def separation_penalty(separation_nm: float) -> float:
    # Old: flat -20 regardless of distance (no gradient)
    # New: penalty scales with proximity, clamped below 0.5 NM
    return -5.0 / max(separation_nm, 0.5)

# At 1 NM: -5.0/step (harsh)
# At 3 NM: -1.7/step (mild)
# At 5 NM: -1.0/step (very mild)
Bug 3: Reward Sparsity in Stage 1
Problem: A universal 2km arrival threshold made Stage 1 learning impossible. Rewards dropped from +498 to -197.
The 2km target has 1/6.25 the area of the 5km target ((5/2)^2 = 6.25), so random exploration couldn't discover the reward signal.
Fix: Stage-dependent thresholds.
- Stage 1: 5km arrival (large target for exploration)
- Stages 2-4: 2km arrival + 5km exception (tight coordination)
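The stage-dependent thresholds above can be sketched as a config table. The dictionary layout and names are assumptions, not the project's actual structure.

```python
# Assumed config layout; values mirror the stage thresholds listed above.
ARRIVAL_KM = {1: 5.0, 2: 2.0, 3: 2.0, 4: 2.0}
TERMINAL_EXCEPTION_KM = {1: None, 2: 5.0, 3: 5.0, 4: 5.0}

def arrived(stage: int, dist_km: float) -> bool:
    """Dense reward in Stage 1 via a large target; tight in Stages 2-4."""
    return dist_km <= ARRIVAL_KM[stage]
```

Keeping the thresholds in one table makes the exploration/precision decoupling explicit: Stage 1 trades precision for reward density, later stages reverse the trade.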
Validation Results
Stage 1: Single Agent
Iteration 23: +496 reward (matches baseline)
Success: Agent reliably reaches 5km goal zone
Stage 2: Two Agent Conflict
Iteration 1: -440 reward
Iteration 8: +429 reward
SUCCESS: Rapid learning confirms gradient-based penalty works
PhD verdict: “This is excellent engineering. You are cleared for the 22-hour run.”
Pattern Recognition
Dead zones are catastrophic. Any mathematical impossibility in the reward structure breaks learning.
Gradient over binary. Distance-based penalties teach agents “how to improve”, not just “you failed”.
Sparsity breaks exploration. Stage 1 needs dense rewards. Later stages can be tighter.
Decouple exploration from precision. Early stages need easy targets. Later stages need tight coordination.