DeepSky ATC: Reward Function Breakthrough

March 15, 2026

The Critical Bugs

After external PhD review, I identified three bugs in the reward function that were making multi-agent coordination impossible.

Bug 1: The 5km/3km Dead Zone

Problem: The terminal exception zone began at 3km, but agents despawned at 5km. A mathematical impossibility: agents were removed before they could ever reach the penalty-free merge zone.

Stage 3 (8 agents) showed -5,790 average reward. Catastrophic.

Fix:

  • Terminal exception: widened from 3km to 5km
  • Despawn: tightened from 5km to 2km (Stages 2-4)
  • This creates a 3km penalty-free buffer for the zipper merge
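A minimal sketch of the corrected threshold logic. Constant and function names here are assumed for illustration; the distances are the ones from the fix above.

```python
# Hypothetical names; values are from the fix above.
TERMINAL_EXCEPTION_KM = 5.0  # was 3.0: separation penalties waived inside this radius
DESPAWN_KM = 2.0             # was 5.0: agents now survive into the merge zone

def penalty_waived(dist_to_goal_km: float) -> bool:
    # Inside the terminal exception zone, no separation penalty applies.
    return dist_to_goal_km <= TERMINAL_EXCEPTION_KM

def despawned(dist_to_goal_km: float) -> bool:
    return dist_to_goal_km <= DESPAWN_KM

# An agent at 4km is now inside the penalty-free buffer and still alive;
# under the old 3km/5km configuration that state could not exist.
```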

Bug 2: Flat Penalty Provides No Gradient

Problem: The binary -20 penalty gave PPO no gradient to optimize. The agent couldn’t learn “how bad” different separation distances were.

Fix: Distance-based penalty.

# Old: -20 (flat)
# New: -5.0 / max(separation_nm, 0.5)
# At 1 NM: -5.0/step (harsh)
# At 3 NM: -1.7/step (mild)
# At 5 NM: -1.0/step (very mild)
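As a quick sanity check of the formula above (the function name is a hypothetical sketch; the formula is the one from the fix):

```python
def separation_penalty(separation_nm: float) -> float:
    # Clamping separation at 0.5 NM bounds the worst case at -10/step,
    # while the 1/x shape gives PPO a smooth gradient toward
    # larger separation.
    return -5.0 / max(separation_nm, 0.5)

# Closer separations are now penalized strictly more than farther ones,
# unlike the old flat -20.
assert separation_penalty(1.0) < separation_penalty(3.0) < separation_penalty(5.0)
```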

Bug 3: Reward Sparsity in Stage 1

Problem: A universal 2km arrival threshold made Stage 1 learning impossible. Rewards dropped from 498 to -197.

The 2km target has 6.25x less area than the 5km one, so random exploration couldn’t discover the reward signal.
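The 6.25x figure follows directly from the radii, since the π factor cancels in the area ratio:

```python
import math

area_5km = math.pi * 5.0 ** 2  # ~78.5 km^2
area_2km = math.pi * 2.0 ** 2  # ~12.6 km^2
print(area_5km / area_2km)  # 6.25, i.e. (5/2)**2
```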

Fix: Stage-dependent thresholds.

  • Stage 1: 5km arrival (large target for exploration)
  • Stages 2-4: 2km arrival + 5km exception (tight coordination)
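The stage-dependent thresholds can be sketched as a simple lookup (mapping and function names are assumed; values are from the fix):

```python
# Arrival threshold per curriculum stage, in km (hypothetical names).
ARRIVAL_THRESHOLD_KM = {1: 5.0, 2: 2.0, 3: 2.0, 4: 2.0}

def arrived(stage: int, dist_to_goal_km: float) -> bool:
    # Stage 1 keeps a large 5km target so random exploration can find
    # the reward signal; Stages 2-4 tighten to 2km for coordination.
    return dist_to_goal_km <= ARRIVAL_THRESHOLD_KM[stage]
```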

Validation Results

Stage 1: Single Agent

  • Iteration 23: +496 reward (matches baseline)
  • Success: agent reliably reaches the 5km goal zone

Stage 2: Two Agent Conflict

  • Iteration 1: -440 reward
  • Iteration 8: +429 reward
  • SUCCESS: rapid learning confirms the gradient-based penalty works

PhD verdict: “This is excellent engineering. You are cleared for the 22-hour run.”

Pattern Recognition

Dead zones are catastrophic. Any mathematical impossibility in reward structure breaks learning.

Gradient over binary. Distance-based penalties teach agents “how to improve”, not just “you failed”.

Sparsity breaks exploration. Stage 1 needs dense rewards. Later stages can be tighter.

Decouple exploration from precision. Early stages need easy targets so agents can find the reward at all; later stages can demand tight coordination.