MARL Cooperative Navigation: Reward Shaping and Local Optima
The Problem Appears
Training ran, but behavior stalled:
- Returns climbed steadily
- Collisions dropped
- Agents navigated competently
- Success rate stayed low (10–20%)
- Coverage plateaued (~35–45%)
This is an especially deceptive MARL failure mode: every agent looks competent on its own, but the team rarely covers all landmarks at the same time.
Reward Shaping Failures (In Order)
Failure 1: Dense Coverage Reward
Symptom: Agents rush to the first landmark, then park.
Cause: The per-step coverage reward kept paying agents for camping on a landmark they had already reached.
Fix: Make the coverage reward one-time only.
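A minimal sketch of the one-time fix, assuming coverage is available as one boolean flag per landmark and the tracker is reset every episode. The class OneTimeCoverage and its methods are hypothetical names, not the project's actual code; only the +10 value comes from the reward structure used here.

```python
import numpy as np


class OneTimeCoverage:
    """Pays coverage_reward only the first time each landmark is covered.

    Hypothetical helper for illustration; not the project's implementation.
    """

    def __init__(self, n_landmarks, coverage_reward=10.0):
        self.coverage_reward = coverage_reward
        # True once a landmark has already paid out in this episode.
        self.rewarded = np.zeros(n_landmarks, dtype=bool)

    def reset(self):
        """Call at the start of every episode."""
        self.rewarded[:] = False

    def step(self, covered_now):
        """covered_now: one boolean per landmark for the current step."""
        covered_now = np.asarray(covered_now, dtype=bool)
        newly_covered = covered_now & ~self.rewarded
        self.rewarded |= newly_covered
        return self.coverage_reward * int(newly_covered.sum())
```

Because a parked agent no longer produces new payouts, camping stops being profitable. That is also what opens the door to Failure 2.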
Failure 2: No Persistence
Symptom: Agents tag landmarks, then wander off.
Cause: No incentive to stay once the one-time reward is collected.
Fix: Add a small dense hold reward.
Failure 3: False Distance Incentives
Symptom: Agents keep optimizing distance shaping while already covering a landmark.
Cause: The distance reward included already-covered landmarks.
Fix: Apply distance shaping only to landmarks that are still uncovered.
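One way to express this fix, sketched under assumptions: shaping is taken to be the summed distance from each landmark to its nearest agent (a common form for this task; the exact form used in this project is not stated here), and a boolean covered mask is available. The function distance_shaping and its arguments are hypothetical names.

```python
import numpy as np


def distance_shaping(agent_pos, landmark_pos, covered, scale=-1.0):
    """Distance shaping restricted to uncovered landmarks.

    agent_pos:    (n_agents, 2) positions
    landmark_pos: (n_landmarks, 2) positions
    covered:      (n_landmarks,) booleans, True if currently covered
    """
    agent_pos = np.asarray(agent_pos, dtype=float)
    landmark_pos = np.asarray(landmark_pos, dtype=float)
    covered = np.asarray(covered, dtype=bool)

    if covered.all():
        return 0.0  # nothing left to shape towards

    # Pairwise distances, shape (n_landmarks, n_agents).
    diffs = landmark_pos[:, None, :] - agent_pos[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)

    # Distance from each landmark to its nearest agent.
    nearest = dists.min(axis=1)

    # Covered landmarks are masked out so they contribute nothing.
    return float(scale * nearest[~covered].sum())
```

With the mask in place, an agent sitting on a landmark no longer gets a gradient for shuffling closer to its center.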
Failure 4: Weak Global Signal
Symptom: Agents avoid risky coordination.
Cause: The terminal bonus was too small relative to the shaping rewards.
Fix: Increase all_covered_bonus (50 → 150).
Observed Learning Dynamics
With fixes applied:
- Agents find landmarks reliably
- Collisions decline
- Partial coordination emerges
- Occasional success spikes (~25–26%)
- Policy fails to lock in stable simultaneous coverage
Returns remain high (~330–350), which is misleading because the success rate stays low.
This is a local optimum: each agent does well individually and avoids the risk of coordinating.
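Since return alone hides this, it helps to log task metrics next to it. A minimal sketch, assuming each episode records its return, whether all landmarks were covered simultaneously, and how many landmarks were covered at episode end. EpisodeLog and summarize are hypothetical names, and the coverage definition used here is an assumption.

```python
from dataclasses import dataclass


@dataclass
class EpisodeLog:
    """Hypothetical per-episode record; field names are illustrative."""
    episode_return: float
    all_covered: bool      # all landmarks covered at the same time
    covered_at_end: int    # landmarks covered when the episode ended
    n_landmarks: int


def summarize(logs):
    """Report return, success rate, and coverage side by side."""
    if not logs:
        raise ValueError("need at least one episode")
    n = len(logs)
    return {
        "mean_return": sum(e.episode_return for e in logs) / n,
        "success_rate": sum(e.all_covered for e in logs) / n,
        "mean_coverage": sum(e.covered_at_end / e.n_landmarks for e in logs) / n,
    }
```

Plotting all three on the same training axis makes the gap between high return and low success visible without replaying episodes.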
Current Reward Structure (Phase 1)
distance_reward_scale: -1.0 (only uncovered landmarks)
coverage_reward: +10 (one-time)
hold_reward: +0.05 / step
all_covered_bonus: +150
step_penalty: -0.1
collision_penalty: off
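The same settings as a config object, plus a sketch of how the terms might combine into one per-step reward. Assumptions: the reward is shared by the team, the hold reward is paid per held landmark per step, and the bonus is paid on the step where all landmarks are covered. RewardConfig and team_reward are hypothetical names; only the numeric values come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardConfig:
    distance_reward_scale: float = -1.0  # applied to uncovered landmarks only
    coverage_reward: float = 10.0        # one-time per landmark
    hold_reward: float = 0.05            # per step
    all_covered_bonus: float = 150.0
    step_penalty: float = -0.1
    collision_penalty: float = 0.0       # off


def team_reward(cfg, dist_to_uncovered, n_newly_covered, n_held,
                all_covered, n_collisions=0):
    """Combine the Phase 1 terms for a single step.

    dist_to_uncovered: summed distance to still-uncovered landmarks
    n_newly_covered:   landmarks covered for the first time this step
    n_held:            previously rewarded landmarks still covered
    all_covered:       True if every landmark is covered right now
    """
    reward = cfg.step_penalty
    reward += cfg.distance_reward_scale * dist_to_uncovered
    reward += cfg.coverage_reward * n_newly_covered
    reward += cfg.hold_reward * n_held
    reward += cfg.collision_penalty * n_collisions
    if all_covered:
        reward += cfg.all_covered_bonus
    return reward
```

Keeping the weights in one frozen object makes it easier to diff reward structures between phases.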
Despite careful shaping, the task remains difficult. This is expected.
Key Lessons (So Far)
This benchmark is not easy. Cooperative Navigation is deceptively hard, even with MAPPO.
High return ≠ success. Policies can look “good” while failing the task.
Reward shaping is fragile. Every dense reward introduces a new exploit.
Visualization matters. Without the Unreal visualization, these runs would look “fine” on paper.
This matches the literature. These are known, unresolved MARL problems, now experienced firsthand.
Why This Still Matters
Even without “solving” the task, this project already demonstrates:
- Real MARL failure modes
- Correct MAPPO implementation
- End-to-end training + visualization
- Why attention, asymmetry, or curriculum are needed next