MARL Cooperative Navigation: Reward Hacking & Phase 1 Solved
Focus: Debugging reward shaping, stabilizing 2-agent learning.
Returns looked healthy (320–350). Success rate stuck at 10–20%. Agents were gaming the shaping, not solving the task.
Classic reward hacking. Agents camped near landmarks to farm dense rewards without coordinating final coverage. High return does not mean task solved.
Experiments & Fixes
All-covered bonus increased (50 → 150)
Short-term bump (~26% success), then regression. Dense shaping still dominated.
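A sketch of the all-covered bonus as a sparse term paid when every landmark is covered simultaneously (names `landmarks`, `reward`, `ALL_COVERED_BONUS` are hypothetical):

```python
ALL_COVERED_BONUS = 150.0   # raised from 50.0 in this experiment

# Sparse bonus: only fires when every landmark is covered at once.
if all(lm.covered for lm in landmarks):
    reward += ALL_COVERED_BONUS
```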
Hold reward reduced (0.2 → 0.05)
Returns stayed high, success stayed low. Still gaming distance shaping.
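A sketch of the hold reward as a dense per-step term for staying on a covered landmark (my reading of the name; field names match the shaping code below):

```python
HOLD_REWARD = 0.05   # reduced from 0.2 in this experiment

# Paid every step an agent keeps covering a landmark -- easy to farm
# without ever reaching full coverage.
if lm.covered and lm.covered_by == agent.id:
    reward += HOLD_REWARD
```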
Distance shaping bug fix (critical)
# Before (buggy): also paid distance shaping for the landmark this agent was already covering
if not lm.covered or lm.covered_by == agent.id:
    reward += distance_shaping(agent, lm)  # dense proximity term (illustrative)
# After (correct): only uncovered landmarks contribute to distance shaping
if not lm.covered:
    reward += distance_shaping(agent, lm)
Agents were getting distance reward for landmarks they were already covering.
Minimal symmetry breaking
Added agent_id / (N-1) scalar to observations. obs_dim: 4 → 5. Enables role differentiation without central control.
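A minimal sketch of the identity augmentation, assuming a flat NumPy observation and hypothetical names (`augment_obs`, `base_obs`, `n_agents`):

```python
import numpy as np

def augment_obs(base_obs: np.ndarray, agent_id: int, n_agents: int) -> np.ndarray:
    """Append a normalized identity scalar so otherwise symmetric agents can specialize."""
    identity = agent_id / (n_agents - 1)            # 0.0 for agent 0, 1.0 for the last agent
    return np.concatenate([base_obs, [identity]])   # obs_dim: 4 -> 5

# Usage: augment_obs(np.zeros(4), agent_id=1, n_agents=2) -> array of shape (5,)
```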
Global Minimum Distance shaping (breakthrough)
For each landmark, compute distance to closest agent. Shared reward = -sum(distances) / N. Removes reward cliffs when landmarks flip covered/uncovered.
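A sketch of the shared shaping term, assuming positions are NumPy arrays and N is the number of agents (the helper name is hypothetical):

```python
import numpy as np

def global_min_distance_shaping(agent_pos: np.ndarray, landmark_pos: np.ndarray) -> float:
    """Shared reward: per-landmark distance to the closest agent, summed, negated, divided by N.

    agent_pos: (N, 2) agent positions; landmark_pos: (M, 2) landmark positions.
    """
    # dists[i, j] = distance from landmark i to agent j
    dists = np.linalg.norm(landmark_pos[:, None, :] - agent_pos[None, :, :], axis=-1)
    # Each landmark is scored by its nearest agent only, so the signal stays smooth
    # when a landmark flips between covered and uncovered.
    return -dists.min(axis=1).sum() / agent_pos.shape[0]
```

Because the term depends only on positions, it changes smoothly as agents move, unlike per-agent shaping gated on coverage flags.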
Collision penalty enabled (-1.0)
Collisions dropped from ~17 to <1 per episode. Movement became deliberate.
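A sketch of a pairwise collision penalty; the collision radius and threshold are assumptions (the notes only specify the -1.0 penalty):

```python
import numpy as np

def collision_penalties(agent_pos: np.ndarray, radius: float = 0.15, penalty: float = -1.0) -> np.ndarray:
    """Per-agent penalty of -1.0 for each other agent closer than 2 * radius."""
    dists = np.linalg.norm(agent_pos[:, None, :] - agent_pos[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                 # ignore self-distance
    return penalty * (dists < 2 * radius).sum(axis=1)
```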
Final Phase 1 Config
- Global Min Distance shaping
- One-time coverage reward (+10)
- Collision penalty (-1.0)
- No dense hold reward
- Agent identity enabled
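Summarized as a config dict (the keys are hypothetical; the values come from the notes above):

```python
PHASE1_REWARD_CONFIG = {
    "global_min_distance_shaping": True,   # shared -sum(min distances) / N term
    "coverage_bonus": 10.0,                # one-time reward per newly covered landmark
    "collision_penalty": -1.0,
    "hold_reward": 0.0,                    # dense hold reward removed
    "agent_identity": True,                # append agent_id / (N - 1) to obs (dim 4 -> 5)
}
```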
Result: Phase 1 solved. Return ~326, collisions ~0.6 per episode, clean landmark splits between the agents.
Lesson: High returns with low success means your reward is lying to you.