MARL Cooperative Navigation: Reward Hacking & Phase 1 Solved
Focus: Debugging reward shaping, stabilizing 2-agent learning.
Returns looked healthy (320–350). Success rate stuck at 10–20%. Agents were gaming the shaping, not solving the task.
Classic reward hacking. Agents camped near landmarks to farm dense rewards without coordinating final coverage. High return does not mean task solved.
Experiments & Fixes
All-covered bonus increased (50 → 150)
Short-term bump (~26% success), then regression. Dense shaping still dominated.
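A sketch of the all-covered bonus as a sparse term paid when every landmark is covered simultaneously (names `landmarks`, `reward`, `ALL_COVERED_BONUS` are hypothetical):

```python
ALL_COVERED_BONUS = 150.0   # raised from 50.0 in this experiment

# Sparse bonus: only fires when every landmark is covered at once.
if all(lm.covered for lm in landmarks):
    reward += ALL_COVERED_BONUS
```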
Hold reward reduced (0.2 → 0.05)
Returns stayed high, success stayed low. Still gaming distance shaping.
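A sketch of the hold reward as a dense per-step term for staying on a covered landmark (my reading of the name; field names match the shaping code below):

```python
HOLD_REWARD = 0.05   # reduced from 0.2 in this experiment

# Paid every step an agent keeps covering a landmark -- easy to farm
# without ever reaching full coverage.
if lm.covered and lm.covered_by == agent.id:
    reward += HOLD_REWARD
```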
Distance shaping bug fix (critical)
# Before (buggy): also paid distance shaping for the landmark this agent was already covering
if not lm.covered or lm.covered_by == agent.id:
    reward += distance_shaping(agent, lm)  # dense proximity term (illustrative)
# After (correct): only uncovered landmarks contribute to distance shaping
if not lm.covered:
    reward += distance_shaping(agent, lm)
Agents were getting distance reward for landmarks they were already covering.
Minimal symmetry breaking
Added agent_id / (N-1) scalar to observations. obs_dim: 4 → 5. Enables role differentiation without central control.
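A minimal sketch of the identity augmentation, assuming a flat NumPy observation and hypothetical names (`augment_obs`, `base_obs`, `n_agents`):

```python
import numpy as np

def augment_obs(base_obs: np.ndarray, agent_id: int, n_agents: int) -> np.ndarray:
    """Append a normalized identity scalar so otherwise symmetric agents can specialize."""
    identity = agent_id / (n_agents - 1)            # 0.0 for agent 0, 1.0 for the last agent
    return np.concatenate([base_obs, [identity]])   # obs_dim: 4 -> 5

# Usage: augment_obs(np.zeros(4), agent_id=1, n_agents=2) -> array of shape (5,)
```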
Global Minimum Distance shaping (breakthrough)
For each landmark, compute distance to closest agent. Shared reward = -sum(distances) / N. Removes reward cliffs when landmarks flip covered/uncovered.
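A sketch of the shared shaping term, assuming positions are NumPy arrays and N is the number of agents (the helper name is hypothetical):

```python
import numpy as np

def global_min_distance_shaping(agent_pos: np.ndarray, landmark_pos: np.ndarray) -> float:
    """Shared reward: per-landmark distance to the closest agent, summed, negated, divided by N.

    agent_pos: (N, 2) agent positions; landmark_pos: (M, 2) landmark positions.
    """
    # dists[i, j] = distance from landmark i to agent j
    dists = np.linalg.norm(landmark_pos[:, None, :] - agent_pos[None, :, :], axis=-1)
    # Each landmark is scored by its nearest agent only, so the signal stays smooth
    # when a landmark flips between covered and uncovered.
    return -dists.min(axis=1).sum() / agent_pos.shape[0]
```

Because the term depends only on positions, it changes smoothly as agents move, unlike per-agent shaping gated on coverage flags.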
Collision penalty enabled (-1.0)
Collisions dropped from ~17 to <1 per episode. Movement became deliberate.
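A sketch of a pairwise collision penalty; the collision radius and threshold are assumptions (the notes only specify the -1.0 penalty):

```python
import numpy as np

def collision_penalties(agent_pos: np.ndarray, radius: float = 0.15, penalty: float = -1.0) -> np.ndarray:
    """Per-agent penalty of -1.0 for each other agent closer than 2 * radius."""
    dists = np.linalg.norm(agent_pos[:, None, :] - agent_pos[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                 # ignore self-distance
    return penalty * (dists < 2 * radius).sum(axis=1)
```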
Final Phase 1 Config
- Global Min Distance shaping
- One-time coverage reward (+10)
- Collision penalty (-1.0)
- No dense hold reward
- Agent identity enabled
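Summarized as a config dict (the keys are hypothetical; the values come from the notes above):

```python
PHASE1_REWARD_CONFIG = {
    "global_min_distance_shaping": True,   # shared -sum(min distances) / N term
    "coverage_bonus": 10.0,                # one-time reward per newly covered landmark
    "collision_penalty": -1.0,
    "hold_reward": 0.0,                    # dense hold reward removed
    "agent_identity": True,                # append agent_id / (N - 1) to obs (dim 4 -> 5)
}
```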
Result: Phase 1 solved. Return ~326, collisions ~0.6 per episode, clean landmark splits between the agents.
Lesson: High returns with low success means your reward is lying to you.