MARL Cooperative Navigation: Gridworld to MAPPO
Project Kickoff
I started this project with a deliberately simple goal: build a minimal multi-agent RL environment end-to-end, before worrying about realism or scale.
The initial environment was a classic gridworld (a minimal code sketch follows the list):
- Discrete cells
- Hard walls (no wrapping)
- No overlapping agents
- State broadcast over WebSockets to Unreal Engine
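For reference, this is roughly what that step logic implies. A minimal sketch, assuming NumPy; all names here are illustrative, not the project's actual code:

```python
import numpy as np

class GridWorld:
    """Minimal multi-agent gridworld: discrete cells, hard walls, no overlap."""

    # Action 0 = stay; 1-4 move one cell along an axis.
    MOVES = {0: (0, 0), 1: (0, 1), 2: (0, -1), 3: (1, 0), 4: (-1, 0)}

    def __init__(self, size=20, n_agents=10, seed=0):
        self.size = size
        self.rng = np.random.default_rng(seed)
        # Sample distinct starting cells so agents never overlap.
        cells = self.rng.choice(size * size, n_agents, replace=False)
        self.pos = np.stack(np.divmod(cells, size), axis=1)  # (n_agents, 2)

    def step(self, actions):
        for i, a in enumerate(actions):
            target = self.pos[i] + self.MOVES[a]
            # Hard walls: reject moves off the grid (no wrapping).
            if not ((0 <= target) & (target < self.size)).all():
                continue
            # No overlapping agents: reject moves into an occupied cell.
            if any((target == p).all() for j, p in enumerate(self.pos) if j != i):
                continue
            self.pos[i] = target
        return self.pos.copy()
```

Rejecting invalid moves outright, rather than resolving conflicts, keeps the dynamics as simple as possible, which was the whole point at this stage.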
The intent was to learn:
- How to structure RL environments cleanly
- How to decouple simulation from visualization
- How Unreal behaves when driven by an external simulator
Early Decisions
- Start simple: gridworld first, continuous later
- WebSockets for fast iteration (no HTTP yet)
- Episode-based training with explicit resets (see the loop sketch after this list)
- Visual-first debugging (HUD + agent glow)
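"Episode-based training with explicit resets" means the driver loop owns episode boundaries. A minimal sketch, assuming a generic env/policy API:

```python
def run_episode(env, policy, max_steps=200):
    """One episode with an explicit reset. The env/policy API is hypothetical."""
    obs = env.reset()                 # explicit reset at every episode boundary
    episode_return = 0.0
    for _ in range(max_steps):
        actions = policy(obs)
        obs, reward, done = env.step(actions)
        episode_return += reward
        if done:                      # terminate on success, then reset next episode
            break
    return episode_return
```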
By the end of Day 1:
- 10 agents moved randomly at 10 Hz
- State streamed into Unreal Engine
- Agents rendered as glowing spheres on a dark plane
- Full Python → WebSocket → UE5 pipeline working (sketched below)
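The streaming half of that pipeline is small. A sketch under stated assumptions: the `websockets` library (v10+, one-argument handler), a made-up port, and agents that just random-walk; the Unreal client connects and parses the JSON:

```python
import asyncio
import json

import numpy as np
import websockets

rng = np.random.default_rng(0)
positions = rng.integers(0, 20, size=(10, 2))  # 10 agents on a 20x20 grid

async def stream(websocket):
    """Push the full agent state to a connected client at 10 Hz."""
    global positions
    while True:
        # Random walk, clamped to the grid (hard walls, no wrapping).
        step = rng.integers(-1, 2, size=positions.shape)
        positions = np.clip(positions + step, 0, 19)
        await websocket.send(json.dumps({"agents": positions.tolist()}))
        await asyncio.sleep(0.1)  # 10 Hz tick

async def main():
    async with websockets.serve(stream, "localhost", 8765):
        await asyncio.Future()  # serve forever

if __name__ == "__main__":
    asyncio.run(main())
```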
This completed Part 1: Infrastructure.
Part 2: Learning (Go-To-Goal)
The first learning task was intentionally trivial (reward logic sketched after the list):
- Each agent has its own goal
- Reward = −1 per step, +100 on reaching goal
- Agent freezes once successful
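In code, that scheme comes down to a few lines. A sketch with illustrative names, assuming NumPy arrays for positions:

```python
import numpy as np

def goto_goal_rewards(pos, goals, reached):
    """Per-agent reward: -1 per step until the goal, +100 on arrival.

    `reached` marks agents that already succeeded; they freeze and earn 0.
    """
    rewards = np.zeros(len(pos))
    for i in range(len(pos)):
        if reached[i]:
            continue                      # frozen agent: no further reward
        if (pos[i] == goals[i]).all():
            rewards[i] = 100.0            # goal bonus, then freeze
            reached[i] = True
        else:
            rewards[i] = -1.0             # time penalty rewards short paths
    return rewards, reached
```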
This validated:
- Episode logic
- Reward delivery
- HUD metrics
- Recording + playback for demos
Agents learned successfully. The system was ready for something harder.
Pivot: Cooperative Navigation (MARL)
On Jan 17, I pivoted from single-agent goals to Cooperative Navigation, the canonical MARL benchmark (MPE-style):
N agents must cover N landmarks while avoiding collisions. No communication. No roles. Global reward. Local observations.
This task looks trivial. It is not.
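For concreteness, this is the MPE-style global reward the benchmark is usually defined with; the exact shaping in my env.py may differ. Every agent receives the same scalar, so nothing in the signal says who earned it:

```python
import numpy as np

def coop_nav_reward(agent_pos, landmark_pos, collide_dist=0.1):
    """MPE-style shared reward: cover every landmark, avoid collisions."""
    # Agent-landmark distances: shape (n_landmarks, n_agents).
    dists = np.linalg.norm(landmark_pos[:, None, :] - agent_pos[None, :, :], axis=-1)
    coverage_term = -dists.min(axis=1).sum()  # pull the nearest agent onto each landmark

    # Collision penalty: -1 per agent pair closer than the collision radius.
    pair_dists = np.linalg.norm(agent_pos[:, None, :] - agent_pos[None, :, :], axis=-1)
    n_colliding = (pair_dists < collide_dist).sum() - len(agent_pos)  # drop self-pairs
    return coverage_term - 1.0 * (n_colliding // 2)  # each pair is counted twice
```

That shared scalar is exactly where the credit assignment problem comes from: an agent parked on a landmark and an agent drifting in a corner receive identical feedback.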
Why This Task
Cooperative Navigation exposes:
- Symmetry breaking failures
- Credit assignment problems
- Reward hacking
- Coordination deadlocks
If MAPPO can’t solve this cleanly, nothing downstream will.
Key Design Decisions
- MAPPO from the start (no tabular Q-learning)
- Same algorithm for all phases, difficulty controlled by config
- Local-relative observations only
- Centralized critic (CTDE; see the sketch after this list)
- Simple coverage definition (proximity only)
- Continuous positions, discrete actions
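A minimal sketch of the CTDE split, assuming PyTorch; dimensions, hidden sizes, and the three-agent usage are illustrative, and the actual MAPPO update (GAE, clipped surrogate loss) is omitted:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decentralized actor: sees only one agent's local-relative observation."""

    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),  # logits over discrete actions
        )

    def forward(self, local_obs):
        return torch.distributions.Categorical(logits=self.net(local_obs))

class CentralCritic(nn.Module):
    """Centralized critic: sees all agents' observations, training-time only."""

    def __init__(self, obs_dim, n_agents, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim * n_agents, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),  # value of the joint state
        )

    def forward(self, all_obs):  # all_obs: (batch, n_agents, obs_dim)
        return self.net(all_obs.flatten(1)).squeeze(-1)

# Decentralized execution: one shared policy, each agent acts from its own obs.
actor = Actor(obs_dim=8, n_actions=5)
critic = CentralCritic(obs_dim=8, n_agents=3)
local_obs = torch.randn(3, 8)              # one local observation per agent
actions = actor(local_obs).sample()        # per-agent discrete actions
value = critic(local_obs.unsqueeze(0))     # centralized value, used only in training
```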
Implementation Milestones
By the end of Jan 17, the coop_nav/ module was complete:
- env.py: continuous cooperative navigation environment
- mappo.py: MAPPO implementation
- train.py: CLI training loop
- server.py: visualization server
- Logging + event system
Unreal updated:
- Agents glow when covering
- Landmarks pulse when covered
- Collisions flash red
- HUD shows coverage, success rate, collisions
- Frame recording & replay added
- Geometry aligned exactly between physics and Unreal (mapping sketched below)
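The geometry alignment reduces to one transform applied consistently on the Unreal side. A sketch; the [-1, 1] arena bounds, scale factor, and axis convention are assumptions, not the project's actual values:

```python
# The sim is assumed to live in [-1, 1]^2; Unreal works in centimeters.
WORLD_SCALE = 500.0  # 1.0 sim unit -> 500 UE cm, so the arena spans 10 m

def sim_to_unreal(x, y):
    """Map a sim position onto the UE ground plane (Z = 0)."""
    return {"X": x * WORLD_SCALE, "Y": y * WORLD_SCALE, "Z": 0.0}
```

Using the same constant for agent positions, landmark positions, and coverage radii is what keeps the physics and the visualization in lockstep.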
At this point, everything worked except learning.