MARL Cooperative Navigation: Gridworld to MAPPO

January 16, 2026

Project Kickoff

I started this project with a deliberately simple goal: build a minimal multi-agent RL environment end-to-end, before worrying about realism or scale.

The initial environment was a classic gridworld:

  • Discrete cells
  • Hard walls (no wrapping)
  • No overlapping agents
  • State broadcast over WebSockets to Unreal Engine
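The movement rules above fit in a few lines. This is a minimal sketch, not the project's actual env code; the action encoding (0=up, 1=down, 2=left, 3=right, 4=stay) and the function name are assumptions for illustration:

```python
def step_agent(pos, action, size, occupied):
    """Move one agent on a discrete grid with hard walls and no overlap.

    pos:      (x, y) tuple of the agent's current cell
    action:   assumed encoding 0=up, 1=down, 2=left, 3=right, 4=stay
    size:     grid side length
    occupied: set of cells held by other agents
    """
    dx, dy = [(0, -1), (0, 1), (-1, 0), (1, 0), (0, 0)][action]
    nx, ny = pos[0] + dx, pos[1] + dy
    # Hard walls: out-of-bounds moves are rejected, not wrapped.
    if not (0 <= nx < size and 0 <= ny < size):
        return pos
    # No overlapping agents: moves into an occupied cell are rejected.
    if (nx, ny) in occupied:
        return pos
    return (nx, ny)
```

Rejecting illegal moves (rather than clamping or wrapping) keeps the transition function trivially auditable, which matters more than elegance at this stage.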

The intent was to learn:

  • How to structure RL environments cleanly
  • How to decouple simulation from visualization
  • How Unreal behaves when driven by an external simulator

Early Decisions

  • Start simple: gridworld first, continuous later
  • WebSockets for fast iteration (no HTTP yet)
  • Episode-based training with explicit resets
  • Visual-first debugging (HUD + agent glow)

By the end of Day 1:

  • 10 agents moved randomly at 10 Hz
  • State streamed into Unreal Engine
  • Agents rendered as glowing spheres on a dark plane
  • Full Python → WebSocket → UE5 pipeline working
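The Day 1 pipeline can be sketched as below. This assumes the third-party `websockets` package and a JSON frame schema of `{"agents": [{"id": ..., "x": ..., "y": ...}]}` on the Unreal side; the actual server code and frame format may differ:

```python
import asyncio
import json
import random

GRID, N_AGENTS, TICK_HZ = 8, 10, 10

def make_frame(agents):
    """Serialize agent positions into the JSON frame the UE5 client parses
    (assumed schema, sorted by agent id for determinism)."""
    return json.dumps(
        {"agents": [{"id": i, "x": x, "y": y} for i, (x, y) in sorted(agents.items())]}
    )

async def stream(ws):
    # 10 agents taking random moves at 10 Hz, as in the Day 1 milestone.
    agents = {i: (random.randrange(GRID), random.randrange(GRID)) for i in range(N_AGENTS)}
    while True:
        agents = {
            i: (
                min(GRID - 1, max(0, x + random.choice((-1, 0, 1)))),
                min(GRID - 1, max(0, y + random.choice((-1, 0, 1)))),
            )
            for i, (x, y) in agents.items()
        }
        await ws.send(make_frame(agents))
        await asyncio.sleep(1 / TICK_HZ)

async def main():
    import websockets  # pip install websockets
    async with websockets.serve(stream, "localhost", 8765):
        await asyncio.Future()  # serve forever
```

Pushing state at a fixed tick rate keeps the simulator authoritative; Unreal is a pure renderer and never writes back into the physics.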

This completed Part 1: Infrastructure.


Part 2: Learning (Go-To-Goal)

The first learning task was intentionally trivial:

  • Each agent has its own goal
  • Reward = −1 per step, +100 on reaching goal
  • Agent freezes once successful
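The reward logic above is small enough to show whole. A minimal sketch (the function name and return convention are assumptions, not the project's actual API):

```python
def goto_reward(agent_pos, goal_pos, frozen):
    """Go-to-goal reward: -1 per step, +100 on reaching the goal.

    Returns (reward, frozen). A frozen agent has already succeeded:
    it receives 0 and takes no further part in the episode.
    """
    if frozen:
        return 0.0, True
    if agent_pos == goal_pos:
        return 100.0, True
    return -1.0, False
```

The -1 step cost gives agents a reason to move at all; the freeze flag keeps successful agents from accumulating penalties while others finish.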

This validated:

  • Episode logic
  • Reward delivery
  • HUD metrics
  • Recording + playback for demos

Agents learned successfully. The system was ready for something harder.


Pivot: Cooperative Navigation (MARL)

On Jan 17, I pivoted from single-agent goals to Cooperative Navigation, the canonical MARL benchmark (MPE-style):

N agents must cover N landmarks while avoiding collisions. No communication. No roles. Global reward. Local observations.

This task looks trivial. It is not.
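The shared reward is where the difficulty hides. A sketch of the MPE-style global signal, assuming the usual shaping (negative sum of each landmark's distance to its nearest agent, minus a per-pair collision penalty); the radii and penalty weight here are illustrative, not the project's tuned values:

```python
import math

def coop_nav_reward(agent_pos, landmark_pos,
                    cover_radius=0.5, collide_radius=0.25, collision_penalty=1.0):
    """Global reward for cooperative navigation (sketch).

    Every agent receives the same scalar: no landmark assignment, no roles.
    Returns (reward, n_covered) where coverage is proximity-only.
    """
    dist = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
    # Each landmark counts its nearest agent; dense shaping toward full coverage.
    reward = -sum(min(dist(a, lm) for a in agent_pos) for lm in landmark_pos)
    # Penalize every colliding agent pair.
    for i in range(len(agent_pos)):
        for j in range(i + 1, len(agent_pos)):
            if dist(agent_pos[i], agent_pos[j]) < collide_radius:
                reward -= collision_penalty
    n_covered = sum(
        1 for lm in landmark_pos if min(dist(a, lm) for a in agent_pos) < cover_radius
    )
    return reward, n_covered
```

Because the scalar is shared, two agents racing to the same landmark look locally fine while the team reward stalls: this is exactly the symmetry-breaking and credit-assignment trap described below.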


Why This Task

Cooperative Navigation exposes:

  • Symmetry breaking failures
  • Credit assignment problems
  • Reward hacking
  • Coordination deadlocks

If MAPPO can’t solve this cleanly, nothing downstream will.


Key Design Decisions

  • MAPPO from the start (no tabular Q-learning)
  • Same algorithm for all phases, difficulty controlled by config
  • Local-relative observations only
  • Centralized critic (CTDE)
  • Simple coverage definition (proximity only)
  • Continuous positions, discrete actions
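The "local-relative observations only" decision can be sketched as an observation builder. The layout here (own velocity, then landmark offsets, then other-agent offsets) mirrors the MPE convention but is an assumption about this project's exact ordering:

```python
def local_observation(agent_idx, agent_pos, agent_vel, landmark_pos):
    """Build one agent's observation in its own frame (sketch).

    Everything is relative to the agent's position: no absolute
    coordinates leak into the policy input.
    """
    ax, ay = agent_pos[agent_idx]
    obs = list(agent_vel[agent_idx])            # own velocity
    for lx, ly in landmark_pos:                  # landmark offsets
        obs += [lx - ax, ly - ay]
    for j, (ox, oy) in enumerate(agent_pos):     # other-agent offsets
        if j != agent_idx:
            obs += [ox - ax, oy - ay]
    return obs
```

Under CTDE, each actor sees only its own `local_observation`, while the centralized critic is trained on the concatenation across all agents; at execution time the critic is dropped and the policy runs on local input alone.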

Implementation Milestones

By end of Jan 17:

coop_nav/ module complete:

  • env.py: continuous cooperative navigation environment
  • mappo.py: MAPPO implementation
  • train.py: CLI training loop
  • server.py: visualization server
  • Logging + event system

Unreal updated:

  • Agents glow when covering
  • Landmarks pulse when covered
  • Collisions flash red
  • HUD shows coverage, success rate, collisions
  • Frame recording & replay added
  • Geometry aligned exactly between physics and Unreal

At this point, everything worked except learning.