MARL Cooperative Navigation: Gridworld to MAPPO

January 16, 2026

Project Kickoff

I started this project with a deliberately simple goal: build a minimal multi-agent RL environment end-to-end, before worrying about realism or scale.

The initial environment was a classic gridworld:

  • Discrete cells
  • Hard walls (no wrapping)
  • No overlapping agents
  • State broadcast over WebSockets to Unreal Engine
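The movement rules above fit in a few lines. This is a minimal sketch, not the project's actual env code; the action encoding (0=up, 1=down, 2=left, 3=right, 4=stay) and the function name are assumptions for illustration:

```python
def step_agent(pos, action, size, occupied):
    """Move one agent on a discrete grid with hard walls and no overlap.

    pos:      (x, y) tuple of the agent's current cell
    action:   assumed encoding 0=up, 1=down, 2=left, 3=right, 4=stay
    size:     grid side length
    occupied: set of cells held by other agents
    """
    dx, dy = [(0, -1), (0, 1), (-1, 0), (1, 0), (0, 0)][action]
    nx, ny = pos[0] + dx, pos[1] + dy
    # Hard walls: out-of-bounds moves are rejected, not wrapped.
    if not (0 <= nx < size and 0 <= ny < size):
        return pos
    # No overlapping agents: moves into an occupied cell are rejected.
    if (nx, ny) in occupied:
        return pos
    return (nx, ny)
```

Rejecting illegal moves (rather than clamping or wrapping) keeps the transition function trivially auditable, which matters more than elegance at this stage.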

The intent was to learn:

  • How to structure RL environments cleanly
  • How to decouple simulation from visualization
  • How Unreal behaves when driven by an external simulator

Early Decisions

  • Start simple: gridworld first, continuous later
  • WebSockets for fast iteration (no HTTP yet)
  • Episode-based training with explicit resets
  • Visual-first debugging (HUD + agent glow)

By the end of Day 1:

  • 10 agents moved randomly at 10 Hz
  • State streamed into Unreal Engine
  • Agents rendered as glowing spheres on a dark plane
  • Full Python → WebSocket → UE5 pipeline working
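The Day 1 pipeline can be sketched as below. This assumes the third-party `websockets` package and a JSON frame schema of `{"agents": [{"id": ..., "x": ..., "y": ...}]}` on the Unreal side; the actual server code and frame format may differ:

```python
import asyncio
import json
import random

GRID, N_AGENTS, TICK_HZ = 8, 10, 10

def make_frame(agents):
    """Serialize agent positions into the JSON frame the UE5 client parses
    (assumed schema, sorted by agent id for determinism)."""
    return json.dumps(
        {"agents": [{"id": i, "x": x, "y": y} for i, (x, y) in sorted(agents.items())]}
    )

async def stream(ws):
    # 10 agents taking random moves at 10 Hz, as in the Day 1 milestone.
    agents = {i: (random.randrange(GRID), random.randrange(GRID)) for i in range(N_AGENTS)}
    while True:
        agents = {
            i: (
                min(GRID - 1, max(0, x + random.choice((-1, 0, 1)))),
                min(GRID - 1, max(0, y + random.choice((-1, 0, 1)))),
            )
            for i, (x, y) in agents.items()
        }
        await ws.send(make_frame(agents))
        await asyncio.sleep(1 / TICK_HZ)

async def main():
    import websockets  # pip install websockets
    async with websockets.serve(stream, "localhost", 8765):
        await asyncio.Future()  # serve forever
```

Pushing state at a fixed tick rate keeps the simulator authoritative; Unreal is a pure renderer and never writes back into the physics.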

This completed Part 1: Infrastructure.


Part 2: Learning (Go-To-Goal)

The first learning task was intentionally trivial:

  • Each agent has its own goal
  • Reward = −1 per step, +100 on reaching goal
  • Agent freezes once successful
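The reward logic above is small enough to show whole. A minimal sketch (the function name and return convention are assumptions, not the project's actual API):

```python
def goto_reward(agent_pos, goal_pos, frozen):
    """Go-to-goal reward: -1 per step, +100 on reaching the goal.

    Returns (reward, frozen). A frozen agent has already succeeded:
    it receives 0 and takes no further part in the episode.
    """
    if frozen:
        return 0.0, True
    if agent_pos == goal_pos:
        return 100.0, True
    return -1.0, False
```

The -1 step cost gives agents a reason to move at all; the freeze flag keeps successful agents from accumulating penalties while others finish.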

This validated:

  • Episode logic
  • Reward delivery
  • HUD metrics
  • Recording + playback for demos

Agents learned successfully. The system was ready for something harder.


Pivot: Cooperative Navigation (MARL)

On Jan 17, I pivoted from single-agent goals to Cooperative Navigation, the canonical MARL benchmark (MPE-style):

N agents must cover N landmarks while avoiding collisions. No communication. No roles. Global reward. Local observations.

This task looks trivial. It is not.
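The shared reward is where the difficulty hides. A sketch of the MPE-style global signal, assuming the usual shaping (negative sum of each landmark's distance to its nearest agent, minus a per-pair collision penalty); the radii and penalty weight here are illustrative, not the project's tuned values:

```python
import math

def coop_nav_reward(agent_pos, landmark_pos,
                    cover_radius=0.5, collide_radius=0.25, collision_penalty=1.0):
    """Global reward for cooperative navigation (sketch).

    Every agent receives the same scalar: no landmark assignment, no roles.
    Returns (reward, n_covered) where coverage is proximity-only.
    """
    dist = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
    # Each landmark counts its nearest agent; dense shaping toward full coverage.
    reward = -sum(min(dist(a, lm) for a in agent_pos) for lm in landmark_pos)
    # Penalize every colliding agent pair.
    for i in range(len(agent_pos)):
        for j in range(i + 1, len(agent_pos)):
            if dist(agent_pos[i], agent_pos[j]) < collide_radius:
                reward -= collision_penalty
    n_covered = sum(
        1 for lm in landmark_pos if min(dist(a, lm) for a in agent_pos) < cover_radius
    )
    return reward, n_covered
```

Because the scalar is shared, two agents racing to the same landmark look locally fine while the team reward stalls: this is exactly the symmetry-breaking and credit-assignment trap described below.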


Why This Task

Cooperative Navigation exposes:

  • Symmetry breaking failures
  • Credit assignment problems
  • Reward hacking
  • Coordination deadlocks

If MAPPO can’t solve this cleanly, nothing downstream will.


Key Design Decisions

  • MAPPO from the start (no tabular Q-learning)
  • Same algorithm for all phases, difficulty controlled by config
  • Local-relative observations only
  • Centralized critic (CTDE)
  • Simple coverage definition (proximity only)
  • Continuous positions, discrete actions
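The "local-relative observations only" decision can be sketched as an observation builder. The layout here (own velocity, then landmark offsets, then other-agent offsets) mirrors the MPE convention but is an assumption about this project's exact ordering:

```python
def local_observation(agent_idx, agent_pos, agent_vel, landmark_pos):
    """Build one agent's observation in its own frame (sketch).

    Everything is relative to the agent's position: no absolute
    coordinates leak into the policy input.
    """
    ax, ay = agent_pos[agent_idx]
    obs = list(agent_vel[agent_idx])            # own velocity
    for lx, ly in landmark_pos:                  # landmark offsets
        obs += [lx - ax, ly - ay]
    for j, (ox, oy) in enumerate(agent_pos):     # other-agent offsets
        if j != agent_idx:
            obs += [ox - ax, oy - ay]
    return obs
```

Under CTDE, each actor sees only its own `local_observation`, while the centralized critic is trained on the concatenation across all agents; at execution time the critic is dropped and the policy runs on local input alone.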

Implementation Milestones

By end of Jan 17:

coop_nav/ module complete:

  • env.py: continuous cooperative navigation environment
  • mappo.py: MAPPO implementation
  • train.py: CLI training loop
  • server.py: visualization server
  • Logging + event system

Unreal updated:

  • Agents glow when covering
  • Landmarks pulse when covered
  • Collisions flash red
  • HUD shows coverage, success rate, collisions
  • Frame recording & replay added
  • Geometry aligned exactly between physics and Unreal

At this point, everything worked except learning.