MARL Cooperative Navigation: Geometry Problem & The Abyss

January 21, 2026

Scaling from 2 agents to 5, then to 2,500.


Phase 3: Three Agents

Challenge: The “middle agent problem.” In dense configs, one agent has no clean path. Waiting costs time. Time costs reward.

Observed strategy: rugby scrum. Shove through traffic, accept collision penalties (-2.0), trigger completion bonus (+150) faster.

Metric       Value
Return       ~400
Collisions   ~8
Success      ~2%
Entropy      0.38

Entropy at 0.38 means overconfident in a bad strategy.
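To put 0.38 in context: assuming a 5-way discrete action space (typical for this kind of navigation task; an assumption, not stated above), the maximum possible entropy is ln 5 ≈ 1.61, so 0.38 is a nearly deterministic policy:

```python
import numpy as np

def policy_entropy(probs):
    """Shannon entropy of a discrete action distribution, in nats."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]  # drop zero-probability actions to avoid log(0)
    return float(-np.sum(p * np.log(p)))

# Uniform over 5 actions: the maximum possible entropy, ~1.609 nats.
print(policy_entropy([0.2] * 5))

# A near-deterministic "shove" policy sits well below that maximum,
# in the same regime as the observed 0.38.
print(policy_entropy([0.9, 0.04, 0.03, 0.02, 0.01]))
```

A policy this peaked keeps picking the same aggressive action even when alternatives are nearly as good, which is exactly "overconfident in a bad strategy."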

Tried forcing politeness (higher collision penalties, hold rewards). Made it worse.

Conclusion: Under this reward structure, aggressive coordination is the optimal strategy. The math favors shoving.
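The math can be checked with back-of-the-envelope arithmetic. The penalty values come from the config above; the step counts and the per-step time value are illustrative assumptions, not measured numbers:

```python
# Illustrative "shove vs. wait" comparison under the Phase 3 rewards.
# Step counts and per-step time value are made-up for the sketch.
COLLISION_PENALTY = -2.0   # from the config above
COMPLETION_BONUS = 150.0   # from the config above

def shove_advantage(collisions, steps_saved, value_per_step=1.0):
    """Net effect of shoving: pay collision penalties, save time steps."""
    return collisions * COLLISION_PENALTY + steps_saved * value_per_step

# Eight collisions cost 16 reward; if shoving reaches the completion
# bonus even ~20 steps sooner, the penalties are repaid with change.
print(shove_advantage(collisions=8, steps_saved=20))  # 4.0
```

Raising the collision penalty just moves the break-even point; as long as time-to-bonus dominates, the scrum stays optimal.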


Phase 4: Five Agents (“The Pentagon”)

Setup:

  • 5 agents, 5 landmarks, 1×1 world
  • Collision penalty: -2.0
  • Hold reward: 0.1
  • All-covered bonus: 200
  • hidden_dim: 128
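Collected as a config dict, the setup above looks like this (key names are illustrative; they mirror the bullet values, not the actual code):

```python
# Phase 4 ("Pentagon") configuration, transcribed from the setup list.
# Key names are assumptions for the sketch, values are from the post.
PHASE4_CONFIG = {
    "n_agents": 5,
    "n_landmarks": 5,
    "world_size": (1.0, 1.0),      # 1x1 world
    "collision_penalty": -2.0,
    "hold_reward": 0.1,
    "all_covered_bonus": 200.0,
    "hidden_dim": 128,
}
```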

Result: Decisive win.

Metric       Value
Return       509+
Entropy      0.13
Collisions   ~35

Emergent behavior: Constant rotation. Dynamic swapping. No static roles. Congestion managed through movement, not avoidance.

Milestone: Trained 5-agent policy ready for deployment.


The Abyss

Goal: Visualize thousands of agents without retraining.

Solution: abyss.py

  • 500 parallel environments
  • 2,500 total agents (5 per env)
  • Single shared policy, batched inference
  • Grid-offset rendering in Unreal
  • Single WebSocket stream
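The batched-inference idea is the core trick: because all 2,500 agents share one policy, their observations can be flattened into a single batch and run through one forward pass. A minimal numpy sketch (observation size and the linear "policy" are placeholders for the trained network):

```python
import numpy as np

N_ENVS, AGENTS_PER_ENV, OBS_DIM, N_ACTIONS = 500, 5, 18, 5  # OBS_DIM assumed

rng = np.random.default_rng(0)
W = rng.standard_normal((OBS_DIM, N_ACTIONS))  # stand-in for the trained net

def batched_act(obs):
    """obs: (n_envs, agents_per_env, obs_dim) -> actions (n_envs, agents_per_env).

    Flatten every agent across every environment into one batch, do a
    single forward pass, then reshape the actions back per-environment.
    """
    flat = obs.reshape(-1, OBS_DIM)      # (n_envs * agents, obs_dim)
    logits = flat @ W                    # one matmul covers all 2,500 agents
    actions = logits.argmax(axis=-1)     # greedy selection for the sketch
    return actions.reshape(obs.shape[0], obs.shape[1])

obs = rng.standard_normal((N_ENVS, AGENTS_PER_ENV, OBS_DIM))
print(batched_act(obs).shape)  # (500, 5)
```

One batched call per tick is what makes 500 environments cheap; stepping each environment's agents separately would mean 2,500 small forward passes instead of one large one.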

Bug 1: Playback not rendering agents

Cause: Action type mismatch (string vs int in recorded frames).

Fix: Type guard before conversion.
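The guard amounts to normalizing the action's type before it is used as an index. A minimal version (function name assumed):

```python
def normalize_action(raw):
    """Recorded frames stored actions as strings in some runs and ints in
    others; coerce both to int before indexing into the action space."""
    if isinstance(raw, str):
        return int(raw)
    if isinstance(raw, (int, float)):
        return int(raw)
    raise TypeError(f"unsupported action type: {type(raw).__name__}")

print(normalize_action("3"), normalize_action(3))  # 3 3
```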

Bug 2: Playback thread exiting immediately

Cause: server.running = False at start.

Fix: Set flag before thread start.
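This is a classic order-of-operations race: the worker thread checks the flag before the main thread sets it, so the loop body never runs. A sketch of the fixed ordering (class and method names assumed):

```python
import threading
import time

class PlaybackServer:
    def __init__(self):
        self.running = False

    def _playback_loop(self):
        # If self.running is still False when this runs, the while loop
        # exits immediately -- exactly the Bug 2 symptom.
        while self.running:
            time.sleep(0.01)  # stream one frame per tick (placeholder)

    def start(self):
        self.running = True  # set the flag BEFORE starting the thread
        t = threading.Thread(target=self._playback_loop, daemon=True)
        t.start()
        return t

server = PlaybackServer()
t = server.start()
print(t.is_alive())        # thread stays alive while the flag is set
server.running = False     # clearing the flag lets the loop exit
```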

Bug 3: Agent jitter

The “vibrating atom” effect. Agents twitching in place instead of moving smoothly.

Cause: Policy queried every frame. Near-equal probabilities caused action flip-flopping.

Fix: Action smoothing.

action_repeat = 8  # Query network every 8 frames, cache between

Result: Smooth, organic motion. Agents glide instead of twitch.
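The smoothing fix can be sketched as a thin wrapper that only queries the network every `action_repeat` frames and replays the cached action in between (names are illustrative, not the actual abyss.py code):

```python
class SmoothedPolicy:
    """Action repeat: hold each chosen action for several frames.

    Querying the policy every frame flip-flops between near-equal-
    probability actions ("vibrating atom"); caching the choice for
    action_repeat frames trades a little reactivity for smooth motion.
    """
    def __init__(self, policy_fn, action_repeat=8):
        self.policy_fn = policy_fn
        self.action_repeat = action_repeat
        self._cached = None
        self._frames_left = 0

    def act(self, obs):
        if self._frames_left == 0:
            self._cached = self.policy_fn(obs)  # fresh network query
            self._frames_left = self.action_repeat
        self._frames_left -= 1
        return self._cached

calls = 0
def fake_policy(obs):          # stand-in for the real network
    global calls
    calls += 1
    return calls % 5

smoothed = SmoothedPolicy(fake_policy, action_repeat=8)
actions = [smoothed.act(None) for _ in range(16)]
print(calls)  # 2 -- only two network queries for 16 frames
```

A side benefit at Abyss scale: with `action_repeat = 8`, the batched forward pass for 2,500 agents runs an eighth as often.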


Abyss Mode in Unreal

Unreal updates:

  • Platform spawning per environment (individual ground planes)
  • Abyss mode detection ("abyss": true flag)
  • Persistent actors (skip per-episode clearing)
  • New HUD: “THE ABYSS” title, environment count, total agents, global coverage

Scale Testing

Envs     Agents   Status
100      500      Works
500      2,500    Works
1,000    5,000    Failed

1,000 environments hit JSON parsing and actor limits. 500 is the sweet spot.


Project Summary

Phase   Agents   Return   Collisions   Time
1       2        326      0.6          ~1h
3       3        400      8.0          ~2h
4       5        509      35           ~3h

Total training: ~6 hours on M4 MacBook.