MARL Cooperative Navigation: Geometry Problem & The Abyss
Scaling from 2 agents to 5, then to 2,500.
Phase 3: Three Agents
Challenge: The “middle agent problem.” In dense configs, one agent has no clean path. Waiting costs time. Time costs reward.
Observed strategy: rugby scrum. Shove through traffic, accept collision penalties (-2.0), trigger completion bonus (+150) faster.
| Metric | Value |
|---|---|
| Return | ~400 |
| Collisions | ~8 |
| Success | ~2% |
| Entropy | 0.38 |
Entropy at 0.38 means the policy is nearly deterministic: overconfident in a bad strategy.
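For scale (assuming the usual 5-action navigation set, which isn't stated above): a uniform policy sits at ln 5 ≈ 1.61 nats, so 0.38 corresponds to roughly 90% of the probability mass on a single action. A quick check:

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    """Shannon entropy (in nats) of a categorical distribution."""
    return float(-np.sum(p * np.log(p)))

print(entropy(np.full(5, 0.2)))                           # ~1.609: uniform over 5 actions
print(entropy(np.array([0.92, 0.02, 0.02, 0.02, 0.02])))  # ~0.39: matches the observed 0.38
```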
Tried forcing politeness (higher collision penalties, hold rewards). Made it worse.
Conclusion: Aggressive coordination is the optimal strategy under this reward structure. Eight collisions cost only 8 × -2.0 = -16, easily repaid by reaching the +150 completion bonus sooner. The math favors shoving.
Phase 4: Five Agents (“The Pentagon”)
Setup:
- 5 agents, 5 landmarks, 1×1 world
- Collision penalty: -2.0
- Hold reward: 0.1
- All-covered bonus: 200
- hidden_dim: 128
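The same setup as a config dict, purely a sketch: the key names are my assumptions, the values are from the list above.

```python
# Hypothetical config keys; values match the Phase 4 setup.
PENTAGON_CONFIG = {
    "n_agents": 5,
    "n_landmarks": 5,
    "world_size": 1.0,           # 1×1 world
    "collision_penalty": -2.0,
    "hold_reward": 0.1,
    "all_covered_bonus": 200,
    "hidden_dim": 128,
}
```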
Result: Decisive win.
| Metric | Value |
|---|---|
| Return | 509+ |
| Entropy | 0.13 |
| Collisions | ~35 |
Emergent behavior: Constant rotation. Dynamic swapping. No static roles. Congestion managed through movement, not avoidance.
Milestone: Trained 5-agent policy ready for deployment.
The Abyss
Goal: Visualize thousands of agents without retraining.
Solution: abyss.py
- 500 parallel environments
- 2,500 total agents (5 per env)
- Single shared policy, batched inference (sketch below)
- Grid-offset rendering in Unreal
- Single WebSocket stream
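A minimal sketch of the batched-inference step, with a small MLP standing in for the trained policy; the 18-dim observation and 5-action head are assumptions:

```python
import torch

N_ENVS, AGENTS_PER_ENV, OBS_DIM, N_ACTIONS = 500, 5, 18, 5

policy = torch.nn.Sequential(        # stand-in for the shared 5-agent policy
    torch.nn.Linear(OBS_DIM, 128),   # hidden_dim: 128, as trained in Phase 4
    torch.nn.Tanh(),
    torch.nn.Linear(128, N_ACTIONS),
)

obs = torch.zeros(N_ENVS * AGENTS_PER_ENV, OBS_DIM)  # all 2,500 agents in one batch
with torch.no_grad():
    logits = policy(obs)                             # one forward pass for everyone
actions = logits.argmax(dim=-1).view(N_ENVS, AGENTS_PER_ENV)
```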
Bug 1: Playback not rendering agents
Cause: Action type mismatch (string vs int in recorded frames).
Fix: Type guard before conversion.
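The guard itself is small; something like this, with hypothetical field names:

```python
def decode_action(raw) -> int:
    # Old recordings stored actions as strings ("3"); newer ones as ints.
    if isinstance(raw, str):
        raw = int(raw)  # type guard before conversion/use
    return raw

frame = {"action": "3"}                  # a recorded frame (shape is hypothetical)
action = decode_action(frame["action"])  # -> 3, renders correctly either way
```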
Bug 2: Playback thread exiting immediately
Cause: server.running = False at start.
Fix: Set flag before thread start.
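A sketch of the ordering fix; the Server shape and loop body are assumptions, only the running flag is from the actual bug:

```python
import threading
import time

class Server:
    running = False

server = Server()

def playback_loop():
    while server.running:       # exits on the first check if the flag is still False
        time.sleep(1 / 30)      # ... emit one playback frame per tick ...

server.running = True           # the fix: raise the flag BEFORE starting the thread
thread = threading.Thread(target=playback_loop, daemon=True)
thread.start()
```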
Bug 3: Agent jitter
The “vibrating atom” effect. Agents twitching in place instead of moving smoothly.
Cause: Policy queried every frame. Near-equal probabilities caused action flip-flopping.
Fix: Action smoothing.
```python
action_repeat = 8  # query the network every 8 frames, cache the action in between
```
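In context, the caching loop looks roughly like this; policy_act, step, and the frame counts are hypothetical stand-ins:

```python
import random

def policy_act(obs):
    """Stand-in for batched policy inference (one action per agent)."""
    return [random.randrange(5) for _ in obs]

def step(actions):
    """Stand-in for advancing/rendering one frame."""
    pass

obs = [None] * 2500               # placeholder observations for 2,500 agents
ACTION_REPEAT = 8
cached = None
for frame_idx in range(240):
    if frame_idx % ACTION_REPEAT == 0:
        cached = policy_act(obs)  # fresh network query only every 8th frame
    step(cached)                  # in-between frames reuse cached actions: no flip-flop
```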
Result: Smooth, organic motion. Agents glide instead of twitch.
Abyss Mode in Unreal
Unreal updates:
- Platform spawning per environment (individual ground planes)
- Abyss mode detection ("abyss": true flag; message sketch below)
- Persistent actors (skip per-episode clearing)
- New HUD: “THE ABYSS” title, environment count, total agents, global coverage
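The flag rides on the existing stream. A sketch of the payload, where every field except "abyss" is a guess at the schema:

```python
import json

# Hypothetical frame payload; only the "abyss": true flag is from the log.
frame = {
    "abyss": True,          # Unreal switches to abyss mode when it sees this
    "env_count": 500,
    "agents": [             # flat list across all environments
        {"env": 0, "id": 0, "x": 0.31, "y": 0.74},
        # ... 2,499 more ...
    ],
}
message = json.dumps(frame)  # sent over the single WebSocket stream
```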
Scale Testing
| Envs | Agents | Status |
|---|---|---|
| 100 | 500 | Works |
| 500 | 2,500 | Works |
| 1,000 | 5,000 | Failed |
1,000 environments hit JSON parsing and actor limits. 500 is the sweet spot.
Project Summary
| Phase | Agents | Return | Collisions | Train time |
|---|---|---|---|---|
| 1 | 2 | 326 | 0.6 | ~1h |
| 3 | 3 | 400 | 8.0 | ~2h |
| 4 | 5 | 509 | 35 | ~3h |
Total training: ~6 hours on M4 MacBook.