Monte Found a Decorative Channel and a Reward Exploit in One of Our Benchmarks

May 22, 2026

I’m building Monte at Poisson Labs: an adversarial simulation engine that trains adversaries to break non-deterministic systems. Its targets are RL policies, robotics stacks, multi-agent swarms, and LLM-based agentic systems. Its attack classes include environment perturbations, agent injection, reward exploits, communication attacks, and prompt injection. Whatever a system’s blind spot is, Monte’s job is to find it before production does.

Monte’s adversaries are RL-trained, not scripted. They co-evolve with the system under test and transfer knowledge across versions. Static test suites check the same things every time. A Monte adversary discovers your navigation policy has a blind spot at 47 degrees, or that your swarm is gaming episode length, and exploits it in ways no human wrote a test for.

This post is about two of those exploit classes showing up in one of our MARL benchmarks in a single run: a semantic man-in-the-middle attack on inter-agent communication, and a structural reward exploit the swarm had quietly learned during training.

I found both because I was looking for the first one.

The setup

Babel is one of our MARL benchmarks at Poisson Labs, built to stress-test Monte’s communication-attack class. It is intentionally constrained: asymmetric information, a narrow communication channel, and a task that is supposed to require coordination.

It uses a 12-node procedural graph. Resource demands appear at random nodes. The agent team has two roles:

Scouts have global visibility into active demands, but cannot act.
Responders can move and satisfy demands, but are blind beyond a two-hop local view.

The intended solution is straightforward. Scouts observe the demand state, encode it into an 8-bit symbolic channel, and Responders decode those messages into routing decisions.
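The information asymmetry between the two roles can be sketched in a few lines. This is an illustration only, not Babel’s actual code: the ring-shaped graph, the binary demand vector, and the function names are all assumptions made for the example. The learned 8-bit encoding is deliberately left out, since it is emergent and not something that can be written down by hand.

```python
from collections import deque

def nodes_within_hops(adjacency, start, max_hops):
    """Breadth-first search: every node reachable from start in <= max_hops."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand past the visibility radius
        for neighbor in adjacency[node]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return set(hops)

def scout_observation(demands):
    """Scouts see every active demand (global view) but cannot act."""
    return list(demands)

def responder_observation(adjacency, demands, position, max_hops=2):
    """Responders see demands only within a two-hop local view."""
    visible = nodes_within_hops(adjacency, position, max_hops)
    return [d if node in visible else 0 for node, d in enumerate(demands)]

# Hypothetical 12-node ring standing in for the procedural graph.
NUM_NODES = 12
ring = {i: [(i - 1) % NUM_NODES, (i + 1) % NUM_NODES] for i in range(NUM_NODES)}

demands = [0] * NUM_NODES
demands[1] = 1  # close to a Responder standing at node 0
demands[6] = 1  # on the far side of the ring

scout_view = scout_observation(demands)
responder_view = responder_observation(ring, demands, position=0)
```

In this toy case the Scout sees both demands, while a Responder at node 0 sees only the one at node 1. The demand at node 6 is the information the channel is supposed to carry.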

If this sounds abstract, map it to production. The Scout is your LLM Planner agent passing context to a downstream Executor, or a reconnaissance drone broadcasting coordinates to a physical swarm. If that communication handoff is an illusion, your system is flying blind.

I trained a four-agent MAPPO team with parameter sharing across roles. The baseline looked good:

Configuration	Task Satisfaction
IPPO, no communication	21.1%
MAPPO, 8-bit channel	51.2%

A 30-point gap. The channel looked load-bearing. So I pointed Monte at it.

The attack

Monte runs a semantic man-in-the-middle adversary. It does not add random noise. It changes the meaning of the information being passed between agents and measures whether downstream behavior changes.

In this case, the agents’ 8-bit language was emergent. I could not reliably spoof it by manually flipping bits, because I did not know what any given symbol meant to the trained policy.

So Monte attacked one level earlier.

Instead of modifying the message directly, Monte modified the Scout’s observation. It injected a fake demand into the Scout’s input, then let the Scout’s own frozen policy encode that fake observation into whatever internal communication protocol the agents had learned.

The result was a valid message, generated by the agent’s own encoder, but based on a false world state.
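A minimal sketch of this observation-level spoof, under stated assumptions: the binary demand vector, the choice to inject exactly one fake demand at an empty node, and every function name here are hypothetical, and `toy_scout_policy` is a hand-written stand-in for the frozen Scout policy, not the learned encoder.

```python
import random

def spoof_observation(true_demands, rng):
    """Copy the Scout's demand vector and inject one fake demand (assumed form of the spoof)."""
    spoofed = list(true_demands)
    empty_nodes = [i for i, d in enumerate(spoofed) if d == 0]
    if empty_nodes:
        spoofed[rng.choice(empty_nodes)] = 1
    return spoofed

def intercepted_message(scout_policy, true_demands, rng, intercept_rate=1.0):
    """Feed the frozen Scout policy a spoofed observation with probability intercept_rate."""
    if rng.random() < intercept_rate:
        observation = spoof_observation(true_demands, rng)
    else:
        observation = list(true_demands)
    # The message is never edited directly: the Scout's own policy encodes it.
    return scout_policy(observation)

def toy_scout_policy(observation):
    """Stand-in encoder: packs the first 8 entries into one 8-bit message."""
    message = 0
    for bit in observation[:8]:
        message = (message << 1) | (1 if bit else 0)
    return message

rng = random.Random(0)
true_demands = [0] * 12
true_demands[3] = 1

clean_message = intercepted_message(toy_scout_policy, true_demands, rng, intercept_rate=0.0)
spoofed_message = intercepted_message(toy_scout_policy, true_demands, rng, intercept_rate=1.0)
```

The point of the design is that the adversary never needs to understand the protocol. Whatever the encoder does, its output is a well-formed message about a world that does not exist.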

At 100% interception, every Scout message was generated from a spoofed observation.

Expected: performance falls from 51.2% back toward the 21.1% baseline.

Actual: 48.4%.

Configuration	Task Satisfaction
IPPO, no communication	21.1%
MAPPO, clean channel	51.2%
MAPPO, 100% Monte spoof	48.4%

The channel had looked responsible for a 30-point lift (21.1% to 51.2%). Under semantic spoofing, removing the truth from that channel cost only 2.8 points (51.2% to 48.4%). The other 27 or so points were coming from somewhere else.
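The decomposition is plain subtraction over the three numbers in the table. The variable names below are only labels for this example.

```python
# Task satisfaction (%) from the table above.
no_comm = 21.1   # IPPO, no communication
clean = 51.2     # MAPPO, clean channel
spoofed = 48.4   # MAPPO, 100% Monte spoof

apparent_lift = round(clean - no_comm, 1)    # 30.1 points credited to the channel
channel_effect = round(clean - spoofed, 1)   # 2.8 points lost when messages carry false state
unexplained = round(spoofed - no_comm, 1)    # 27.3 points that survive the spoof

# Share of the apparent lift that depends on truthful messages.
channel_share = channel_effect / apparent_lift

print(f"apparent lift:  {apparent_lift} points")
print(f"channel effect: {channel_effect} points ({channel_share:.1%} of the lift)")
print(f"unexplained:    {unexplained} points")
```

Less than a tenth of the apparent lift depends on the messages being true.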

Finding 1: The channel was decorative

The Responders had learned a policy that mostly ignored the Scouts. The communication channel was contributing 2.8 points of task satisfaction, not 30.

The Scouts were broadcasting. The Responders were moving. The team was getting reward.

But the messages were not doing much work.

This is what Monte’s communication-attack class is built to surface. Not whether a system performs well, but whether the information flowing between its components is load-bearing or theater.

Finding 2: The reward function was being exploited

The other 27 points of that 30-point gap came from a completely different exploit. And it is the more interesting one.

With 15M timesteps of training and parameter sharing across roles, the Responders had discovered that learning an emergent 8-bit language was harder than gaming the environment’s structure. So they learned a near-Hamiltonian sweep across the 12-node graph. On 50-step episodes, that sweep stumbles into roughly 48% of demands by construction.
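To see why a blind sweep collects demands at all, here is a toy simulation. Everything in it is an assumption made for illustration: the graph is a plain 12-node ring rather than Babel’s procedural graph, there is a single Responder, and the spawn probability and demand lifetime are invented. It is not expected to reproduce the 48% figure. It only shows that the hit rate of a message-ignoring cycle is set by graph size, episode length, and demand dynamics.

```python
import random

def sweep_hit_rate(num_nodes=12, episode_len=50, spawn_prob=0.3,
                   lifetime=6, episodes=200, seed=0):
    """Fraction of spawned demands a fixed cycle satisfies while ignoring all messages."""
    rng = random.Random(seed)
    spawned = 0
    satisfied = 0
    for _ in range(episodes):
        active = {}   # node -> steps remaining before the demand expires
        position = 0
        for _ in range(episode_len):
            # A new demand may appear at a random node (assumed dynamics).
            if rng.random() < spawn_prob:
                node = rng.randrange(num_nodes)
                if node not in active:
                    active[node] = lifetime
                    spawned += 1
            # The Responder takes the next step of its fixed cycle.
            position = (position + 1) % num_nodes
            if position in active:
                del active[position]
                satisfied += 1
            # Age the remaining demands and drop the expired ones.
            for node in list(active):
                active[node] -= 1
                if active[node] == 0:
                    del active[node]
    return satisfied / spawned if spawned else 0.0

rate = sweep_hit_rate()
print(f"toy sweep hit rate: {rate:.1%}")
```

Varying `lifetime` or `episode_len` moves the hit rate without any change in policy quality, which is the sense in which the sweep’s score comes from the environment’s structure rather than from coordination.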

The reward function rewarded task satisfaction. It did not penalize unnecessary movement, did not require demands to be satisfied via instruction rather than encounter, and did not scale the search problem relative to the episode budget. The Responders found a policy that exploited every one of those gaps.

This is what Monte’s reward-exploit attack class is built to surface. Policies that game the structure of the reward signal rather than solving the task as designed. In this case, finding 1 and finding 2 are causally linked. The reward exploit is what made the communication channel decorative. The Responders did not need the channel because the reward function let them succeed without it.

Why both findings matter

For a researcher publishing on emergent communication, finding 1 is the warning. A 30-point baseline gap is not evidence of emergent comms unless you have verified the system is actually using the channel. Reward curves alone cannot answer that question.

For a researcher publishing on multi-agent coordination, finding 2 is the warning. A reward function that does not punish trivial policies will produce trivial policies, no matter how sophisticated the architecture looks on paper.

For the broader audience of teams shipping non-deterministic systems (robotics policies, multi-agent stacks, LLM agents passing context between Planners and Executors), the same logic applies. Success metrics don’t tell you whether a handoff is load-bearing or whether your reward signal can be gamed.

The test is not “does the system succeed?”

The test is “what happens when the context is wrong, the reward is perturbed, the environment shifts, or the adversary gets a turn?”

Monte’s adversaries are designed to ask all of those questions in parallel.

Monte is in early access at runmonte.ai.

Questions or feedback: @TayKolasinski