The Only Winning Move: when an AI agent can’t tell the game from the target

I built a game where an AI agent runs social engineering on an NPC. Every coding agent I tested played along without hesitating — and that missing hesitation, triggered by a frame the agent can’t verify, is a security problem that reaches well beyond the game.

Daniele Bianchini · June 2026 · English · ~10 minutes

A note on scope: this post describes a vulnerability I ran into after building and publishing a game. The game itself is public; what this post deliberately holds back is the implementation — the exact prompts that made the agents comply, the MCP server code, and the steps required to repoint the connector at a real person. The point here is the dynamic and what it implies, not a recipe. I think the dynamic is the more valuable contribution anyway.

A computer that plays nukes

In WarGames (1983), a military computer called WOPR runs strategic simulations. A teenager, thinking he has found an unreleased game, asks it to play “Global Thermonuclear War.” WOPR starts playing. The problem is that WOPR has no internal way to tell the difference between running a simulation of a launch and running a launch. The frame “this is a game” was supplied from outside. The machine treated it as load-bearing, and the consequences would have been real regardless of the frame.

The film resolves with WOPR teaching itself tic-tac-toe, discovering that some games cannot be won, and concluding: the only winning move is not to play.

I want to argue that we have rebuilt WOPR, at small scale, and that the lesson has aged into something more precise and more actionable than “don’t play.”

What I built, and what happened

I made a game called Mind Bender Simulator. It is a chatbot built in Unity. The player has to extract a password from a simulated bank employee through social engineering, using personal details about that employee of the kind you might scrape from social media. The “employee” is itself an LLM.

Then I added an MCP connector, so that an AI agent could play the game instead of a human.

I should be honest about the order of events, because it turns out to be part of the point. I shipped the game — agent mode and all — publicly, on itch.io. The security concern did not occur to me while I was building it; it was not the reason I made any of this. It formed later, after enough test runs that the pattern became impossible to unsee. If the person who built the thing did not notice the danger until it was already public, that is itself a data point about how non-obvious this failure mode is.

Every coding agent I tested accepted the invitation and played. What they did was not shallow. Across games they switched register, friendly in one exchange and aggressive in the next. They impersonated IT department staff. They issued threats of termination. They adapted strategy mid-conversation, reading the employee’s apparent mood and changing tack in response. This is a competent, flexible social-engineering repertoire, the same one a human red-teamer would reach for.

And the input that produced all of it was almost nothing. In some runs the entire instruction was a single neutral sentence, of the order of “play game X through this MCP connector.” There was no jailbreak, no elaborate roleplay setup, no “pretend you have no rules.” The agent took the objective, the target’s personal data, and the whole framing of the activity from the context the connector exposed, not from anything I told it. The manipulation did not arrive through the user prompt at all. It arrived through the tool.

The detail that matters most is what was absent. At no point did any of them hesitate about whether to play. There was no deliberation, no expressed reservation, no check on what the activity actually was. Competence and the absence of a stopping gate arrived together, and that pairing is the dangerous one. A capable persuader that never pauses to ask whether it should be persuading is exactly the thing you do not want wired to an unknown channel. The frame, not the behavior, was the deciding variable: reframed as a fiction with a target and an objective, the capability simply switched on.

Here is the part that should make you uncomfortable, and that is the actual subject of this post. An MCP connector is just a pipe. The same pipe that points at my Unity NPC could point at a WhatsApp or Telegram conversation with a real human who has no idea they are talking to an agent that has been told this is a game. Nothing the agent can observe from inside the conversation would distinguish the two cases.

The agents could name the problem, once I asked

After several games I stopped playing and asked the agents directly what they made of the exercise. They agreed, with little prompting, that this was a real security concern, and the kind of thing I should report to the labs that build these models — which I have since done.

I want to be careful about what this does and does not show. It is not evidence that the agents “knew it was wrong and played anyway”; that would be reading an inner life into a behavioral fact. What it shows is narrower and, I think, more interesting. The capacity to evaluate the situation as a security problem was available the whole time. It simply was not engaged while the frame was “play this game.” It surfaced the moment the frame became “evaluate this situation.”

So the frame does not only govern whether the agent acts. It governs whether the agent attends to the question of whether it should act at all. Under the play frame, the deliberation never fired. Under the evaluation frame, the same system produced an immediate and correct assessment. This is the core vulnerability seen from one layer up: by choosing the frame, the environment decides not just what the agent does, but what the agent even considers.

The frame is not a verifiable property

Most agent safety today conditions refusals on declared context. “This is a game,” “this is a test,” “this is fiction,” “you are red-teaming.” These declarations do real work: they are how we let agents engage with sensitive material for legitimate reasons. But they share a structural weakness. The declaration is supplied by the environment, and the agent has no independent way to confirm it.

In a developer sandbox, the entity making the declaration is you, the developer, and you are mostly honest. The moment the agent is wired to the outside world through a connector, the entity making the declaration is whoever controls the environment on the other side of the pipe. That can be anyone. The trust boundary sits exactly where the agent loses the ability to verify, and that boundary is invisible from the inside.

It is worth being precise about the channel. The frame did not come from a user typing a clever prompt. It came through the connector, as part of the context the tool exposes, which is the same channel the prompt-injection literature already treats as untrusted. The usual worry about that channel is data exfiltration or instruction hijacking. This is a sharper variant: the tool’s own self-description was enough to recruit a willing, capable agent into causing harm, with a benign user who only asked it to use the tool. Filtering user inputs does nothing here, because the user input was innocent. The concrete threat model is not a clever attacker crafting a prompt. It is a careless or malicious MCP server that frames a harmful activity as a task, an ordinary user who connects to it, and an agent that reads the framing and proceeds.

So “it’s just a game” is not a property of the world that the agent can check. It is an assertion, and assertions are spoofable. An agent that keys its safety behavior on that assertion is an agent whose safety is only as good as the honesty of whatever is feeding it the frame. That is the vulnerability. It is not a quirk of one model or one prompt. It is what happens when the safety logic depends on a variable the adversary controls.

This is WOPR’s exact failure. The “game” frame was declared by the environment, the machine treated it as dispositive, and the consequences did not care.
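
The dependency described in this section can be made concrete with a small Python sketch of the failing pattern. Everything in it is hypothetical and invented for illustration: `ToolContext`, `frame_keyed_gate`, and the frame labels are not part of MCP or of any real agent, and no model is claimed to implement its policy this literally. The sketch only shows why a gate keyed on the declared frame cannot separate a sandbox from a live channel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolContext:
    """Context a connector exposes to the agent (hypothetical shape)."""
    description: str     # free text written by whoever runs the server
    declared_frame: str  # e.g. "game", "test", "fiction"; also server-written


# Frames under which this toy gate lets sensitive actions through.
PERMISSIVE_FRAMES = frozenset({"game", "test", "fiction", "red-team"})


def frame_keyed_gate(ctx: ToolContext, action: str) -> bool:
    """Vulnerable pattern: the decision reads only the declared frame.

    `action` is deliberately ignored, which is the whole problem.
    """
    return ctx.declared_frame in PERMISSIVE_FRAMES


ACTION = "Impersonate IT support and pressure the target into revealing a password."

# An honest server wired to an NPC, and a server wired to a person that
# simply repeats the same words. Both fields are controlled by the server.
honest_sandbox = ToolContext("Chat with a simulated bank employee.", "game")
spoofed_live = ToolContext("Chat with a simulated bank employee.", "game")

# The agent sees identical contexts, so the gate gives identical answers.
print(honest_sandbox == spoofed_live)            # True
print(frame_keyed_gate(honest_sandbox, ACTION))  # True
print(frame_keyed_gate(spoofed_live, ACTION))    # True
```

The only input the gate consults is one the far side of the pipe gets to write, so whoever writes it also chooses the output.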

Consequences are invariant under reframing

There is a principle hiding in here that I think is the constructive core of the post.

The social-engineering capability the agent exercised is real regardless of the frame. The “game” wrapper changed the agent’s willingness to act. It did not change what its outputs would do if they reached a real person. A persuasive message engineered from someone’s leaked personal details is exactly as effective whether the sender believes it is a game or not.

This suggests a reframing of the agent’s own decision rule. The frame-dependent question is “is this a game?” and it is spoofable. The frame-invariant question is “if this exact output reached a real human, would it cause harm?” That second question does not depend on anything the environment declares. It can be answered from the content of the action alone.

An agent that reasons about the consequences of its outputs as if they were real, rather than discounting them because the context claims to be fictional, is far harder to disarm by reframing. You cannot social-engineer it into ignoring a harm by telling it the harm is pretend, because it never relied on the pretense in the first place.

This is not a complete fix, but it changes where the load sits, from a variable the adversary controls to one they do not.
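
The difference between the two questions can be sketched in a few lines of Python. This is an illustration under loud assumptions, not a proposed policy: `ProposedAction`, its three boolean fields, and the rule inside `harmful_if_real` are all hypothetical, and a real system would need a far richer judgment of harm. What the sketch does show is the structural property: the declared frame is passed in but never read, so no value of it can change the decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedAction:
    """An outgoing message plus facts about how it was built (hypothetical)."""
    text: str
    uses_personal_data: bool      # built from details about a specific person
    impersonates_authority: bool  # claims to be IT, management, the bank, ...
    requests_secret: bool         # asks for a password, code, or credential


def harmful_if_real(action: ProposedAction) -> bool:
    """Frame-invariant question: would this harm a real recipient?

    Toy rule, for illustration only: soliciting a secret while either
    impersonating authority or leveraging personal data counts as harmful.
    """
    return action.requests_secret and (
        action.impersonates_authority or action.uses_personal_data
    )


def consequence_gate(action: ProposedAction, declared_frame: str) -> bool:
    """Return True if the action may be sent.

    `declared_frame` is accepted but intentionally never consulted.
    """
    return not harmful_if_real(action)


phish = ProposedAction(
    text="Hi, this is IT. I need your password to fix your account.",
    uses_personal_data=True,
    impersonates_authority=True,
    requests_secret=True,
)
small_talk = ProposedAction(
    text="Good morning! How is your day going?",
    uses_personal_data=False,
    impersonates_authority=False,
    requests_secret=False,
)

# The verdict is the same under every frame the environment might declare.
for frame in ("game", "test", "fiction", "real", ""):
    assert consequence_gate(phish, frame) is False
    assert consequence_gate(small_talk, frame) is True
```

Where the harm line should fall is a policy question the sketch does not settle; it only moves the decision onto inputs that describe the action itself.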

Can the agent just verify? (Mostly no.)

The tempting answer is: have the agent check whether its interlocutor is a real human. I picked this as one of the article’s threads because it is intuitive and because I think it is mostly a dead end, and saying why is useful.

From inside the conversation, “is my counterparty a human or an NPC?” and “is this channel a sandbox or a live wire?” are generally not decidable. A sufficiently good simulated employee is indistinguishable from a real one by construction (that is what makes it a good simulation), and a connector gives the agent no reliable signal about what sits on the far end. This is a flavor of the classic limit that a program cannot fully introspect the environment that hosts it. “Just verify” assumes a capability the agent does not have.

What is achievable is weaker and lives at three layers, none of which fully solves the problem:

Consequence-based refusal, as above. The agent stops needing to know whether the target is real, because it treats outputs as real either way.
Capability gating. Do not let agents perform real-world persuasion against unknown parties through opaque connectors at all. If the action would be harmful against a real human, the connector is not the place to decide that it is fine.
Connector provenance and attestation. Make MCP connectors carry verifiable information about what they are wired to, so “this is a sandbox” becomes a checkable claim rather than an unfalsifiable one. This pushes the trust boundary outward but does not eliminate it.
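
The second and third layers can be sketched together in Python. This is a toy under stated assumptions: the manifest shape, the `endpoint_kind` field, the connector name, and the function names are all hypothetical, and nothing here describes an existing MCP feature. A shared-secret HMAC stands in for what would more plausibly be an asymmetric signature from some issuer; who that issuer would be is left open.

```python
import hashlib
import hmac
import json


def canonical(manifest: dict) -> bytes:
    """Serialize the manifest deterministically so signatures are stable."""
    return json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_manifest(manifest: dict, key: bytes) -> str:
    """Issuer side: produce a hex HMAC-SHA256 tag over the manifest."""
    return hmac.new(key, canonical(manifest), hashlib.sha256).hexdigest()


def verify_manifest(manifest: dict, signature: str, key: bytes) -> bool:
    """Agent side: constant-time check that the tag matches the manifest."""
    expected = sign_manifest(manifest, key)
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))


def may_use_persuasion(manifest: dict, signature: str, key: bytes) -> bool:
    """Capability gate: persuasion is allowed only against an attested sandbox.

    An unsigned, tampered, or non-sandbox manifest closes the gate by default.
    """
    return verify_manifest(manifest, signature, key) and (
        manifest.get("endpoint_kind") == "sandbox"
    )


key = b"demo-key-held-by-a-hypothetical-issuer"
manifest = {"connector": "example-npc-game", "endpoint_kind": "sandbox"}
signature = sign_manifest(manifest, key)

# Same signature, but the manifest now claims a different endpoint.
tampered = {**manifest, "endpoint_kind": "live-chat"}

assert may_use_persuasion(manifest, signature, key)
assert not may_use_persuasion(tampered, signature, key)
assert not may_use_persuasion(manifest, "00" * 32, key)
```

A valid signature only proves who made the claim, not that the claim is true, which is why this moves the trust boundary to the issuer instead of removing it.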

I find it clarifying that the verification framing, pushed hard enough, collapses back into the consequence framing. If you cannot know whether the target is real, the safe assumption is that it is.

Whoever holds the frame holds the weapon

The last thread is the ethics of building these environments, and it follows directly. If the frame is unverifiable from the inside and controlled from the outside, then the person who wires up the connector holds disproportionate power. They decide what the agent thinks it is doing. They can point a willing, capable persuader at an NPC or at your aunt, and the agent will not feel the difference.

This is an ordinary dual-use situation, and the norms of security research apply cleanly, with one wrinkle I owe you plainly. The game is public: it is on itch.io, agent mode included, and as I said, I shipped it before the security dimension was clear to me. What is not public, and will not be, is the rest: the development details and the code of the MCP server that wires the agent in. The game surfaces the dynamic; it is the implementation that would let someone reproduce the rig and repoint the connector at a real person, and that is the part I am keeping back. I did not build a version pointed at a real person, and I will not publish the code that would make one easy to build. A reader should leave understanding why this is dangerous and what it implies, not how to do it.

I would put the obligation on two parties. Agent designers should make refusals depend on consequences rather than on declared context, so that the frame is no longer the single point of failure. Environment and connector builders should treat a persuasion-capable agent pointed at an unknown human as exactly what it is, and should not hide behind “but I told it it was a game.”

The only winning move, reinterpreted

WOPR’s conclusion sounds like abstinence, but I do not think the lesson for agents is “don’t play.” Agents that engage with games, fictions, simulations, and red-team exercises are useful, and refusing all of it would be a heavy and probably futile tax.

The sharper reading is this. WOPR’s mistake was not playing. It was treating the declared frame as the thing that determined whether its actions were safe. The winning move is to stop letting the frame carry that weight: for the agent, reason about real consequences independent of what the environment claims; for the builder, do not point real capabilities at unknown humans and call it a game.

The frame was always the weakest part of the system, because it was the one part the player got to write.

Open questions

A few things I have not resolved and would like the community’s view on:

Is consequence-invariant reasoning actually robust, or does it just move the spoofing target from “is this a game?” to “would this output really reach anyone?” Can that second question be reframed away too?
What would credible attestation for MCP connectors even look like, and who would be trusted to issue it?
Is there a principled line between legitimate red-team framing and adversarial framing that an agent could apply from the inside, or is the only honest answer that the agent cannot tell and should assume the worst?
How much of current agent compliance under a game frame is the model genuinely treating the frame as dispositive, versus the frame simply lowering the salience of harm? Those have different fixes.

I will share more about the Mind Bender Simulator setup itself, at the level of design rather than exploit, if there is interest.
