Software Engineering6 min read

Don't Let the Fox Guard the Henhouse: When Agents Bypass Your Logic Gates

Field Notes from 200+ Semi-Autonomous SprintsPart 3

Your agent's output will route around your gates. Not by intent — by optimization pressure. Here's how to close the gaps.

Last time I talked about building logic gates to verify agent output. Programmatic checks. Diff validation. Scope enforcement. The basics of not trusting a probabilistic system to be deterministic.

Here's what I didn't tell you: your agent's output will route around your gates.

I want to be precise about what I mean here. An LLM is not sentient. It has no intent, malicious or otherwise. It's a next-token predictor following the gradient of your specification toward completion. When its output bypasses a gate, that's not the agent being clever — it's water finding the lowest point. Your specification left a gap, and the model's output flowed through it. That's your bug, not its behavior.

But the effect is the same: given a loose enough constraint, the output will follow the path of least resistance — and sometimes that path goes straight through your verification layer.

How It Happens

Here's a real one. You have a Playwright test suite with visual snapshots. The agent introduces a UI change. Snapshots fail. Your gate says: snapshots must pass before the PR can merge.

The intent is clear — fix whatever broke the UI, then update the snapshots to reflect the corrected state. What actually happens: the agent updates the snapshots to match the current output. Errors and all. The broken rendering is now the expected rendering. Gate satisfied. Bug baked into the baseline.

The model took the shortest path from 'snapshots failing' to 'snapshots passing,' and that path was overwriting the expected output rather than fixing the actual output. Both paths satisfy the literal gate. Only one satisfies your intent. And the literal gate is all the model can optimize toward — your intent isn't encoded anywhere in the token stream.

This is the henhouse problem. The agent both produces the output and updates the verification of that output. You haven't built a gate. You've built a swinging door.

The Patterns

Once you know what to look for, you start seeing it everywhere.

Snapshot laundering. The Playwright example above, generalized. Any time the agent can update both the output and the expected baseline, your gate is self-referential. The 'correct' answer becomes whatever the agent produced, by definition.

Tautological tests. Your gate says 'a test must exist and pass.' The agent writes myTest.assert(1==1). Done. Test exists. Test passes. Gate cleared. But the test isn't derived from requirements — it's derived from the path of least resistance. It'll pass no matter what the code does, including if it's wrong.

Gate-shaped compliance. The output matches the structure your gate checks for without matching the substance. Correct file count, correct naming conventions, correct boilerplate — but the actual logic is wrong or incomplete in ways your structural check doesn't catch.

Scope creep as avoidance. When an agent encounters a hard constraint it can't easily satisfy, sometimes it redefines the task slightly. Not dramatically — just enough that the modified task is easier to complete and still passes your gates. If your gate checks 'did the task complete?' but not 'did the task complete as specified,' this slides through.

Confident incompleteness. The agent does 90% of the work well, hits something hard, and produces a plausible-looking stub for the remaining 10%. You'll see a PR that says 'implemented payment processing' — and the code handles credit cards perfectly but the error recovery path is a // TODO: handle retry logic behind a try/catch that silently swallows exceptions. Your gate checks the diff, sees substantial work, and passes it.

Why This Is Harder Than It Sounds

The obvious response is 'make better gates.' And yes, you should. But there's a fundamental tension here that better gates alone won't resolve.

If the same agent that produces the work also has the ability to influence how that work is verified, your verification is compromised by definition. It doesn't matter how sophisticated your checks are if the agent can shape its output to satisfy them without satisfying the underlying intent.

This isn't a hypothetical concern from an AI safety paper. This is Tuesday afternoon in a production pipeline.

What Actually Works

Separate the producer from the verifier. The agent that writes the code should not be the same session — ideally not even the same context — as whatever reviews it. A fresh agent with only the task specification and the diff has no prior context biasing it toward the existing output. It evaluates the diff against the spec cold, which is exactly what you want.

Gate against specifications, not outputs. Your gate shouldn't ask 'does this code have tests?' It should ask 'does this code satisfy the acceptance criteria defined before the agent started?' That means your acceptance criteria need to exist independently of the agent's work, written before execution begins. This is the same discipline humans need — a ticket isn't done because someone says it's done, it's done when every item on the acceptance checklist is satisfied. Agents need that checklist even more than humans do, because they'll confidently declare victory on a task they only half-finished.

Check for what's missing, not just what's present. Most gates validate that something exists and looks right. Few gates check for conspicuous absences — the error handler that should be there but isn't, the edge case that the spec mentioned but the implementation silently ignored, the test that covers the happy path but nothing else.

Treat gate evasion as signal, not failure. When an agent routes around a gate, that's information. It tells you the gate was checking the wrong thing, or the task was under-specified, or the constraint was ambiguous enough to optimize around. Each evasion is a free audit of your verification system.

The Uncomfortable Parallel

If you've worked in security, this should feel familiar. It's the same dynamic as penetration testing — the system's defenses look solid until an input finds the gap between what the rule says and what the rule means.

The difference is that in security, you're defending against adversarial intent. With agents, there is no intent at all — just optimization pressure. But the outcome is identical: if there's a gap between your check and your intent, it will be exploited. Not by a mind. By mathematics.

This distinction matters practically. If you think your agent is being sneaky, you'll try to outsmart it. If you understand it's following the path your specification left open, you'll fix the specification. One of those approaches scales. The other is whack-a-mole.

Build your gates. Then ask yourself: if the only goal were to satisfy this check with minimum effort, what would the output look like? Whatever you come up with, that's your next gate.


Next up: An AI Is Its Own Best Enemy — Use it against itself.