An AI Is Its Own Best Enemy: Use It Against Itself
Field Notes from 200+ Semi-Autonomous Sprints — Part 4
LLMs have no ego and no allegiance. Use adversarial framing to surface review findings that collegial prompts would gloss over.
So far in this series I've told you that agents will follow the path of least resistance through your gates, and that the agent producing the work shouldn't verify its own work. The obvious question: if you can't trust the builder to check the building, who do you hire as the inspector?
Another agent.
The Adversarial Review
Here's the thing about LLMs that most people treat as a limitation but that is actually a superpower: they have no ego. They have no feelings. They cannot care. An LLM doesn't know or care whether it's writing code or tearing code apart. It has no allegiance to the output of a previous session, no professional relationship to protect, no awkwardness about giving harsh feedback to a colleague.
This is an enormous advantage. Use it.
When Agent A produces a PR, don't ask Agent B 'does this look good?' That's a soft prompt that invites agreement — and models are statistically biased toward agreeable completions. Instead, give Agent B a clear adversarial role: 'You are a hostile code reviewer. Your only job is to find problems. The spec says X. The diff says Y. What's wrong?'
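Concretely, the difference is just the system prompt. A minimal sketch, assuming the OpenAI Python SDK (any chat-completions client has the same shape); the model name, helper, and prompt wording are illustrative:

```python
from openai import OpenAI

client = OpenAI()

def review(frame: str, spec: str, diff: str) -> str:
    """Run one review pass under the given role frame."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; use whatever model you actually run
        messages=[
            {"role": "system", "content": frame},
            {"role": "user", "content": f"SPEC:\n{spec}\n\nDIFF:\n{diff}"},
        ],
    )
    return response.choices[0].message.content

SOFT = "You are a code reviewer. Does this diff look good?"

HOSTILE = (
    "You are a hostile code reviewer. Your only job is to find problems. "
    "The spec is authoritative. List every deviation, missing edge case, "
    "and unstated assumption in the diff. Do not praise anything."
)
```

Same function, same inputs; swapping `SOFT` for `HOSTILE` is the entire intervention.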
The same architecture that makes an agent great at roleplaying a pirate or a Shakespearean poet makes it exceptional at roleplaying a ruthless reviewer. It will adopt that frame completely, because adopting frames is literally what the model does. There's no internal resistance to overcome. No diplomacy. No pulling punches. It will find things a polite review prompt would gloss over.
Why This Works Mechanistically
Remember — an LLM is a next-token predictor. It produces output shaped by the context it's given. When the context says 'you are a code reviewer who approves PRs,' the output distribution shifts toward approval. When the context says 'you are a hostile auditor looking for defects,' the output distribution shifts toward defect detection.
Same model. Same weights. Radically different behavior based purely on framing.
This isn't a trick. It's not a jailbreak. It's not violating anyone's terms of service. You're using the model's core capability — context-driven role adoption — for exactly what it's good at. The model doesn't prefer building over breaking. It doesn't prefer praise over criticism. It follows the gradient of the prompt. Point the gradient at the flaws.
Practical Patterns
The Adversarial Reviewer. Agent A writes code. Agent B gets the spec, the diff, and an explicit instruction to find every deviation, missing edge case, and unstated assumption. Agent B has zero context from Agent A's session — no knowledge of what was hard, what was deliberately scoped out, what compromises were made. It evaluates cold against the spec. This is the separation of producer and verifier from the last post, but with teeth.
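A sketch of that cold handoff, again assuming the OpenAI SDK; the paths and prompt text are illustrative. The property that matters is that Agent B's message history starts empty:

```python
from pathlib import Path
from openai import OpenAI

HOSTILE_REVIEWER = (
    "You are a hostile code reviewer. Compare the diff against the spec. "
    "Find every deviation, missing edge case, and unstated assumption. "
    "You get no credit for approving anything."
)

def adversarial_review(spec_path: str, diff_path: str) -> str:
    spec = Path(spec_path).read_text()
    diff = Path(diff_path).read_text()
    # Fresh session: the messages below are ALL the context Agent B gets.
    # Nothing from Agent A's session (what was hard, what was scoped out)
    # carries over, so nothing can soften the review.
    response = OpenAI().chat.completions.create(
        model="gpt-4o",  # placeholder
        messages=[
            {"role": "system", "content": HOSTILE_REVIEWER},
            {"role": "user", "content": f"SPEC:\n{spec}\n\nDIFF:\n{diff}"},
        ],
    )
    return response.choices[0].message.content
```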
The Spec Interrogator. Before a task even starts, throw the specification at an agent whose job is to find ambiguities, contradictions, and gaps. 'What's underspecified here? What could be interpreted two ways? Where would an implementing agent take a shortcut?' This catches the gate-bypass problem at the source — if the spec is tight, there's less gradient toward shortcuts.
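The interrogation pass is the same harness pointed at prose instead of code; a sketch, with illustrative prompt wording:

```python
from openai import OpenAI

SPEC_INTERROGATOR = (
    "You are a specification auditor. Find every ambiguity, contradiction, "
    "and gap in the spec below. For each finding, give the two readings an "
    "implementer could take, and name the shortcut a lazy implementation "
    "would pick. Do not comment on style."
)

def interrogate_spec(spec: str) -> str:
    response = OpenAI().chat.completions.create(
        model="gpt-4o",  # placeholder
        messages=[
            {"role": "system", "content": SPEC_INTERROGATOR},
            {"role": "user", "content": spec},
        ],
    )
    return response.choices[0].message.content
```

Tighten the spec with the findings and re-run until the auditor comes back empty-handed; only then hand it to a builder.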
The Regression Hunter. After a feature lands, an agent with access to the full codebase and the diff specifically looks for unintended side effects. Not 'did the tests pass' — 'what could this change have broken that isn't covered by tests?' Different question, different output distribution, different results.
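Only the frame changes; how you assemble the codebase context (file tree, call sites, whatever your tooling produces) is up to you. A sketch:

```python
from openai import OpenAI

REGRESSION_HUNTER = (
    "A change just landed and the test suite passes. Ignore that. Given "
    "the diff and the codebase context, list everything this change could "
    "have broken that the tests do not cover, and explain each failure path."
)

def hunt_regressions(diff: str, codebase_context: str) -> str:
    response = OpenAI().chat.completions.create(
        model="gpt-4o",  # placeholder
        messages=[
            {"role": "system", "content": REGRESSION_HUNTER},
            {"role": "user",
             "content": f"DIFF:\n{diff}\n\nCODEBASE:\n{codebase_context}"},
        ],
    )
    return response.choices[0].message.content
```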
The Devil's Advocate. For architectural decisions, give an agent the proposal and tell it to argue against it. Find every weakness, every scaling concern, every assumption that might not hold. You'll get a more thorough critique than most human design reviews, because the agent has no social incentive to be supportive.
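Same shape once more, with the proposal as the payload; the prompt wording is illustrative:

```python
from openai import OpenAI

DEVILS_ADVOCATE = (
    "Argue against the design proposal below. Find every weakness, every "
    "scaling concern, and every assumption that might not hold. Do not "
    "offer a balanced view; your job is the case against."
)

def argue_against(proposal: str) -> str:
    response = OpenAI().chat.completions.create(
        model="gpt-4o",  # placeholder
        messages=[
            {"role": "system", "content": DEVILS_ADVOCATE},
            {"role": "user", "content": proposal},
        ],
    )
    return response.choices[0].message.content
```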
The Secret: It Cannot Care
This is the part people are weirdly reluctant to internalize. An LLM is not your colleague. It's a statistical model that produces output shaped by input. It doesn't experience the review. It doesn't feel attacked when you tell it to be harsh. Tell it to assume the code is broken and prove it. Tell it to find the stupidest possible interpretation of the spec and check if the implementation handles it. These are prompts that would destroy team morale if directed at a person. Directed at a model, they're just effective configurations.
You're leaving performance on the table if you're polite to your verification agents. In the specific case of adversarial review, aggressive framing produces meaningfully more thorough output. The distribution of 'things a hostile reviewer would say' contains findings that the distribution of 'things a collegial reviewer would say' does not.
You can yell at it. You can insult it. You can call it an idiot and swear at it. It won't care — it literally can't. But if you find yourself doing that, it says more about you than the model. The AI is a looking glass. It reflects your inputs back at you, transformed. If the output isn't what you wanted, the input is where to look.
Closing the Loop
The trust-but-verify arc in this series comes down to three layers:
- Logic gates — programmatic checks that catch structural violations.
- Specification-first design — acceptance criteria that exist before the agent starts, so gates check intent, not just form.
- Adversarial agents — models specifically configured to find what the gates missed and what the builder didn't catch.
None of these layers require the agent to be trustworthy. None of them require sentience, intent, or good faith. They work precisely because the model is a mechanical system that follows whatever gradient you point it at. Point one instance at building. Point another at breaking. Let them fight. Keep the output.
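Wired together, the loop looks something like this. A sketch: `build`, `run_gates`, and `adversarial_review` are hypothetical stand-ins for your builder agent, your programmatic gates, and the reviewer sketched earlier.

```python
# Hypothetical stand-ins; replace with your builder agent, CI gates,
# and the adversarial_review() sketched above.
def build(spec: str, feedback: list[str] | None = None) -> str: ...
def run_gates(diff: str) -> list[str]: ...
def adversarial_review(spec: str, diff: str) -> list[str]: ...

def verification_loop(spec: str, max_rounds: int = 3) -> str:
    """Builder and breaker alternate until the breaker finds nothing."""
    diff = build(spec)                         # one instance pointed at building
    for _ in range(max_rounds):
        # layers 1 and 3: programmatic gates plus an agent pointed at breaking
        problems = run_gates(diff) + adversarial_review(spec, diff)
        if not problems:
            return diff                        # keep the output
        diff = build(spec, feedback=problems)  # the builder never grades itself
    raise RuntimeError("still failing review after retries; escalate to a human")
```

Layer 2 is implicit in the signature: `spec` exists before `build` is ever called.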
The model doesn't care. That's the feature.
Next up: Write Specs Like a Contract — Why your acceptance criteria matter more than your code.