Build to Scale, Not Scale to Build
Field Notes from 200+ Semi-Autonomous Sprints — Part 7
Before you spin up a fleet of AI workers, ask whether you should. Most teams are bottlenecked on specification, not execution.
Everyone wants the multi-agent pipeline. Parallel workers. Autonomous sprints. Fifty PRs a day. I get it — I built one. But the question nobody asks before spinning up their fleet of AI workers is the one that matters most: should you?
The Backlog Test
Before you think about scaling, look at your backlog. Not the aspirational roadmap. Not the 'someday' list. The actual, well-defined, acceptance-criteria-ready work that's waiting to be done.
If your backlog doesn't have enough well-specified work to keep multiple agents busy, you don't have a throughput problem. You have an ideas problem, or a specification problem, or a prioritization problem. Throwing more agents at an underdefined backlog doesn't produce more output — it produces more chaos. Agents need clear tasks. Vague tasks at high parallelism just mean you're burning tokens on wandering at scale.
The honest test: can you write ten independent, well-scoped stories right now that could be worked in parallel without stepping on each other? If not, you're not ready to parallelize. And that's fine. Scaling isn't a badge of honor. Shipping is.
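To make the test concrete, here's a minimal sketch in Python, assuming a hypothetical Story record with written acceptance criteria and a rough set of touched paths as a stand-in for independence. The names (Story, parallel_ready, touched_paths) are illustrative, not any tracker's API:

```python
from dataclasses import dataclass

@dataclass
class Story:
    title: str
    acceptance_criteria: list[str]   # empty list = underspecified
    touched_paths: set[str]          # rough blast radius of the change

def parallel_ready(backlog: list[Story], needed: int = 10) -> bool:
    """True if `needed` well-specified, non-overlapping stories exist."""
    # Only stories with written acceptance criteria count as ready.
    ready = [s for s in backlog if s.acceptance_criteria]
    # Greedily claim stories whose touched paths don't collide with
    # anything already claimed: a crude stand-in for "independent".
    claimed: set[str] = set()
    picked = 0
    for story in ready:
        if story.touched_paths.isdisjoint(claimed):
            claimed |= story.touched_paths
            picked += 1
    return picked >= needed
```

The disjointness check is deliberately crude. If your stories can't even be expressed at this fidelity, the test has already answered itself.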
Most solo developers and small teams are bottlenecked on specification and prioritization, not execution. An agent that can ship ten PRs a day is useless if you can only define three stories a week. Fix the constraint that's actually limiting you before you buy capacity for a constraint that isn't.
Scale Is a Dial, Not a Switch
Here's something nobody shows you in the 'look at my 56-PR day' screenshots: the days when the backlog is empty and one agent is plenty.
Throughput demand is bursty. Some weeks you're staring at a stacked backlog after a planning session and four parallel workers can barely keep up. Other weeks you've shipped everything, the next milestone is still being scoped, and spinning up multiple agents would just be paying for idle capacity to look impressive.
Scaling up is half the skill. Scaling back down is the other half. The pipeline that ran four workers yesterday should run one today if that's what the work demands. Running parallel agents on an empty backlog isn't throughput — it's vanity. And it's expensive vanity, because real parallelism costs real money: more API calls, higher-tier subscriptions, more compute.
Match capacity to demand. If you're hesitant to pay more to scale up for a particular burst, that hesitation is signal — either the expected return doesn't justify the cost, or you're not confident enough in your pipeline to trust it at higher volume. Both are valid reasons to stay where you are. But the reverse is also true: when the backlog is stacked and the work is well-specified, that's when you turn the dial up and spend aggressively.
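One way to picture the dial is a sizing function that derives today's worker count from the ready backlog rather than from habit. A minimal sketch, with illustrative assumptions baked in: a proven worker clears roughly three well-scoped stories a day, and the pipeline has only been validated up to four workers:

```python
def workers_for_today(ready_stories: int,
                      stories_per_worker_day: int = 3,
                      max_workers: int = 4) -> int:
    """Size the worker pool from demand, not from a fixed habit."""
    if ready_stories == 0:
        return 0  # empty backlog: spin everything down
    # Ceiling division: just enough workers to clear the ready queue
    # today, capped at what the pipeline has been proven to handle.
    demand = -(-ready_stories // stories_per_worker_day)
    return min(demand, max_workers)
```

With these numbers, a stacked backlog of 11 ready stories gets 4 workers, a quiet week with 2 stories gets 1, and an empty backlog gets 0. The constants are yours to measure; the shape is the point.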
The worst version of this is locking yourself into a fixed capacity — four workers, always, regardless of workload. You'll oscillate between 'not enough agents' and 'paying for agents with nothing to do.' Treat scale as elastic. The 56-PR day is a peak, not a baseline. Most days aren't that, and your spending shouldn't pretend otherwise.
Master the Basics First
A single agent running well — with clean context management, proper checkpointing, tight scoping, logic gates, adversarial review — will outperform three agents running on a shaky foundation. Every problem I've discussed in this series gets amplified by scale. Context waste multiplies. Gate bypasses multiply. Wandering multiplies. Stale checkpoints multiply.
Build the system that works at one. Prove it at one. Encode the conventions, the gates, the context discipline. Then scale. The scaling itself is the easy part once the foundation holds.
Best Effort Is the Only Honest Contract
Here's something you need to accept before you run parallel agents: they will not have a 100% success rate. The best models in the world score around 80% on SWE-bench Verified — curated, single-file Python tasks with known solutions. On SWE-bench Pro, which uses multi-file, multi-language tasks that models haven't seen in training, that drops to roughly 46%. In one study of real-world enterprise tasks, the best agent succeeded on its first attempt just 24% of the time. When tested for consistency across eight runs, the top score was 13.4%.
Those are benchmarks, not production pipelines — your results will vary based on task complexity, scoping quality, and how much orchestration you build around the raw model. But the direction is clear: agents are probabilistic systems, and designing for guaranteed success is designing for disappointment.
Every delegation to a sub-agent should be treated as best-effort. The work must be structured so that partial results are salvageable. If an agent completes 80% of a task and fails on the remaining 20%, you need to be able to merge the 80% and pick up the rest — not throw the whole thing away and retry from zero. Tasks should produce incremental, committable work — not a single monolithic output that's all-or-nothing. Your orchestration needs to handle partial completion gracefully — detect what landed, route what didn't to a retry or a human, and move on.
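What that looks like in code, as a minimal sketch: assume a hypothetical AgentResult report where completed and failed subtasks coexist, with in-process queues standing in for real retry and escalation machinery (merge_partial, retry_queue, and human_queue are illustrative names, not any framework's API):

```python
from dataclasses import dataclass
from queue import SimpleQueue

@dataclass
class AgentResult:
    completed: list[str]   # subtask ids that produced committable work
    failed: list[str]      # subtask ids that did not land

retry_queue: SimpleQueue = SimpleQueue()   # failed work to re-attempt
human_queue: SimpleQueue = SimpleQueue()   # exhausted work for a person

def merge_partial(subtask_id: str) -> None:
    # Stand-in for merging the subtask's branch; a real pipeline
    # would call its VCS tooling here.
    print(f"merged {subtask_id}")

def handle(result: AgentResult, attempt: int, max_retries: int = 2) -> None:
    # Keep the 80% that landed; never retry the whole task from zero.
    for subtask in result.completed:
        merge_partial(subtask)
    # Route the rest: retry with fresh context, then escalate to a human.
    for subtask in result.failed:
        (retry_queue if attempt < max_retries else human_queue).put(subtask)
```

The shape is what matters: completed and failed live in one result, and neither path blocks the other.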
The teams that struggle with scaling are invariably the ones that designed for 100% success rates. When the first agent fails — and it will — their entire workflow stalls because nothing downstream can handle the gap. The teams that scale successfully are the ones that built for the actual distribution of outcomes: mostly successful, sometimes not, always recoverable.
The Real Flex
It's tempting to measure your pipeline by peak capacity. How many agents can you run? How many PRs per day? Those numbers feel impressive on a slide or a tweet.
But the real metric is reliability at whatever scale you're operating. One agent that ships clean, verified, well-scoped work every single time is more valuable than five agents that produce a mix of good output and subtle garbage that takes human time to sort through.
Build to scale. Don't scale to build.
Next up: Metrics and the Slow Killer — Why your pipeline is decaying and you can't feel it.