Metrics and the Slow Killer
Field Notes from 200+ Semi-Autonomous Sprints — Part 8
Your pipeline is decaying 5% per sprint and you can't feel it. Four metrics that catch the slow killer before it catches you.
Your pipeline is working. Stories ship. PRs merge. And somewhere in the back of your brain, a voice whispers: 'I bet I could make this 15% more efficient.'
That voice will break your pipeline. I guarantee it.
The Temptation That Kills
You've internalized the failure modes. You've read a series like this one. You want to optimize — tighten the context management, refine the model routing, squeeze out wasted tokens. This is a healthy instinct. It's also the most dangerous moment in your pipeline's life.
Because the thing about a working pipeline is that it's working. It might not be elegant. It might be burning tokens you could save. But it's producing output. Reliably. Every day. The moment you start tuning it in place, that reliability is at risk — and you won't know it's gone until something that used to work doesn't, and you're not sure which of the four changes you made this morning caused it.
Agent pipelines are not traditional software. You can't write a unit test that covers the full behavior of a probabilistic system. A prompt change that improves one task category might degrade another. A context optimization that saves tokens in the common case might remove information the agent needed for an edge case you haven't seen in two weeks. The feedback loops are long, the failure modes are subtle, and the blast radius of a 'small change' is unpredictable.
And here's what nobody talks about: breaking your own pipeline is genuinely upsetting. You had something that worked. You touched it because you wanted better. Now it's broken and every minute debugging is a minute not shipping. The emotional response is to rush the fix — revert everything, or worse, pile another change on top of the broken one. This is how pipelines accumulate scar tissue. Layers of changes and counter-changes that nobody fully understands, each added under pressure.
The fix is boring: test your changes somewhere that isn't your production pipeline. Route a few new tasks through the tuned version before you promote it. Did they complete? Did token consumption change? Did the agent behave differently in ways you didn't intend? If you're eyeballing one or two outputs and saying 'looks good,' you're not testing — you're hoping.
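Concretely, that testing pass can be a ten-line harness. Everything below is a hypothetical sketch: `run_pipeline`, the config objects, and the result fields are stand-ins for whatever your pipeline actually exposes.

```python
# Hypothetical harness: run the same tasks through the current config and
# the tuned candidate, then compare numbers instead of eyeballing output.
def shadow_test(tasks, current_config, candidate_config, run_pipeline):
    def summarize(config):
        runs = [run_pipeline(task, config) for task in tasks]
        return {
            "completion_rate": sum(r["completed"] for r in runs) / len(runs),
            "total_tokens": sum(r["tokens"] for r in runs),
        }

    return {
        "current": summarize(current_config),
        "candidate": summarize(candidate_config),
    }
```

If the candidate's numbers aren't at least as good on tasks it has never seen, it doesn't get promoted. That's the whole gate.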
You Cannot Tune by Feel
Here's the uncomfortable truth that connects the tuning problem to something deeper: you cannot evaluate pipeline changes by feel. Your intuition about whether the output 'seems good' is unreliable, especially for subtle degradations that accumulate over time.
Your pipeline was completing 96% of stories last sprint. This sprint it's 91%. Next sprint it'll be 86%. Each individual sprint feels roughly the same — 'mostly good, a couple of hiccups.' You never notice the trend because each drop is within the noise of daily variation. Six months later you're wondering why your pipeline feels unreliable, and the answer is that it's been getting worse for months and you had no instrument telling you.
You need numbers. Not a data science project. Not Grafana dashboards on day one. A handful of metrics, tracked consistently, that tell you whether things are getting better or worse.
The Metrics That Matter
Completion rate. Of the stories assigned to a sprint, what percentage were completed to acceptance criteria? This is your headline number. Everything else is diagnostic. Be honest about what 'completed' means — a story that technically merged but required significant human cleanup isn't completed, it's salvaged. Inflating this metric defeats the purpose.
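A sketch of the honest version, with a hypothetical Story record. The point is that 'salvaged' is its own status and never counts toward the headline number:

```python
from dataclasses import dataclass

@dataclass
class Story:
    # "completed" means merged AND met acceptance criteria without
    # significant human cleanup; "salvaged" means merged but cleaned up.
    status: str  # "completed" | "salvaged" | "failed"

def completion_rate(stories: list[Story]) -> float:
    completed = sum(1 for s in stories if s.status == "completed")
    return completed / len(stories)
```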
First-attempt success rate. Of tasks delegated to agents, what percentage succeeded without retry or human intervention? This is distinct from completion rate because retries mask problems. If your completion rate is 95% but your first-attempt rate is 60%, your pipeline is working — but it's working hard. A declining first-attempt rate is the earliest warning signal you'll get. It tells you something upstream has degraded before it shows up in your completion numbers.
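Side by side, the distinction is one extra condition. The task fields here are hypothetical:

```python
def first_attempt_rate(tasks: list[dict]) -> float:
    # Succeeded on the first try, with no human stepping in.
    clean = sum(
        1 for t in tasks
        if t["succeeded"] and t["attempts"] == 1 and not t["intervened"]
    )
    return clean / len(tasks)

def eventual_success_rate(tasks: list[dict]) -> float:
    # Succeeded eventually, retries and interventions included. The gap
    # between this number and the one above is how hard you're working.
    return sum(1 for t in tasks if t["succeeded"]) / len(tasks)
```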
Token consumption per story point. How many tokens does it cost, on average, to complete one story point of work? This captures everything — context waste, debugging rabbit holes, retries, wandering. When this number goes up without a corresponding increase in task complexity, something is wrong. Track it as a rolling average over ten or twenty tasks to smooth out the noise.
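A rolling window makes this a ten-line class. The window of twenty follows the guidance above; the rest is a sketch:

```python
from collections import deque

class TokensPerPoint:
    def __init__(self, window: int = 20):
        self.recent = deque(maxlen=window)  # last N per-task ratios

    def record(self, tokens: int, story_points: float) -> None:
        self.recent.append(tokens / story_points)

    def rolling_average(self) -> float:
        return sum(self.recent) / len(self.recent) if self.recent else 0.0
```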
Human intervention rate. How often does a human need to step in? This tells you how autonomous your pipeline actually is versus how autonomous you think it is. Track what triggers interventions. If the same category keeps requiring human involvement, that's a targeted improvement opportunity. If interventions are random and varied, you might be at your natural autonomy ceiling for the current task complexity.
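Tracking triggers can be as crude as a counter. The trigger names below are illustrative, not a taxonomy:

```python
from collections import Counter

interventions = ["flaky_tests", "unclear_spec", "flaky_tests",
                 "bad_merge", "flaky_tests"]

profile = Counter(interventions)
print(profile.most_common())
# [('flaky_tests', 3), ('unclear_spec', 1), ('bad_merge', 1)]
# One dominant trigger: a targeted fix. A flat spread: you may be at
# your autonomy ceiling for the current task mix.
```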
Using Them
Establish a baseline before you change anything. Run your pipeline as-is for at least five sprints and record every metric. Without this, you're guessing whether things improved.
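The baseline itself is nothing fancy, just a per-metric average over the sprints you recorded. The sprint dicts here are hypothetical:

```python
from statistics import mean

METRICS = ("completion_rate", "first_attempt_rate",
           "tokens_per_point", "intervention_rate")

def baseline(sprints: list[dict]) -> dict:
    assert len(sprints) >= 5, "record at least five sprints first"
    return {m: mean(s[m] for s in sprints) for m in METRICS}
```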
Review weekly, not daily. Daily numbers are noisy. Weekly trends smooth out the variance. If a metric has been declining for three consecutive weeks, that's a real signal.
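The three-week rule is easy to automate, assuming you keep weekly readings per metric, oldest first:

```python
def declining(weekly: list[float], weeks: int = 3) -> bool:
    # True if the last `weeks` readings each dropped from the one before.
    tail = weekly[-(weeks + 1):]
    return len(tail) == weeks + 1 and all(
        later < earlier for earlier, later in zip(tail, tail[1:])
    )
```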
Correlate, don't isolate. A drop in completion rate paired with a rise in token consumption says 'agents are struggling and burning tokens trying.' A drop in completion rate with stable consumption says 'agents are failing fast on something specific.' Different diagnostic, different fix.
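As a toy decision table, with illustrative thresholds you'd tune against your own baseline:

```python
def diagnose(completion_delta: float, tokens_delta: float) -> str:
    # Deltas are fractions relative to baseline: -0.05 is five points down.
    if completion_delta < -0.02 and tokens_delta > 0.10:
        return "agents are struggling and burning tokens trying"
    if completion_delta < -0.02:
        return "agents are failing fast on something specific"
    return "no correlated signal"
```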
And this is where the tuning problem closes the loop: when you test a pipeline change, run the same tasks before and after and compare these numbers. If completion rate holds, first-attempt rate improves, and token consumption drops — you have a real optimization. If any metric regresses, you have more work to do.
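That comparison reduces to a single predicate over two metric snapshots, measured on the same tasks. The dict shape is hypothetical:

```python
HIGHER_IS_BETTER = {"completion_rate", "first_attempt_rate"}

def promotable(before: dict, after: dict) -> bool:
    # Every metric must hold or improve; one regression blocks promotion.
    for metric, old in before.items():
        new = after[metric]
        if (new < old) if metric in HIGHER_IS_BETTER else (new > old):
            return False
    return True
```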
The Slow Killer
Pipelines don't usually die dramatically. They decay. A little more token waste here. A slightly lower success rate there. One more human intervention per sprint than last month. Each one is ignorable in isolation. Together, they're a trend, and trends have momentum.
Metrics are how you see the trend before the trend sees you. Trust the numbers over your gut. Your future self will thank you the first time a declining metric catches a problem you would have missed.
Next up: The Ground Moves Under You — What happens when the model you built your pipeline on changes overnight.