The Agentic Maturity Model: Where Are You, and Where Are You Going?
Field Notes from 200+ Semi-Autonomous Sprints — Prologue
A maturity model for AI-assisted coding — from copy-paste to semi-autonomous pipelines. Map your level and find your next step.
I've run over 200 semi-autonomous sprints through an AI coding pipeline. Parallel workers. Adversarial review. Automated sprint execution. The whole thing. And the question that matters most isn't 'how does it work?' — it's 'how do I get there from where I am?'
The honest answer: in stages. Nobody goes from zero to semi-autonomous pipelines in a weekend. There's a progression, and each level has its own problems, its own ceiling, and its own lesson that unlocks the next one. This series is about those lessons. But before we get into the details, let's map the terrain.
A note on what this series is not: it's not a tutorial. I'm not going to hold your hand through step-by-step instructions. This is a roadmap — the landmarks, the dead ends, and the cliffs I walked off so you can see them coming. The actual learning happens when you sit down, try something, break it, figure out why, and try again. That's how I learned everything in this series, and it's the only way that sticks.
Level 0: The Holdout
You're not using AI for coding at all. Maybe by choice, maybe out of skepticism, maybe your company just mandated it and you're staring at a blank chat window.
There's no shame here. Healthy skepticism of new tools is an engineering virtue. But if you've decided to try — or been told to — the only way forward is through. You can't evaluate a tool you've never used.
Before you start: understand what happens to your data. Read the privacy controls before you sign up, not after you've pasted your proprietary codebase into a chat window. Pick a subscription tier you can afford — not one your ambition picked for you. And if you're tempted by the API, don't. Not yet. Subscriptions are forgiving while you build habits. The API punishes every inefficiency with real money.
I have a whole post on the things I wish I'd known before I spent my first dollar — pricing traps, multi-provider strategy, token discipline, and why you should never try to outsmart the subscription model. Read it if you're at this stage. It'll save you weeks.
Level 1: The Copy-Paste Warrior
You copy code into the chat. You get a response. You paste it back into your editor. Repeat.
This is where everyone starts, and it's a perfectly respectable place to be. You're learning what the model is good at and where it falls apart. You're developing intuition about prompting, about how much context the model needs, about when it's confidently wrong.
The single most impactful thing you can do at this level: prompt like you're talking to a person. Not a magic 8-ball. Not a search engine. A sharp junior engineer who knows nothing about your project.
That means providing context. 'Fix this function' is a bad prompt. 'This function validates user input for a checkout form. It should reject empty strings and values over 255 characters. Right now it's letting empty strings through. Here's the function, and here's the test that's failing' — that's a prompt that gets useful output on the first try instead of the third.
The second habit to build early: pick the right model for the job. Not every task needs the most powerful model you have access to. Asking Opus 4.6 or ChatGPT 5.4 to rename a variable or reformat a JSON blob is like hiring a principal engineer to move a filing cabinet. It works, but you're burning money and rate-limited capacity on something a smaller, faster model handles just fine.
Here's a rough breakdown of what goes where. The specific models will change — the tiers won't:
The lightweight tier (~20% of your work). Documentation, very small changes, tightly scoped file exploration, reformatting, simple boilerplate. These are mechanical tasks that don't need reasoning. Use Claude Haiku 4.5 or ChatGPT 5 Mini. Fast, cheap, and more than capable.
The workhorse tier (~50% of your work). Planning, mid-level architecture, senior-level implementation, scoped bug fixes, debugging, writing ADRs or post-mortems, task specifications. This is your daily driver — the model you'll use more than any other. Claude Sonnet 4.6 or ChatGPT 5.3 Codex live here.
The heavy tier (~20% of your work). Security fixes, cross-cutting architectural concerns, high-risk implementations, novel problem-solving, your hardest debugging. This is your principal engineer. Claude Opus 4.6 or ChatGPT 5.4. Don't waste these on routine work — save them for the problems that actually need deep reasoning.
The thinking tier (~10% of your work). When even the heavy tier needs more horsepower. Complex multi-step reasoning, monorepo-wide analysis, agentic orchestration — problems where you need the model to really chew on something before answering. Extended thinking modes burn more tokens and take longer, but the quality delta on genuinely hard problems is real.
Start developing a feel for which tier a task belongs to. That instinct becomes critical later when you're routing tasks to different models automatically.
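If you want to make that instinct explicit, you can encode it. Here's a minimal shell sketch: the task labels and the mapping are my own assumptions, and the model aliases follow Claude Code's `--model` shorthand. Treat it as a thinking aid, not a policy.

```shell
# Map a task label to a model tier. The labels and the mapping are
# illustrative assumptions; tune them to your own workload.
pick_model() {
  case "$1" in
    docs|reformat|boilerplate)  echo "haiku"  ;;  # lightweight tier
    plan|implement|debug|adr)   echo "sonnet" ;;  # workhorse tier
    security|architecture)      echo "opus"   ;;  # heavy tier
    *)                          echo "sonnet" ;;  # default to the workhorse
  esac
}

pick_model security   # prints "opus"
# claude --model "$(pick_model debug)" -p "Why is this test flaky?"
```

The payoff comes later: once routing is a function instead of a gut feeling, it's one step away from being automated.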
The third thing — and maybe the most important: pick a real project. Not a tutorial. Not 'build a todo app.' Think about a pain point in your own life that's been bugging you. That tool you wished existed. That side project you never had time for. That spreadsheet workflow you've been doing manually for years. Use it as your learning surface. You'll push harder, hit real problems, and care more about the output than you ever would following someone else's guided exercise.
The copy-paste loop is slow, error-prone, and has zero scalability. That's fine. You're not trying to scale. You're trying to break things, learn what went wrong, and build intuition for what these models actually do well. Every bad output teaches you something. Every hallucination sharpens your sense of where the model needs more context. Every failed attempt is a data point you'll use later.
The ceiling here is clear: you are the bottleneck. Every interaction requires your hands on the keyboard, copying, pasting, and verifying. You'll know you're ready for Level 2 when you start thinking 'I wish it could just edit the file itself.'
Level 2: Agentic Workflow
The model operates inside your codebase. You give it a prompt, it reads files, makes changes, runs commands. No more copy-paste. Tools like Claude Code, Cursor, Windsurf, Copilot — they all live here. The agent is doing real work in-place.
Before you give an agent access to your filesystem: sandbox it. Never run an agentic coding tool with full permissions on your bare machine. The agent doesn't need access to your SSH keys, your credentials, your home directory, or anything outside the project it's working on. One bad prompt injection in a dependency, one hallucinated rm -rf, and you're having a very bad day.
The options depend on your platform. On macOS, Claude Code uses Seatbelt sandboxing out of the box — enable it. On Linux, bubblewrap (bwrap) provides lightweight filesystem and network isolation without the overhead of a full container — Claude Code supports it natively, and it's what I'd recommend. On Windows, you shouldn't be running agents on bare Windows at all — use WSL2 (which supports bubblewrap) or a VM. Docker is a solid fallback on any platform if you want heavier isolation. The specifics will evolve, but the principle won't: the agent works inside a sandbox, or it doesn't work on your machine.
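To make that concrete, here's a rough bubblewrap invocation, built as a dry run so you can inspect it before executing anything. The mount list is a minimal sketch under stated assumptions (your distro's paths will differ, and `claude` as the agent binary is a placeholder), not a hardened profile:

```shell
# Build (and print) a bwrap command that gives the agent a read-only OS,
# an isolated /tmp, fresh namespaces for everything except the network,
# and write access to the project directory only.
sandbox_cmd() {
  project="$1"
  printf '%s ' \
    bwrap \
    --ro-bind /usr /usr \
    --ro-bind /etc/resolv.conf /etc/resolv.conf \
    --proc /proc \
    --dev /dev \
    --tmpfs /tmp \
    --unshare-all \
    --share-net \
    --bind "$project" "$project" \
    --chdir "$project" \
    claude
  echo
}

sandbox_cmd "$HOME/myproject"   # inspect the output, then run it for real
```

Notice what's missing: no bind for `$HOME`, no SSH keys, no credentials. The agent sees the project and nothing else.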
While you're thinking about security: never trust an unofficial MCP server. MCP servers give your agent access to external tools — JIRA, GitHub, databases, Slack — and that means they can read your data and execute actions on your behalf. This is an underestimated threat vector. If you want JIRA integration, use the official Atlassian MCP server, not sussy-jira-mcp from some random GitHub repo. Stick to official, first-party servers from the service provider, or build your own. Anything else is handing your credentials and your codebase to a stranger.
And when you're done: close your sessions. /exit your CLI when you finish, especially on sessions with elevated permissions or on your servers. An idle authenticated session is a threat vector sitting open. Don't leave your skeleton key on the table.
This is the level where most people plateau, and it's the level where the gap between 'using AI' and 'using AI well' starts to matter.
The good news: you can ship real features from here. Multi-file changes. Refactors. Bug fixes. Things that would take you an hour take fifteen minutes.
The bad news: accuracy is inconsistent. The agent hallucinates APIs that don't exist. It misses requirements you thought were obvious. It 'fixes' a bug by introducing a different bug. You spend significant time reviewing and correcting output, and some sessions feel like you would have been faster doing it yourself.
Sound familiar? Here's what's actually happening: you've given the agent capability without discipline. It can do things, but it has no memory between sessions, no verification that it did the right thing, no guardrails when it drifts, and no way to learn from its mistakes. Every session starts from scratch. Every task is hope-based.
The bridge to Level 3 is learning to give your agent persistent context — and every major tool has a mechanism for this, even if they call it different things:
Memory files are the most important piece. In Claude Code, you have CLAUDE.md — a markdown file at your project root that gets loaded at the start of every session. It's your briefing document: architecture, conventions, build commands, things not to touch. Cursor has .cursor/rules/ with .mdc rule files. Windsurf has .windsurfrules (or .windsurf/rules/). Copilot has .github/copilot-instructions.md. The emerging cross-tool standard is AGENTS.md. The format varies; the function is identical — persistent instructions that survive session resets.
Auto-memory takes this further. Claude Code now maintains a MEMORY.md file that the agent writes to itself — build commands it discovered, debugging patterns, architecture notes, your code style preferences. It learns from your corrections without you manually updating anything. Windsurf has a similar memory feature built into Cascade. Cursor doesn't have native auto-memory yet but you can approximate it with agent-requested rules.
Skills and commands let you package repeatable workflows. Claude Code supports custom slash commands (.claude/commands/) and skills — folders containing instructions, scripts, and resources that the agent loads dynamically for specific task types. Cursor's .mdc rules can be scoped to trigger on specific file globs. These turn your agent from a generalist into a specialist for your specific project.
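As a concrete sketch: a Claude Code slash command is just a markdown file under .claude/commands/. The command name and prompt body below are invented for illustration; $ARGUMENTS is Claude Code's placeholder for whatever you type after the command.

```shell
# Create a custom /postmortem command for Claude Code. The prompt body
# is an illustrative assumption; write one that fits your project.
mkdir -p .claude/commands
cat > .claude/commands/postmortem.md <<'EOF'
Write a blameless post-mortem for the incident described in $ARGUMENTS.
Structure it as: timeline, root cause, contributing factors, action items.
Read docs/adr/ first and match its tone and formatting.
EOF
```

Typing `/postmortem` followed by a one-line incident summary then expands this file with your text substituted in, so the workflow is repeatable instead of re-prompted from memory every time.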
The pattern across all of these: you're externalizing knowledge that used to live only in your head into files the agent can read. That's the entire Level 2→3 transition in one sentence.
Level 3: The Disciplined Single Agent
The agent remembers between sessions — not because it has memory, but because you've built the scaffolding. Your CLAUDE.md (or equivalent) gives it architecture and conventions. Auto-memory captures what it learns. Structured checkpoints carry session state forward. Trust-but-verify gates catch drift before it ships. Metrics tell you whether things are getting better or worse.
Keep these files lean. A CLAUDE.md under 200 lines gets followed more consistently than a 500-line novel — the agent treats everything as context, and bloated context means diluted attention. Front-load the important stuff. Prune regularly. Stale instructions are worse than no instructions because the agent will follow outdated conventions confidently.
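For shape, a lean briefing file might look like this. Every project detail here is an invented placeholder, not a recommendation for your stack:

```shell
# Write a deliberately small CLAUDE.md. All specifics below are
# invented examples; replace them with your own project's facts.
cat > CLAUDE.md <<'EOF'
# Project briefing

## Build & test
- Build: `make build` / Test: `make test` (run tests before claiming done)

## Conventions
- TypeScript strict mode; no `any`.
- All DB access goes through src/db/; never raw SQL in handlers.

## Do not touch
- migrations/ is append-only. Never edit an existing migration.
EOF

wc -l < CLAUDE.md   # keep this number small; prune on every review
```

Short declarative bullets, concrete commands, explicit prohibitions. Anything the agent can discover on its own in two seconds doesn't belong here.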
This is where a single agent becomes genuinely reliable. Not perfect — probabilistic systems are never perfect. But reliable enough that you trust it to do real work without hovering.
Most solo developers would be extremely well-served stopping here. A disciplined single agent with clean context management, tight scoping, and proper verification will outperform a sloppy multi-agent setup every time. This is the foundation. Everything above it is leverage on top of a system that already works.
Level 4: Orchestrated Delegation
Your agent can spin up sub-agents. Each one gets a dedicated task with scoped context — only the files and instructions it needs, nothing more. You're decomposing work into pieces that can be handled independently, managing partial failures gracefully, and merging results back together.
You might run a second agent on a git worktree for a parallel task. You can walk and chew gum — but you need to make sure your shoes are tied. The orchestration overhead is real, and every problem from Levels 1–3 gets amplified when there's more than one agent in play.
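The mechanics of that parallel worktree are plain git. Here's a throwaway demonstration (branch and directory names are invented); in practice you'd run the worktree add from your real repo and point the second agent at the new path:

```shell
# Create a second working tree so a parallel agent can edit in isolation
# while sharing the same history. All names are illustrative.
git init -q demo && cd demo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "init"
git branch feature/auth
git worktree add ../demo-auth feature/auth   # second checkout, shared .git
git worktree list                            # both trees, one repository
# launch your second agent in ../demo-auth, sandboxed like the first
```

Each agent gets its own checkout and its own branch, so their edits can't collide; the merge happens where it belongs, in git, under your review.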
Level 5: Parallel Agentic Orchestration
Two to four agents running simultaneously across projects or worktrees. The orchestrator dispatches, reviews, merges, and recovers from failures. Your throughput exceeds your rate of ideas — you're shipping faster than you can define new work.
This is roughly the ceiling for a single Claude Max subscription. Take the win, or start thinking about whether this throughput could be a business.
Level 6: Gastown
Named for the reality: you're burning fuel at scale. Ten or more parallel agents. Multiple subscriptions or API-level access. This only makes sense when you have the money, the backlog depth, and the ideas to keep the fleet fed. If any of those three are missing, you're paying for idle capacity.
Level X: The Semi-Autonomous Pipeline
This can be achieved at any point from Level 4 onward, but it represents a qualitative shift: server-side orchestration, a control panel, automated sprint execution. The human's job is product management and quality auditing — defining what to build and verifying that it was built correctly. The pipeline handles the rest.
This is also where you outgrow the public offerings. You're building custom tooling — your own MCP servers to give agents access to internal databases, project management systems, or deployment pipelines. You're solving real-world problems that the off-the-shelf tools weren't designed for, and the generic integrations aren't good enough. You want to control data flows through your own infrastructure, route sensitive work through on-prem servers instead of third-party APIs, and build use-case-specific agent behaviors that no subscription tier provides. At this level, the agent ecosystem is a platform you build on, not a product you subscribe to.
This is where I operate. It took 200 sprints to get here, and every lesson in this series is something I learned the hard way along the road.
Where This Series Lives
Most of the posts in this series target Levels 2 through 4 — the progression from 'using agents' to 'using agents well' to 'scaling agents reliably.' That's where the highest-leverage lessons are, and that's where most people get stuck.
Here's the map:
| Post | Core Lesson | Level |
|---|---|---|
| 1. The Agentic Maturity Model (Prologue) | The path is staged — map your level before you climb | 1–2 |
| 2. What I Wish I'd Known Before I Started | Privacy, pricing, and provider discipline before your first dollar | 0–1 |
| 3. Stop Waiting for Auto-Compaction | Context degrades before you notice | 2–3 |
| 4. Trust But Verify | Programmatic gates, not good vibes | 2–3 |
| 5. Don't Let the Fox Guard the Henhouse | Agents route around your checks | 3 |
| 6. An AI Is Its Own Best Enemy | Adversarial review works because the model has no ego | 3–4 |
| 7. Think Like a Lawyer | Tight specs prevent shortcut-taking | 2–3 |
| 8. Context is Gold | Every token is either signal or noise | 3 |
| 9. Build to Scale, Not Scale to Build | Master the basics before you multiply them | 4–5 |
| 10. Metrics and the Slow Killer | You can't improve what you can't measure | 3–4 |
| 11. The Ground Moves Under You | Model updates will break your pipeline | 3–5 |
| 12. When the AI Drives You Nuts, Take the Wheel | Sometimes the human is faster | 2–4 |
| 13. The Million-Token Baby | Bigger context windows expose pipeline design problems, not fix them | 3–5 |
| 14. Your Marathon Session Is Running on Borrowed Time | Don't couple architecture to a pricing model | 4–5 |
Find your level. Start there. The lessons build on each other, but each one stands alone.
Next up: Stop Waiting for Auto-Compaction — Your agent already forgot what it was doing.