The Million-Token Baby
Field Notes from 200+ Semi-Autonomous Sprints — Bonus Post
Anthropic markets the 1M token context window as the greatest thing since sliced bread. They are not wrong. But the hidden costs will eat you alive if you are not paying attention.
Anthropic just made the 1M token context window free for everyone on Opus 4.6 and Sonnet 4.6 — whether you wanted it or not.
But if you are running an agentic coding pipeline, the marketing is hiding three costs that will drain your budget, burn through your rate limits, and degrade your output quality. I learned this the hard way when I had to recalibrate my pipeline's health indicators from a 200K window to a 1M window, and found the practical thresholds moved far less than the 5x window increase would suggest.
The Drag You Cannot See
Every turn in a conversation, the model re-reads the entire context. Not a summary. Not a diff. The whole thing. At turn 1, that is your system prompt plus your first message. By turn 20, that is every tool call, every response, every file read, every error log — all of it, re-ingested before the model produces a single new token.
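The mechanics are visible in any multi-turn loop against the Messages API. A minimal sketch, assuming the standard anthropic Python SDK; the model name is a placeholder:

```python
# Why the drag exists: the Messages API is stateless, so every call
# resends the whole conversation. Nothing is retained server-side.
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
messages = [{"role": "user", "content": "Start refactoring module X."}]

for _ in range(3):
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model name
        max_tokens=1024,
        messages=messages,  # the ENTIRE history goes up again every turn
    )
    # Append the reply and the next instruction; the list only ever grows.
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": "Continue."})
```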
This means your token consumption is not linear with conversation length. It is roughly quadratic: if your context starts at S tokens and grows by g tokens per turn, cumulative input after n turns is n·S + g·n(n−1)/2, and the n² term wins.
Here is what that looks like concretely. You start a session at 40K tokens (system prompt, CLAUDE.md, bootstrap context). Each turn adds roughly 30K of new content (tool calls, responses, file reads). Watch what happens to your cumulative input as the session grows:
These numbers are estimates. Your actual token consumption will vary based on system prompt size, tool usage patterns, output length, and how much context your workflow generates per turn. The shape of the curve is the point, not the exact dollar amounts.
| Turn | Context Size | Cumulative Input | Input Cost | Output Cost (Cumul.) | Session Total |
|---|---|---|---|---|---|
| 1 | 40K | 40K | $0.20 | $0.38 | $0.58 |
| 5 | 160K | 500K | $2.50 | $1.88 | $4.38 |
| 10 | 310K | 1.75M | $8.75 | $3.75 | $12.50 |
| 15 | 460K | 3.75M | $18.75 | $5.62 | $24.38 |
| 20 | 610K | 6.50M | $32.50 | $7.50 | $40.00 |
| 30 | 910K | 14.25M | $71.25 | $11.25 | $82.50 |
Assumes ~15K output tokens per turn (tool calls, code, explanations) at Opus 4.6 rates ($5/$25 per MTok). All costs are cumulative across the session.
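If you want to sanity-check the curve against your own parameters, here is a minimal sketch of the model behind the table. Every constant is an assumption from the setup above; swap in your own numbers.

```python
# Sketch of the cost model behind the table above. All parameters are
# assumptions from the text, not measurements -- adjust to your workflow.
START_CONTEXT = 40_000       # system prompt + CLAUDE.md + bootstrap
GROWTH_PER_TURN = 30_000     # new tool calls, responses, file reads
OUTPUT_PER_TURN = 15_000     # model output per turn
INPUT_RATE = 5 / 1_000_000   # $/token (Opus 4.6 input)
OUTPUT_RATE = 25 / 1_000_000 # $/token (Opus 4.6 output)

def session_cost(turns: int) -> tuple[float, float]:
    """Cumulative (input_cost, output_cost) after `turns` turns.

    Each turn re-sends the whole context, so cumulative input is
    turns * START + GROWTH * turns * (turns - 1) / 2 -- quadratic in turns.
    """
    cumulative_input = turns * START_CONTEXT + GROWTH_PER_TURN * turns * (turns - 1) // 2
    cumulative_output = turns * OUTPUT_PER_TURN
    return cumulative_input * INPUT_RATE, cumulative_output * OUTPUT_RATE

for turn in (1, 5, 10, 15, 20, 30):
    inp, out = session_cost(turn)
    print(f"turn {turn:>2}: input ${inp:6.2f}  output ${out:5.2f}  total ${inp + out:6.2f}")
```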
Look at turn 1. Output cost ($0.38) is nearly double the input cost ($0.20) — because output tokens are 5x more expensive per token. At $25/MTok, even modest output adds up fast. But as the session grows, input drag overtakes it because the entire context is re-read every turn while output stays roughly constant per turn. By turn 30, input accounts for 86% of the total bill.
That is the trap: output cost is where the per-token rate hurts, input cost is where the volume hurts. Both will get you.
Not all tokens cost the same. Here is what you are actually paying on the API:
| Token Type | Opus 4.6 Rate | What It Covers |
|---|---|---|
| Input | $5 / MTok | System prompt, conversation history, tool outputs. Re-sent in full every turn. |
| Output | $25 / MTok | Model responses, tool calls, reasoning. 5x more expensive than input. |
| Cache Write | $6.25 / MTok | First store of cacheable content (1.25x input, 5-min TTL). |
| Cache Read | $0.50 / MTok | Hits on cached prefix (0.1x input — 90% savings). |
Prompt caching softens the blow on the prefix: the static front of your context (system prompt, CLAUDE.md, project bootstrap) that does not change between turns. A cache hit on that portion saves 90%. But the conversation history that grows every turn is a far worse caching target: each turn appends content the cache has never seen, and the five-minute TTL means any pause in the session starts you over at write rates. In a long session, the hard-to-cache portion grows while the reliably cacheable portion stays fixed. The ratio gets worse every turn.
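Here is a rough per-turn estimate of how that split plays out, following the prefix/history division described above (static prefix always hits cache, growing history pays full input rate). The numbers are illustrative, using the rates from the pricing table:

```python
# Per-turn input cost with a cached prefix, per the split described above.
# Prefix size is an assumption; rates come from the pricing table.
PREFIX = 40_000                  # static front: system prompt, CLAUDE.md, bootstrap
INPUT_RATE = 5 / 1_000_000       # $/token
CACHE_READ_RATE = 0.5 / 1_000_000

def turn_input_cost(history_tokens: int) -> float:
    """One turn's input cost: cache-read the prefix, full price on history."""
    return PREFIX * CACHE_READ_RATE + history_tokens * INPUT_RATE

for history in (0, 100_000, 300_000, 600_000):
    cached = turn_input_cost(history)
    uncached = (PREFIX + history) * INPUT_RATE
    print(f"history {history:>7,}: cached ${cached:.3f} vs uncached ${uncached:.3f} "
          f"({cached / uncached:.0%} of full price)")
```

At turn 1 the cache cuts the input bill by 90%; by the time history reaches 600K, the saving has shrunk to about 6%.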
On a subscription you do not pay per token, but you have rate limits. The drag burns through your allocation proportionally faster. If you have been hitting limits more since the 1M update and your first instinct is to blame Anthropic — stop. Look inward. The 1M window did not introduce a new cost. It amplified bad token hygiene that was already there. The window got bigger. Your habits did not. That is not Anthropic's problem.
One more: if you spin up Haiku as a sub-agent and feed it context from a parent session running at 400K+, you are pouring a swimming pool into a bucket. Haiku's ceiling is smaller. It will immediately auto-compact — not because anything is broken, but because you did not right-size your context before delegation. The 1M window makes this easier to hit because the parent accumulates more before anyone notices.
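A cheap guard up front catches this before delegation. A sketch under stated assumptions: the ceiling numbers, the headroom factor, and the function are all illustrative, not part of any SDK:

```python
# Hypothetical pre-delegation check. Ceilings and headroom are assumptions;
# nothing here comes from the anthropic SDK.
CONTEXT_CEILINGS = {
    "haiku": 200_000,     # assumed sub-agent ceiling
    "sonnet": 1_000_000,
}
HEADROOM = 0.5  # leave the sub-agent half its window to actually work in

def check_delegation(model: str, parent_context_tokens: int) -> None:
    """Refuse to delegate a context the sub-agent cannot comfortably hold."""
    budget = int(CONTEXT_CEILINGS[model] * HEADROOM)
    if parent_context_tokens > budget:
        raise ValueError(
            f"{parent_context_tokens:,} tokens exceeds {model}'s working "
            f"budget of {budget:,}; summarize before delegating"
        )

check_delegation("haiku", 400_000)  # raises: the bucket is too small
```

The point is not the exact numbers. The point is that the parent session, not the sub-agent's auto-compactor, should decide what gets cut.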
Time to First Token
The drag is not just financial. It is temporal. Every turn, the model processes the full context before generating its first output token. At 200K context, that is fast. At 500K, you notice the pause. At 800K, you are waiting. At 1M, you are watching a spinner and wondering if the session froze.
For interactive agentic work — where the human is waiting for the agent to respond, or where a sub-agent is blocking on an orchestrator — this latency compounds. A pipeline that felt snappy at 150K context feels sluggish at 400K, not because the model got dumber but because every response now requires ingesting a small novel before it can start thinking.
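You can measure this yourself. A minimal sketch that times the gap between sending a request and the first streamed text, using the anthropic SDK's streaming interface; the prompt variable and model name are placeholders:

```python
# Rough TTFT measurement with the anthropic SDK's streaming interface.
import time
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
long_context_prompt = "..."  # load the actual context you want to test

start = time.monotonic()
ttft = None

with client.messages.stream(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=128,
    messages=[{"role": "user", "content": long_context_prompt}],
) as stream:
    for _ in stream.text_stream:
        ttft = time.monotonic() - start  # first streamed text has arrived
        break

print(f"time to first token: {ttft:.2f}s")
```

Run it at 150K and again at 500K and the "small novel" effect stops being a metaphor.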
Quality Still Degrades
This is the one that gets buried in the marketing. A bigger context window does not fix attention degradation. The research is clear:
- Chroma's Context Rot study shows LLM performance degrades substantially as context fills
- Adobe Research's NoLiMa benchmark found most models drop below 50% of their short-context baseline by 32K tokens
- The lost-in-the-middle problem — where instructions in the middle of a large context get less attention than those at the beginning or end — gets worse with more context, not better, because there is more middle to get lost in
Having 968K more tokens of runway does not help if the model stopped paying close attention to your constraints back at 32K. You have not bought more effective context. You have bought more runway to accumulate noise.
The Sweet Spot Did Not Move
When the 1M window shipped, I had to recalibrate the status indicators in my pipeline — the color-coded thresholds that tell me whether a session's context consumption is healthy (green), getting heavy (yellow), approaching checkpoint territory (orange), or past the point of useful work (red).
The window went from 200K to 1M. The thresholds moved, but not by the factor you would expect:
| Zone | % of 1M | Token Range | What Happens |
|---|---|---|---|
| Green | < 30% | < 300K | Normal operation. No intervention. |
| Yellow | 30-50% | 300-500K | Checkpoint nudges. ~2% quality loss per 100K. Cost per turn climbing. |
| Orange | 50-65% | 500-650K | Blocks new work. Quality and cost both degrading fast. |
| Red | > 65% | > 650K | Hard stop every turn. Checkpoint and start fresh. |
| Auto-compaction | ~83.5% | ~835K | If you hit this, you blew past three warning zones without stopping. |
The answer at red is the same as it always was: checkpoint and start fresh. Auto-compaction does not fire until about 83.5%, around 835K, which means if you have been running long enough to trigger it, you ignored every signal the pipeline gave you.
The extra 700K tokens are not usable runway for agentic work. They are a buffer for the occasional large file read or a safety margin against unexpected context growth. They are not an invitation to let sessions run five times longer.
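For reference, the whole indicator reduces to a few lines of threshold logic. The cut-offs are my calibration from the table above, not anything Anthropic publishes:

```python
# The status-indicator thresholds from the table, as code. The cut-offs
# are my calibration for a 1M window, not anything Anthropic publishes.
WINDOW = 1_000_000

def zone(context_tokens: int) -> str:
    """Map current context size to a health zone."""
    pct = context_tokens / WINDOW
    if pct < 0.30:
        return "green"    # normal operation
    if pct < 0.50:
        return "yellow"   # checkpoint nudges
    if pct < 0.65:
        return "orange"   # block new work
    return "red"          # hard stop: checkpoint and start fresh

assert zone(250_000) == "green"
assert zone(400_000) == "yellow"
assert zone(600_000) == "orange"
assert zone(835_000) == "red"  # auto-compaction territory -- far too late
```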
What the Marketing Does Not Say
Anthropic removed the >200K surcharge. That is a real, good thing — it removes a pricing cliff that punished developers for crossing an arbitrary threshold. But removing the surcharge also removes the warning signal. Before, hitting 200K meant your costs doubled and you noticed. Now, you sail past 200K, past 300K, past 500K, and nothing in the pricing tells you to stop. The costs are still there — cumulative drag, TTFT degradation, quality loss — they are just silent.
The 1M context window is a genuine capability for the right use cases. Analyzing an entire codebase in one shot. Processing a book-length document. Ingesting a full specification before answering questions about it. These are single-pass tasks where the full context is consumed once, and the bigger window is exactly what you want.
But for multi-turn agentic work — the kind this entire series is about — the practical ceiling has not changed. Manage your context. Checkpoint before you hit 50%. Start fresh sessions. The window got bigger. The agent did not get smarter at using it.
This is a companion post to the Field Notes series. Start from the beginning: The Agentic Maturity Model — Where Are You, and Where Are You Going?