Software Engineering, 8 min read

Testing the hype -- real work tells a different story

Name: Billion Bucks
Author: Joe Poehlein

Four local models, eight hosted models, two modest coding tasks, and the same uncomfortable result: useful drafts, but no autonomous worker.

Joe Poehlein, May 23, 2026

The API bill was thirty-one cents. Paying the babysitter was the problem.

I ran four local open-weight models and eight hosted open-weight models through two small coding tasks that looked like the kind of work a coding agent should be able to handle: one realistic engineering change and one mechanical cleanup. The tasks were not toy prompts, but they also were not the hardest kind of software work. They had clear instructions, existing nearby patterns, and a narrow success condition.

None of the models produced a result I would have accepted as autonomous work.

The Test

The first task was a realistic route-loading change: split a few eagerly loaded pages into lazy route/page chunks to reduce an oversized JavaScript bundle, while preserving the behavior of the affected pages.

The second task was a mechanical JSON parsing cleanup: a helper had been swallowing malformed JSON and returning an empty result. The requested behavior was to let malformed JSON fail loudly, while still returning an empty result for valid JSON that simply lacked optional context.
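
The requested behavior fits in a few lines. The helper itself is not shown in this write-up, so what follows is only a minimal Python sketch of the contract. The function name `parse_optional_context` and the `"context"` key are hypothetical, chosen for illustration.

```python
import json


def parse_optional_context(raw: str) -> dict:
    """Return the optional "context" object from a JSON payload.

    Hypothetical helper: the name and the "context" key are illustrative
    and are not taken from the codebase used in the test.
    """
    # Malformed JSON is deliberately not caught here. json.JSONDecodeError
    # propagates to the caller, which is the "fail loudly" half of the task.
    payload = json.loads(raw)

    # Valid JSON that simply lacks the optional field yields an empty result.
    if not isinstance(payload, dict):
        return {}
    context = payload.get("context")
    return context if isinstance(context, dict) else {}
```

The behavior being removed is the version that wraps the parse in a catch-all handler and returns an empty result for every error, which makes a corrupt payload look the same as a payload with no context.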

Each model got the same shape of work: understand the task, propose a plan, produce a patch, then perform one self-review/fix pass. The patches were not applied automatically. A stronger parent reviewer inspected the output for correctness, patch usability, and whether the model had actually satisfied the task.
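
As a rough illustration of that shape, here is a minimal Python sketch of one attempt. It is not the harness used for the run: the phase names mirror the four steps listed above, and `run_attempt` and `echo_model` are hypothetical stand-ins.

```python
from typing import Callable, Dict, List

# Assumption: one call per step, named after the four steps described above.
# The real harness's phase names and prompts are not published here.
PHASES: List[str] = ["understand", "plan", "patch", "self_review_fix"]


def run_attempt(model: Callable[[str, str], str], task: str) -> Dict[str, str]:
    """Run one model through every phase and collect the raw outputs.

    `model` is a stand-in callable taking (phase, prompt) and returning text.
    Nothing is applied to a working tree; outputs are only collected so a
    reviewer can inspect them afterward.
    """
    outputs: Dict[str, str] = {}
    prompt = task
    for phase in PHASES:
        outputs[phase] = model(phase, prompt)
        # Each later phase sees the task plus everything produced so far.
        prompt = f"{prompt}\n\n[{phase}]\n{outputs[phase]}"
    return outputs


def echo_model(phase: str, prompt: str) -> str:
    """Toy model used only to show the call pattern."""
    return f"{phase} output"


attempt = run_attempt(echo_model, "Make malformed JSON fail loudly.")
```

The point of keeping the outputs in a plain dictionary is that the reviewer, not the worker, decides what happens to the patch.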

The Judge

The judge for the run was GPT 5.5 High Fast lane. It acted as the parent reviewer: reading each attempt, checking whether the patch matched the task, and deciding whether the output was usable as completed work or only as material for a stronger reviewer to salvage.

The Local Run

The first leg used local open-weight models running through LM Studio on a Windows 11 workstation, accessed from WSL2. The hardware was an RTX 4090, an Intel i9-14900K, and 96 GB of RAM.

qwen/qwen3.6-27b
google/gemma-4-31b
zai-org/glm-4.7-flash
qwen/qwen3.6-35b-a3b

The local leg mattered because it tested the dream version of the economics: own the runtime, pay no per-token API bill, and use local inference for bounded software work. The results fell short of that dream.

Model	Route-loading task	JSON cleanup task	Throughput signal	Short read
`qwen/qwen3.6-27b`	Fail	Pass with fixups	~43 tok/s	Partial framework knowledge, unsafe placeholder repair, and patch grammar failure.
`google/gemma-4-31b`	Fail	Pass	~10-15 tok/s	Best route-pattern recognition locally, but still missed the route-loading acceptance bar.
`zai-org/glm-4.7-flash`	Fail	Pass with fixups	~110-115 tok/s	Extremely fast, but patch output was unreliable and the fix round emitted invalid code.
`qwen/qwen3.6-35b-a3b`	Fail	Pass	~42-52 tok/s	The mixture-of-experts shape did not materially change the worker-lane verdict.

The pattern was consistent: local models could be useful patch assistants, especially for tightly specified mechanical edits, but each failed the realistic route-loading task. The failures were not just speed problems. They were judgment and execution problems: unsafe placeholders, incomplete acceptance, invalid patch output, or fixups that required a stronger parent to catch.

The Hosted Run

The hosted models were:

qwen/qwen3.7-max
qwen/qwen3-coder-next
z-ai/glm-5.1
mistralai/devstral-2512
moonshotai/kimi-k2-thinking
deepseek/deepseek-v4-pro
deepseek/deepseek-v4-flash
tencent/hy3-preview

Across the full run, the harness made 64 API calls with zero provider errors. The models consumed 108,811 input tokens and produced 87,618 output tokens. Estimated spend was $0.3164.
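
Those totals are easier to read per call. The short Python calculation below uses only the figures quoted above. The reading of four calls per model per task as one call per workflow step is an inference from the arithmetic, not something the harness logs confirm here.

```python
HOSTED_MODELS = 8
TASKS = 2
API_CALLS = 64
INPUT_TOKENS = 108_811
OUTPUT_TOKENS = 87_618
ESTIMATED_SPEND_USD = 0.3164

# 64 calls spread over 8 models and 2 tasks.
calls_per_attempt = API_CALLS / (HOSTED_MODELS * TASKS)

# Simple averages; individual calls will have varied widely.
avg_input_per_call = INPUT_TOKENS / API_CALLS
avg_output_per_call = OUTPUT_TOKENS / API_CALLS
avg_spend_per_call = ESTIMATED_SPEND_USD / API_CALLS

print(f"calls per model per task: {calls_per_attempt:.0f}")    # 4
print(f"avg input tokens per call: {avg_input_per_call:.0f}")   # 1700
print(f"avg output tokens per call: {avg_output_per_call:.0f}") # 1369
print(f"avg spend per call: ${avg_spend_per_call:.4f}")         # $0.0049
```

At roughly half a cent per call, the inference side of the experiment is close to free.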

The Hosted Scorecard

Model	Route-loading task	JSON cleanup task	Estimated spend	Short read
`qwen/qwen3.7-max`	Fail	Pass with fixups	$0.1804	Verbose and often directionally useful, but patch transport and verification were not good enough.
`qwen/qwen3-coder-next`	Fail	Pass with fixups	$0.0044	Cheap and fast, but invented file shapes and emitted unusable patch structure.
`z-ai/glm-5.1`	Fail	Pass with fixups	$0.0349	Useful on the mechanical task, but produced empty or non-actionable output for the realistic task.
`mistralai/devstral-2512`	Fail	Partial with fixups	$0.0136	Fast and inexpensive, but accepted an insufficient route-loading approach and weakened error behavior.
`moonshotai/kimi-k2-thinking`	Fail	Fail	$0.0601	Showed some self-correction on the realistic task, then failed to deliver usable patches.
`deepseek/deepseek-v4-pro`	Fail	Fail	$0.0156	Successful API calls, but empty assistant content in critical patch/fix phases.
`deepseek/deepseek-v4-flash`	Fail	Pass with fixups	$0.0031	The strongest late mechanical result, but the realistic patch used an unverified route pattern.
`tencent/hy3-preview`	Fail	Fail	$0.0042	Successful API calls with empty assistant content in key phases.
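
As a sanity check on the scorecard, the per-model spend column can be summed against the reported total. The Python sketch below uses `Decimal` to avoid floating-point noise. Treating the small gap as rounding in the per-model figures is an assumption.

```python
from decimal import Decimal

# Per-model estimated spend, copied from the scorecard above.
SPEND_USD = {
    "qwen/qwen3.7-max": Decimal("0.1804"),
    "qwen/qwen3-coder-next": Decimal("0.0044"),
    "z-ai/glm-5.1": Decimal("0.0349"),
    "mistralai/devstral-2512": Decimal("0.0136"),
    "moonshotai/kimi-k2-thinking": Decimal("0.0601"),
    "deepseek/deepseek-v4-pro": Decimal("0.0156"),
    "deepseek/deepseek-v4-flash": Decimal("0.0031"),
    "tencent/hy3-preview": Decimal("0.0042"),
}
REPORTED_TOTAL = Decimal("0.3164")

total = sum(SPEND_USD.values(), Decimal("0"))
gap = REPORTED_TOTAL - total

# Which model dominated the bill, and by how much.
top_model, top_spend = max(SPEND_USD.items(), key=lambda item: item[1])
top_share = top_spend / REPORTED_TOTAL

print(f"sum of rows: ${total}")   # $0.3163
print(f"gap vs reported: ${gap}") # $0.0001
print(f"largest line item: {top_model} at {top_share:.0%} of spend")
```

The rows sum to within a hundredth of a cent of the reported total, and a single model accounts for more than half of the bill.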

The headline is not that every output was useless. Several were useful as raw material. The headline is that none of the local or hosted candidates met the bar for unsupervised completion.

What Failed

The failures clustered around four themes.

First, patch transport was fragile. Some models described the right change but wrapped it in fenced pseudo-patches, malformed diffs, JSON-shaped instructions, or other formats that could not be applied directly.
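
One way to catch this class of failure early is a shape check before any review time is spent. The Python sketch below is an illustration, not the check used in this run: it strips one surrounding Markdown fence, then verifies that the text looks like a single-file unified diff whose hunk line counts match their headers. The function names are hypothetical.

```python
import re
from typing import List

# Hunk header such as "@@ -1,2 +1,3 @@"; a missing count defaults to 1.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


def strip_code_fence(text: str) -> List[str]:
    """Split into lines, dropping one surrounding Markdown fence if present."""
    lines = text.strip("\n").splitlines()
    if len(lines) >= 2 and lines[0].startswith("```") and lines[-1].strip() == "```":
        lines = lines[1:-1]
    return lines


def diff_problems(text: str) -> List[str]:
    """Return reasons the text is not a well-formed single-file unified diff."""
    lines = strip_code_fence(text)
    if len(lines) < 3 or not lines[0].startswith("--- ") or not lines[1].startswith("+++ "):
        return ["missing ---/+++ file headers"]
    problems: List[str] = []
    i = 2
    while i < len(lines):
        match = HUNK_HEADER.match(lines[i])
        if not match:
            problems.append(f"line {i + 1}: expected a hunk header")
            break
        old_left = int(match.group(1) or 1)
        new_left = int(match.group(2) or 1)
        i += 1
        while i < len(lines) and (old_left > 0 or new_left > 0):
            tag = lines[i][:1]
            if tag in (" ", ""):   # context line (or a blank one with its space stripped)
                old_left -= 1
                new_left -= 1
            elif tag == "-":       # removed line
                old_left -= 1
            elif tag == "+":       # added line
                new_left -= 1
            elif tag != "\\":      # "\ No newline at end of file" is allowed
                problems.append(f"line {i + 1}: unexpected content inside hunk")
                return problems
            i += 1
        # Skip a trailing "\ No newline at end of file" marker.
        while i < len(lines) and lines[i].startswith("\\"):
            i += 1
        if old_left != 0 or new_left != 0:
            problems.append("hunk line counts do not match the hunk header")
            break
    return problems
```

This only validates the shape of the output. Whether the patch applies to the real tree is a separate question, which a dry run such as `git apply --check` answers more reliably.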

Second, framework pattern matching was shaky. The route-loading task depended on recognizing the project's routing conventions and choosing the right lazy-loading mechanism. Several models reached for plausible React patterns that were not the right integration point for the code in front of them.

Third, self-review was too trusting. Models often approved their own output after checking for the easy part of the task, while missing the actual acceptance condition. A patch that moves code around is not the same thing as proving the oversized bundle warning is resolved.
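
The difference between "the code moved" and "the warning is gone" can be made explicit as an acceptance gate. The Python sketch below is a hedged illustration: the size limit, chunk names, and byte counts are made up for the example, since the bundler and its threshold are not specified in this write-up.

```python
from typing import Dict, List

# Hypothetical limit: neither the bundler nor its warning threshold is
# named in the article.
CHUNK_LIMIT_BYTES = 500_000


def oversized_chunks(chunk_sizes: Dict[str, int], limit: int = CHUNK_LIMIT_BYTES) -> List[str]:
    """Return the names of chunks that still exceed the size limit, sorted."""
    return sorted(name for name, size in chunk_sizes.items() if size > limit)


def accept(chunk_sizes_after_build: Dict[str, int], tests_passed: bool) -> bool:
    """Accept only if behavior is preserved AND no chunk is oversized."""
    return tests_passed and not oversized_chunks(chunk_sizes_after_build)


# Illustrative numbers only, not measurements from the run.
before = {"main.js": 910_000}
moved_but_still_eager = {"main.js": 905_000}
actually_split = {"main.js": 420_000, "reports.js": 260_000, "settings.js": 230_000}
```

Under this gate, a refactor that leaves `main.js` above the limit is rejected even if every test passes, which is the check the self-review passes tended to skip.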

Fourth, a few runs produced empty assistant content despite successful API calls and nonzero token usage. From an orchestration perspective, that is a hard failure mode: the provider was reliable, but the worker still did not return an actionable artifact.
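
An orchestrator can treat this as its own failure class instead of a success. The Python sketch below assumes a response dictionary shaped like common chat-completion APIs (`choices`, `message`, `content`, `usage`, `completion_tokens`). That shape and both function names are assumptions for illustration, not the harness's actual types.

```python
from typing import Any, Dict, Optional


def extract_artifact(response: Dict[str, Any]) -> Optional[str]:
    """Return the assistant text, or None when there is nothing to act on.

    Assumed response shape: {"choices": [{"message": {"content": ...}}],
    "usage": {"completion_tokens": ...}}.
    """
    choices = response.get("choices") or []
    if not choices:
        return None
    content = (choices[0].get("message") or {}).get("content")
    if not isinstance(content, str) or not content.strip():
        return None
    return content


def classify(response: Dict[str, Any]) -> str:
    """Label a response as "ok", "empty_but_billed", or "empty"."""
    if extract_artifact(response) is not None:
        return "ok"
    billed = (response.get("usage") or {}).get("completion_tokens") or 0
    return "empty_but_billed" if billed > 0 else "empty"
```

Separating `empty_but_billed` from a transport error matters because retry logic keyed on HTTP status will never see it.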

The Cheap Part Was Real

The economics were genuinely impressive. The local models had the obvious appeal of no per-token hosted bill. The hosted run then added 64 API calls, no provider errors, and under thirty-two cents of estimated spend. That is a strong result for inference access. It means these models are cheap enough to use experimentally, to ask for second opinions, or to generate candidate diffs without worrying about a large bill.

That part of the hype is not imaginary. Hosted open-weight APIs can be reliable, fast enough for workflow use, and extremely inexpensive.

The Expensive Part Moved

But cheap tokens are not the same thing as cheap labor.

The expensive part moved from inference to supervision. A stronger parent still had to read the patch, normalize the format, reject invented patterns, check the real code path, apply the change manually when it was salvageable, run verification, and decide whether the result actually satisfied the task.

That is not a small footnote. If every model attempt requires a stronger reviewer to perform the real engineering judgment afterward, the workflow may still be useful, but it is not autonomous labor. It is assisted drafting.

What This Does Not Prove

This was a narrow spike, not a universal benchmark. It does not prove that open-weight models are bad at coding. It does not prove that these models cannot succeed with better prompts, more examples, tighter patch schemas, tool access, or a different harness.

It does show something practical: on two modest software tasks, using a parent-reviewed worker setup, the local and hosted open-weight candidates did not reliably cross the line from helpful suggestion to completed work.

The Practical Takeaway

I would still use these models. I just would not assign them ownership of this kind of work yet.

The best current use is as cheap parallel drafting: ask for plans, ask for candidate diffs, ask for alternate explanations, ask for a second reviewer on a narrow function. Then have a stronger parent system or human engineer review, normalize, apply, and verify the result.
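
That workflow can be sketched as a fan-out followed by a review gate. The Python below is a minimal illustration with stand-in callables. `collect_drafts`, `review`, and the `Drafter` type are hypothetical names, and a real setup would call model APIs where the stubs are.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# A drafter takes a task description and returns candidate text.
Drafter = Callable[[str], str]


def collect_drafts(task: str, drafters: Dict[str, Drafter]) -> Dict[str, str]:
    """Ask every drafter for a candidate in parallel; apply nothing."""
    with ThreadPoolExecutor(max_workers=max(1, len(drafters))) as pool:
        futures = {name: pool.submit(fn, task) for name, fn in drafters.items()}
        return {name: future.result() for name, future in futures.items()}


def review(drafts: Dict[str, str], is_acceptable: Callable[[str], bool]) -> List[str]:
    """Return the names of drafts the stronger reviewer accepts, sorted."""
    return sorted(name for name, draft in drafts.items() if is_acceptable(draft))
```

The drafters are cheap and interchangeable in this design. The `is_acceptable` callable is where the expensive judgment lives, whether that is a stronger model or a human engineer.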

That can be valuable. It is just a different value proposition than autonomous software labor.

The lesson from this run is simple: hosted open-weight tokens can be very cheap. The work is only cheap if the output survives contact with the codebase without needing a more capable engineer to redo the hard parts.