Automating The Implement Loop

Jan 16, 2026

If you’ve been following along, you know I’ve been on a bit of a journey with AI-assisted coding. It started with pure vibe coding (disaster), evolved into a structured template, and eventually became the AI Toolkit with specs, phases, and code reviews. I’ve even applied it at a higher level with a Meta Repo.

It’s a pretty simple workflow, all things considered. You make a spec, which has tasks in it. You plan each of those tasks, which breaks it into concrete phases, and then you have the AI loop on implementing each phase. Most of the parts are single commands, easy enough, but the implement loop has always been the really important part.

For most phases I want TDD. So one subagent writes the tests, a second writes the code to pass the tests, and a third reviews the code. If the reviewer gives it a passing grade you repeat for the next phase; if not, it gets sent back to the developer agent to fix. All in an automated loop. At least that’s the theory. The problem is that, due to the nature of an LLM, instructions are more suggestions than hard rules - suggestions that can be ignored or forgotten, especially as the conversation goes on and the context fills up.

The Idea

But what do you do when you need to run something in a loop in a predictable way? You script it, of course. So I’ve been working on a pretty simple Python CLI to do just that. It’s mostly basic Python with a little bit of LangChain thrown in, relying on Ollama, and what it does is define some agents and then run them through a workflow in a truly looped, constrained manner.

Introducing ollama-chat (probably going to work on that name at some point).

Right now the structure is pretty simple:

  1. Architect Agent - Reads the spec, creates a PLAN.md, breaks it into phases, defines acceptance criteria for each
  2. Developer Agent - Implements each phase
  3. Reviewer Agent - Scores the code against the acceptance criteria
  4. Loop - If score < 90, Developer gets feedback and tries again, up to 5 attempts (sketched below)
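
Here’s roughly what that loop looks like. This is a minimal sketch, not the actual ollama-chat implementation - the prompts, the phase splitting, and the score parsing are simplified stand-ins:

# A minimal sketch of the Architect -> Developer -> Reviewer loop.
# Prompts, phase splitting, and score parsing are simplified stand-ins,
# not the actual ollama-chat code.
import re

from langchain_ollama import ChatOllama

architect = ChatOllama(model="qwen2.5-coder:14b")  # or Claude via headless CLI calls
developer = ChatOllama(model="qwen2.5-coder:7b")
reviewer = ChatOllama(model="qwen2.5-coder:7b")


def parse_score_and_feedback(review: str) -> tuple[int, str]:
    """Crude parser: first number in the review is the score, the whole review is feedback."""
    match = re.search(r"\d+", review)
    return (int(match.group()) if match else 0), review


spec = open("SPEC.md").read()
plan = architect.invoke(
    f"Break this spec into phases, each with concrete acceptance criteria:\n\n{spec}"
).content

for phase in plan.split("## Phase")[1:]:  # naive phase splitting
    feedback = ""
    for attempt in range(5):  # hard cap on retries per phase
        code = developer.invoke(
            f"Implement this phase:\n\n{phase}\n\nReviewer feedback so far:\n{feedback}"
        ).content
        review = reviewer.invoke(
            "Score this code 0-100 against the acceptance criteria, then explain any problems.\n\n"
            f"Phase:\n{phase}\n\nCode:\n{code}"
        ).content
        score, feedback = parse_score_and_feedback(review)
        if score >= 90:
            break  # phase passes; move on to the next one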

This isn’t a new idea. It’s basically what I was doing manually with the AI Toolkit, and what dozens of other people are doing in one form or another; I’ve just codified it into an actual program instead of a series of slash commands I run myself.

The Local Model Question

Of course the natural question is, if you’re using Ollama, can’t you use multiple, and even local, models? Use a frontier model like Opus to plan things out, then a cheap local model to actually implement it? And the answer is yes. I’ve only been working on this for the last few hours, but so far the concept is sound, at least.

First Results

I tested this on a simple, very basic vanilla JavaScript todo app. Detailed spec with exact hex colors, pixel values, the whole nine yards.

  • Architect: Claude (via Claude Code CLI with headless calls, sketched below) - creates the phased plan
  • Developer: Qwen 2.5 Coder 7B (local) - implements each phase
  • Reviewer: Qwen 2.5 Coder 7B (local) - scores against acceptance criteria
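
The headless Architect call is just a subprocess around the Claude Code CLI’s print mode. A sketch, assuming claude -p for a single non-interactive call (the exact flags and prompt wording in my script differ):

import subprocess

def claude_headless(prompt: str) -> str:
    """Run one non-interactive Claude Code call and return its stdout."""
    result = subprocess.run(
        ["claude", "-p", prompt],  # -p / --print: answer the prompt and exit
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

plan = claude_headless("Read SPEC.md and produce PLAN.md: phases with acceptance criteria for each.")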

It… mostly worked?

The app got built. It looked decent. But when I tried to add a task, nothing happened. The Developer had written a perfectly good addTask() function but never wired it up to the Add button. The Reviewer passed it because the acceptance criteria said “Add button exists” and “addTask function works,” not “clicking the Add button adds a task.”

The fix was trivial - literally just adding the event listeners:

document.getElementById('addButton').addEventListener('click', () => {
    addTask(document.getElementById('taskInput').value);
});

That’s it. The app worked perfectly after that. All the logic was there, the localStorage persistence worked, the styling was correct. It just needed someone to connect the UI to the code. A human would have caught this immediately; the automated reviewer had no way to. The run was ugly, but it worked.

Classic garbage-in-garbage-out. The workflow worked exactly as designed; my acceptance criteria just weren’t specific enough.

I tried it again with Qwen 2.5 Coder 14B and got much better results. I even tried it using Sonnet 4.5 as the Developer and Reviewer, again through headless calls, and it worked better still. So this is literally what I was trying to get my /implement Claude Code skill to reliably do, all in a few lines of Python.

Again, still very early, but you can see the results of these tests here if you care: ollama-chat-test.

Lessons So Far

Acceptance criteria need to be stupidly specific. This isn’t new - I wrote about this in my AI Toolkit post - but automation makes it even more critical. When a human is in the loop, you catch the obvious stuff. When it’s fully automated, “the button exists” and “the button works when clicked” are very different requirements.
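
To make that concrete, here’s the kind of contrast I mean, using the todo app as the example (the wording is illustrative, not pulled from an actual plan):

# Existence-level criteria - roughly what the Architect originally produced:
vague_criteria = [
    "Add button exists",
    "addTask() function works",
]

# Behaviour-level criteria - what a fully automated Reviewer actually needs:
specific_criteria = [
    "Clicking the Add button appends the input text as a new task in the list",
    "Added tasks persist in localStorage across a page reload",
]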

The Architect prompt matters a lot. My first version had the Architect abstracting away specific values. The spec said “#2563eb” but the acceptance criteria said “has a blue background.” So the Developer used green. I updated the Architect prompt to explicitly preserve literal values from the spec, and the next run got the colors right.
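
The change itself was a couple of lines in the Architect’s system prompt. The wording below is illustrative, not the exact prompt from ollama-chat:

# Illustrative version of the instruction added to the Architect's system prompt.
ARCHITECT_SYSTEM_PROMPT = """You are the Architect. Break the spec into phases,
each with concrete acceptance criteria.

When the spec gives literal values (hex colors, pixel sizes, exact strings),
copy them verbatim into the acceptance criteria. Write "background is #2563eb",
not "background is blue"."""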

Local model size vs. VRAM is a real constraint. The 7B model runs comfortably on my 4070 Ti Super (16GB VRAM) at 30-50 tokens/second. The 14B model was a different story. My first attempt showed 15.2GB VRAM usage, and checking ollama ps revealed the model was running 55% CPU / 45% GPU. What should have been “twice as slow” was actually ten times slower - the loop took nearly two hours.

After a reboot with nothing else running, the same 14B model loaded at 9.7GB and ran 100% on GPU. So you need to be certain to completely offload one model before switching to another. Something like this seems to work:

curl http://localhost:11434/api/generate -d '{"model": "current-model", "keep_alive": 0}'
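
Since the script knows which model it’s about to load, it could do this unload itself before switching roles. A small sketch using requests against the same endpoint (not something ollama-chat does yet):

# Ask Ollama to evict a model from VRAM by sending keep_alive: 0 to /api/generate.
import requests

def unload_model(name: str, host: str = "http://localhost:11434") -> None:
    """Unload `name` immediately so the next model gets the full GPU."""
    requests.post(
        f"{host}/api/generate",
        json={"model": name, "keep_alive": 0},
        timeout=30,
    )

unload_model("qwen2.5-coder:7b")  # free VRAM before loading the 14B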

This can save you a reboot. Keep in mind, all of this testing has been on a simple greenfield project. I don’t know how well the workflow would do on a large codebase, where there’s a lot more context to stuff into that 16GB, so I need to do some testing there, but so far it’s promising.

Full file rewrites are inefficient. Right now, the Developer outputs complete file contents each phase, even if it’s only adding a few lines. For a small todo app this is fine, but for larger files it wastes tokens and risks unintentional changes to already-correct code. A future version should have the Developer read existing files and output targeted edits instead.
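
One possible shape for that: have the Developer emit literal search/replace blocks and apply them verbatim. Hypothetical, not current behaviour:

# Hypothetical targeted-edit application - the Developer would emit
# (path, search, replace) blocks instead of whole files.
def apply_edit(path: str, search: str, replace: str) -> None:
    """Replace the first occurrence of `search` in `path` with `replace`."""
    with open(path) as f:
        text = f.read()
    if search not in text:
        raise ValueError(f"search block not found in {path}")
    with open(path, "w") as f:
        f.write(text.replace(search, replace, 1))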

Model Comparison

After running the same spec through multiple models, the results were clear:

(Caveat: this isn’t a scientific comparison. Each run regenerates the plan via the Architect, so phases and acceptance criteria varied between runs. The local models use Ollama’s API while the Claude tests use the Claude Code CLI in headless mode - different execution paths entirely. A true apples-to-apples test would reuse the same plan and normalize the API layer. Take these results as directional, not definitive.)

| Model | Time | Event Wiring | Accessibility | Error Handling | Code Quality |
| --- | --- | --- | --- | --- | --- |
| Qwen 2.5 Coder 7B | ~15 min | ❌ Missing | Partial | Basic | Good |
| Qwen 2.5 Coder 14B | <10 min | ✅ Correct | Good | Basic | Better |
| DeepSeek Coder V2 16B | Fast | ✅ Correct | None | Basic | Fair |
| Claude Sonnet | Fastest | ✅ Correct | Full ARIA + keyboard | Comprehensive | Excellent |

Local Models

Qwen 2.5 Coder (7B vs 14B): The 14B consistently produced more complete code than the 7B. It wired up event listeners correctly, included keyboard accessibility, and produced cleaner file separation. The catch is VRAM hygiene - on 16GB, you need to ensure clean GPU state before running the 14B. Check ollama ps and restart ollama if you see any CPU/GPU split.

DeepSeek Coder V2 16B: Fast inference (10GB VRAM, 100% GPU), and it got the event wiring correct. But the code quality was noticeably lower than Qwen 14B - closer to the 7B level. No file separation (inline styles), no ARIA attributes, uses innerHTML with template literals (XSS risk), confirm() dialogs for delete, array indices instead of unique IDs. It’s “tutorial JavaScript” - works, but you’d want to clean it up. If you need speed over polish for throwaway prototypes, it’s an option.

Sonnet: A Different League

The gap between Sonnet and the local models was bigger than I expected. Sonnet produced genuinely production-quality code:

  • Full ARIA attributes (role, aria-checked, aria-label)
  • Proper isLocalStorageAvailable() check with graceful error handling
  • Dedicated error message UI element
  • Event delegation (cleaner than per-item listeners)
  • Focus management (auto-focus input on page load)
  • Multiple responsive breakpoints (640px and 320px)
  • A TESTING.md documentation file (unprompted!)

The local models produced “it works” code. Sonnet produced “ship it” code. For prototyping and learning, the local models are fine. For anything going to production, the quality gap is hard to ignore.

I Know I’m Not The Only One

I realize I haven’t stumbled on some revolutionary idea here. There are plenty of CLI-based coding assistants out there, and more popping up every day. OpenCode, Aider, Kilo Code, they’re all doing some variation of this. But I think there’s a lot of value to be had in building your own. My solution will probably never be as good as some commercial product or really popular open source project, but what I’ll learn by building it out is invaluable.

What’s Next

This is very much a work in progress. I want to:

  • Run more complex specs to see where it breaks down
  • Add some kind of execution/testing step to catch “code exists but doesn’t work” bugs
  • Experiment with MCP integration for automated testing (Playwright, etc.)
  • Try this workflow on a real codebase, not just toy apps

Would You Like to Know More?