Three Lessons from Building an AI Coding Toolkit
Nov 26, 2025
A couple weeks ago I wrote about an AI Coding Template I was working on. I’ve continued to tinker with it, and while it’s still very much an experiment, I’ve learned some things along the way that I think are worth sharing.
You can find the repo here: AI Toolkit (yes, I renamed it, and I’ll probably rename it again)
You can find the toy app I’ve been using as a testing ground here: So Quotable
The Quick Overview
The toolkit is a Claude Code plugin built around spec-driven development. The basic flow:
- Project Brief - High-level overview of what you’re building (the elevator pitch).
- ADRs (Architecture Decision Records) - Technical decisions that stay static once made (“we’re using Postgres,” “deploying to Vercel”).
- Specs - Concrete plans for a body of work, similar to Epics, but with an emphasis on objective, measurable requirements rather than subjective, vague ones. These are living documents that adapt as you implement.
- Tasks - Individual pieces of work broken out from specs, each with a plan broken into phases.
- Implementation - Execute phases with TDD workflow, code reviews, and worklogs.
Everything lives in markdown files in your repo. Claude reads them, follows the workflows, and updates them as work progresses. The README has the full details, but that’s the gist.
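To make that a bit more concrete, here's roughly how those artifacts sit in a repo. The paths and filenames below are illustrative, not the toolkit's literal defaults:

```
docs/
  project-brief.md        # the elevator pitch
  adrs/
    0001-use-postgres.md  # one decision per file, static once accepted
  specs/
    user-auth.md          # living spec for a body of work
  tasks/
    user-auth/
      01-login-form.md    # individual task with a phased plan
WORKLOG.md                # running log of what each subagent did
```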
Now, onto what I’ve actually learned building this thing.
Lesson 1: LLMs Need Objective Acceptance Criteria, Not Subjective Ones
This was the big revelation from my vibe coding disaster. When I gave Claude vague instructions like “make the UI look better” or “improve performance,” I got wildly unpredictable results. Sometimes it would rewrite 90% of the codebase to “fix” something I never asked it to fix.
The solution was to write specs that an LLM can confidently evaluate as done or not done. Not “the login flow should be intuitive” but “the login form validates email format before submission and displays an error message within 200ms of invalid input.” Things that are measurable. Things where Claude can say “yes, this criterion has been met” without any ambiguity.
I’ve designed the toolkit around this idea. Specs follow a template that forces you to write concrete requirements with explicit acceptance criteria. It’s more work upfront, but it prevents the AI from going off on tangents or declaring victory prematurely. And Claude can assist you with this through a conversational workflow.
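As a sketch of what that looks like inside a spec (my own illustrative wording, not the toolkit's actual template):

```markdown
## Acceptance Criteria

- [ ] The login form rejects submissions with a malformed email address before calling the API
- [ ] An inline error message renders within 200ms of invalid input
- [ ] A rejected submission does not create a session record
- [ ] Each criterion above is covered by an automated test that passes in CI
```

Every item is a pass/fail check, so Claude can mark it done (or not done) without interpreting anyone's taste.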
Lesson 2: Have the AI Review Its Own Work (But Not in the Way You’d Expect)
One of the weirder things I’ve found is that Claude is pretty bad at catching its own mistakes in the moment, but pretty good at catching them when wearing a different hat.
So the toolkit uses multiple specialized subagents. One writes the code, another reviews it. The code-reviewer subagent grades each phase, and Claude isn’t supposed to move on until it gets a passing score. It sounds silly (it’s still Claude talking to Claude, so why would it catch a bug it didn’t catch while writing the code?), but it genuinely catches bugs. Something about the context switch forces it to look at the code fresh.
Combined with a strict TDD workflow (write the test first, write minimal code to pass, confirm tests pass, then commit), this produces surprisingly stable code. Not perfect, but way better than letting it just churn through features unchecked.
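Inside a task's plan, a phase ends up reading something like this (an illustrative sketch, not the toolkit's exact template):

```markdown
### Phase 2: Email validation

1. Write a failing test for rejecting malformed email addresses
2. Write the minimal code needed to make that test pass
3. Run the full suite and confirm everything is green
4. Hand the diff to the code-reviewer subagent; proceed only on a passing grade
5. Commit and add an entry to WORKLOG.md
```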
Lesson 3: Context Decay Is Real, and You Have to Fight It Explicitly
This is maybe the most annoying problem. Claude forgets things. Not just across conversations, but within long conversations, especially after compacting. I’d set up all these workflows and conventions, and then halfway through implementing a feature, Claude would just… stop following them. It wouldn’t run the code reviews. It wouldn’t update the worklog. It would start making feature branches off main instead of develop.
My solution was twofold. First, a WORKLOG.md file that Claude writes to after each subagent completes its assigned work. The next subagent reads the recent entries before starting work. Think of it like Jira comments, except in a structured format that the AI can parse (and the AI actually bothers to write them).
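An entry ends up looking roughly like this (a hypothetical example; the real template may differ):

```markdown
## 2025-11-25 | implementer | task: login-form, phase 2
- Added client-side email format validation to the login form
- Tests: 3 added, all passing
- Review: passing grade (minor note on error message copy)
- Next: phase 3, rate limiting on failed attempts
```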
Second, a /refresh command that tells Claude to re-read all the workflows and conventions. Claude should be reading them automatically as part of the slash commands, but doesn’t always seem to do so consistently. I run it regularly, especially after conversation compacting. It hasn’t eliminated the problem, but it’s reduced it significantly.
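The command itself is nothing fancy. Claude Code slash commands are just markdown prompts, and mine boils down to something like this (paraphrased, with illustrative file paths):

```markdown
<!-- commands/refresh.md (paraphrased) -->
Before continuing, re-read:

- the workflow files (branching strategy, TDD phases, review process)
- the project conventions
- the most recent entries in WORKLOG.md

Then summarize the rules you will follow for the rest of this session.
```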
Why Workflows Live in Files, Not Code
One design decision worth mentioning: the slash commands in the toolkit are intentionally minimal. They mostly just tell Claude to read a workflow file and follow it. The actual instructions (the git branching strategy, the TDD phases, the review criteria) live in markdown files in your repo.
Why? Because not every team works the same way. My default workflow uses a three-branch strategy (main, develop, feature branches), strict TDD, and code reviews after every phase. But maybe you want feature branches off main directly. Maybe you don’t care about the review scores. By keeping these as editable files rather than hardcoding them into the plugin, you can customize the workflow to fit how your team actually works. I wanted “sensible defaults,” not “hard rules.”
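Concretely, changing the branching strategy means editing a few lines of prose in a workflow file rather than forking the plugin. Something like this (illustrative, not the shipped file verbatim):

```markdown
<!-- e.g. docs/workflows/git.md -->
## Branching

- Create feature branches from `develop`, never from `main`
- Merge to `develop` via PR once every phase of the task has passed review
- `main` only receives merges from `develop` at release time

## Commits

- One commit per completed phase, after tests pass and the review grade is passing
```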
The downside is that this is probably part of why Claude sometimes ignores the workflows: they’re suggestions it reads, not rules baked into the commands. I could hardcode them into the commands for more reliability, but I’d lose the flexibility. It’s a tradeoff I’m still experimenting with.
What Still Doesn’t Work Well
To be clear, this is far from solved. A few things that still frustrate me:
Configuration and infrastructure. Getting Claude to set up Convex Auth correctly took way longer than doing it myself. Same with Vercel environments. There are some things it just can’t do autonomously, mostly around generating and copying environment variables between services.
UI work. It’s hard to describe a visual in words. I’ve spent a lot of time annotating screenshots trying to get Claude to adjust specific elements, and sometimes I just end up doing it manually.
The toolkit only works with Claude Code. I originally wanted this to be LLM-agnostic, but it’s deeply reliant on Claude Code’s subagent system. Maybe someday I’ll make a more universal version.
The Bottom Line
The underlying philosophy is simple: treat AI like a very fast but very literal junior developer. Give it clear specs. Make it write tests first. Have it review its own work. Remind it of the rules frequently. It’s not autonomous AI development - it’s structured collaboration.
The toolkit is still evolving (check the commit history if you want to see the chaos), but the core lessons have held up. If you try it out, I’d love to hear what works and what doesn’t.