I Tried Using AI to Make Me an AI App
Jul 15, 2025
Well, it’s been a while since my last post. I bought a trailer, spent some time fixing it up, took it out in the woods, and now I am up in Alaska visiting family. Just spent a long weekend hiking around Denali National Park. But I’ve also been working on a little side project, and as an experiment, I decided to jump on the bandwagon and see how far I could get with just “vibe coding.” It ultimately didn’t work, but I learned a lot along the way.
The Experiment: Can I “Vibe Code” an Entirely Functional and Real App?
I started by downloading Cursor and giving it pretty high-level, vague instructions, just to see if I could get a proof of concept for my app idea up and running. It was able to do that really quickly. Too quickly, perhaps, as it gave me a false sense of security.
It worked great at first. I created tasks, kept a PRD.md file updated, and fed it environment variables, but I wrote no actual code myself. As I asked it to add more features, however, the codebase got bigger, and the problems started. It would completely break an already working part of the app or make massive UI changes I never asked for.
Where It All Fell Apart
Again, this worked great at first, but more and more problems crept in over time, including:
- The Context Window Problem: The core issue is that LLMs don’t truly “remember” anything beyond their context window. They’re great at doing one focused thing but terrible at considering the entire codebase. Left to their own devices, they will write code that breaks the rest of your app.
- The TDD Problem: The LLM’s attempts at Test-Driven Development often led to infinite loops and bizarre, cascading failures. Sometimes a change it made would break other tests, and instead of addressing the root cause, it would rewrite entire sections of code just to get those tests passing, setting off a domino effect where still more tests broke. One time I asked it to make a relatively small change, and by the time it had finished fixing the series of breaking tests it had updated over 90% of the files in the repo and the app had a completely different UI. Another time it got stuck in a feedback loop: fixing Test A broke Test B, fixing Test B broke Test A, and around it went.
- The “Terminal Glitch” and Other Quirks: Over time, Cursor had more and more trouble with any command that required the terminal. It would get hung up, try a few times, then give me a message like “I am having persistent terminal issues, can you run these commands for me and paste in the output.” Usually just reloading the window or the IDE fixed it, but it was happening so often that it became pretty unbearable.
- Verbose and Tightly Coupled Code: The LLM consistently wrote more code than was necessary to accomplish a given task, despite explicit instructions to “make a test, write the least amount of code needed to pass the test, confirm the test passes, commit.” It produced verbose, tightly coupled code, which led to the unintentional regressions.
- Bad at Infra and Configs: It had a helluva time getting my app to deploy to Vercel. It got there eventually, but it took far longer than it would have taken me to just configure things manually. Same issue with the Docker containers and the local environment.
- Wildly Overconfident: You’ll tell it that something is broken, it will write some code, then very confidently claim that your issue is completely solved and ready for production! You then test it yourself and find that it is, in fact, not solved. Even with a TDD style of troubleshooting, it usually takes half a dozen or so iterations of the LLM telling you the problem has been solved before it actually is.
- Bad at End-to-End Logic: It struggles to trace a user’s journey through multiple parts of the app and make sure everything works and flows correctly. The LLM is very smart but very narrowly focused at any given moment.
Basically, I hit the wall of pure “vibe coding” pretty quickly and had to take ever-increasing control to try and work around that.
TDD: When the codebase first started getting messy, my first thought was to enforce discipline through Test-Driven Development. I gave the LLM a strict, three-step directive for every feature: write a failing test, write the minimum code to pass, and confirm all tests are green. This was a partial success, acting as a safety net but also leading to the bizarre feedback loops I mentioned.
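To make that directive concrete, here’s a minimal sketch of the red-green cycle I was asking for. The test runner (Vitest) and the slugify helper are purely illustrative, not code from the actual app:

```typescript
// Step 1: write a failing test first (slugify.test.ts)
import { describe, expect, it } from "vitest";
import { slugify } from "./slugify";

describe("slugify", () => {
  it("lowercases text and replaces spaces with hyphens", () => {
    expect(slugify("Hello World")).toBe("hello-world");
  });
});

// Step 2: write the least amount of code needed to pass (slugify.ts)
export function slugify(input: string): string {
  return input.trim().toLowerCase().replace(/\s+/g, "-");
}

// Step 3: run the full suite, confirm everything is green, then commit.
```

The LLM handled this fine for any single test; the trouble started when a passing test somewhere else turned red and it began “fixing” files it had no business touching.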
So Many Rules: Since TDD alone wasn’t enough, I tried to constrain the LLM’s logic by feeding it an ever-growing list of rules like “ensure loose coupling” and “follow SOLID principles.” This helped, but it didn’t always work, as the LLM would often “forget” these rules mid-task. You would think the rules would always be in effect, but I found myself having to add reminders like “remember to follow our TDD workflow” to my prompts so it wouldn’t forget.
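For what it’s worth, those rules lived in a Cursor rules file. A trimmed-down, hypothetical excerpt looked roughly like this (where the file lives depends on your Cursor setup, e.g. a project-level .cursorrules file or the .cursor/rules directory):

```
# Project rules (illustrative excerpt)
- Follow the TDD workflow: failing test first, minimum code to pass, all tests green before commit.
- Ensure loose coupling: small modules, single responsibilities, follow SOLID principles.
- Never modify existing, working UI components unless the task explicitly calls for it.
- Commit after every green test run with a descriptive message.
```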
Helper Scripts: I had the LLM create its own tools. I instructed it to write simple shell scripts for its own use, like a start-task.sh script to automatically create and check out a new Git branch.
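For the curious, here’s roughly what that script looked like; this is a reconstruction from memory rather than the exact file the LLM wrote:

```bash
#!/usr/bin/env bash
# start-task.sh: create and check out a new Git branch for a task (illustrative sketch)
set -euo pipefail

TASK_ID="${1:?Usage: ./start-task.sh TASK-001}"

# Start from an up-to-date main branch, then branch off for the task
git checkout main
git pull origin main
git checkout -b "feature/${TASK_ID}"

echo "Now on branch feature/${TASK_ID}"
```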
Structured Task Files: My initial task-list.md didn’t provide enough context. I abandoned it and moved to a system where every task became its own detailed Markdown file (e.g., TASK-001-Implement-User-Auth.md), complete with a user story, acceptance criteria, and technical notes. This gave the LLM a rich, self-contained document for each job, drastically improving its focus.
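A stripped-down, hypothetical version of one of those task files looked something like this:

```markdown
# TASK-001: Implement User Auth

## User Story
As a user, I want to sign in with my email and password so that my data stays private to me.

## Acceptance Criteria
- [ ] Sign-up, sign-in, and sign-out flows work end to end.
- [ ] Invalid credentials show a clear error message.
- [ ] All new code is covered by passing tests.

## Technical Notes
- Follow the TDD workflow defined in the project rules.
- Do not touch unrelated UI components.
```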
The Point of No Return
Even with these increasingly sophisticated (some might say convoluted) guardrails, I eventually hit a wall where letting the LLM just do its thing was no longer possible. The problem was that the foundation of the app, built during the early “pure vibe” phase, was fundamentally flawed. So many poor architectural decisions had been made, and so much tech debt had accumulated, that every new feature caused a cascade of regressions. I kept patching and plugging holes, but it was becoming a bit of a nightmare.
So my experiment to see if I could launch a complete app, one that was actually a viable source of side income, without writing any actual code, was a failure. But it’s not all bad. It did validate my proof of concept. I asked the LLM to write a “lessons learned” doc, to update the PRD, and then archive the entire codebase.
V2
Here’s what I’m doing differently now:
- I am still using Cursor on the $20/mo plan, plus the $100/mo Claude Code Max tier running in a terminal inside Cursor.
- I am heavily enforcing TDD and reviewing all of the tests the LLM writes before it actually starts the implementation, and in many cases tweaking them or writing them myself as needed.
- I am experimenting with Linear for task management. I know that, given my background, I should just be using Jira, but Linear has been getting a lot of buzz lately, so this was a good excuse to try it out. I am writing comprehensive User Stories for every task and having the LLMs work off of those AND update them with comments about their progress.
- I am having the LLM read the Linear issue and then present me with a comprehensive plan that I sign off on before it begins.
- I am also using the Context7 MCP and instructing the LLM to always check the latest documentation before it makes its plan (see the config sketch after this list).
- All work is done on feature branches and I am doing complete, manual code reviews before those branches are merged in.
- I had the LLM help me design a very comprehensive Local/Staging/Production CI/CD pipeline that we always follow.
- I am also seeing how I can bring the Gemini CLI into the mix. So far I have found the code it produces to be much worse than Claude Code’s, and it has some weird quirks (for some reason it can’t make multi-line commit messages), so I’m not sure of the best place for it in my workflow. I’m hoping I can find a way to use it for lighter work so I don’t keep hitting my usage limits on the $100/mo Claude Code Max tier and have to rely solely on Cursor’s Auto mode. Maybe it can be my documentation writer or something.
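For anyone wondering what the Context7 setup looks like: MCP servers are configured in Cursor through a small JSON file (a project-level .cursor/mcp.json). A minimal config looks something like this, though the exact package name here is from memory, so double-check the Context7 docs before copying it:

```json
{
  "mcpServers": {
    "context7": {
      "command": "npx",
      "args": ["-y", "@upstash/context7-mcp"]
    }
  }
}
```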
This new process is working much better. I’m still developing V2 of the app far faster than I could have alone. It would have taken me months to get to this point manually; now, I’m hoping to have a beta ready within a week or so.
The takeaway is clear: AI won’t be developing complex apps autonomously anytime soon. But it will absolutely make senior developers more productive. The future of software engineering is going to be less about knowing the syntax of a language or being an expert on a framework and more about being a good software architect, someone who can write excellent specs, design robust tests, and guide an incredibly powerful, if sometimes flawed, tool.