Vibecoding with LLMs: A personal experience
Although I am not a huge fan of the term "vibecoding" itself, I understand why it appeared and why it is so popular.
Below are some of my own experiences of working with Claude Code, GitHub Copilot and Opencode.
Models used
Claude
- Claude Opus 4.5 - Used mostly for planning and high-value tasks. Consumes tokens too fast.
- Claude Sonnet 4.5 - Used sparingly, mostly for surgical stuff. Pretty OK as an everyday model. Also used it for planning. Decent.
- Claude Haiku 4.5 - Excellent and fast for most tasks. Loses context a lot and will
Google
- Google Gemini 3 Pro - A lot of context errors. Used it with Antigravity, Google's IDE. Decent, but runs out fast.
- Google Gemini 3 Flash - Ugh.. iffy. Don't like.
- Google Gemini 2.5 Pro - Useless.
OpenAI
- Raptor mini (in essence a fine-tuned version of GPT-5 mini, specialized by Microsoft to work with Visual Studio's GitHub Copilot) - Did most of the work with Raptor. Hit and miss. Fast, bad at planning, loses context, and is usually bad at understanding nuance. Though "understanding" is the wrong word for an LLM.
- GPT-5 mini - Meh. I don't know. More of the same as Raptor mini.
xAI
- Grok Code Fast 1 - Spent very little time with it, but liked it. Doesn't use visual references, so it's mostly usable for smaller tasks.
Working with code
I spent a lot of time working with Claude. It's one of the most annoying models you can think of. It constantly eats up tokens with its incessant summaries, and it generates a lot of garbage: useless documents that look very intelligent in their detail but are not really that readable.
Up to a point, it does a decent job of scaffolding code, and it will hold context.
However, past roughly 25k tokens, or around 500 lines of code, it starts to fray at the edges. In my sessions the usable context seemed to top out at about 25k tokens. This means you have to plan your code so it is chunked into bite-sized pieces of at most 500 lines each.
All of the models, across the board, suffer greatly here. They cannot handle context past a certain point, and I was forced to have them do continuous refactors to avoid the pitfalls above. In some cases I wrote the code myself, because it took considerably less time than writing an entire novel just to explain the nuances of what they did wrong.
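Here is a minimal sketch of how that 500-line budget could be checked before handing files to an agent. The 500 figure is only my rule of thumb from above, not a documented limit of any model, and the function names, the suffix list and the node_modules skip are all made up for illustration.

```python
from pathlib import Path

# Assumption: 500 lines is a personal rule of thumb, not an official model limit.
MAX_LINES = 500


def count_lines(path: Path) -> int:
    """Count the lines in a text file, skipping bytes that fail to decode."""
    with path.open("r", encoding="utf-8", errors="ignore") as handle:
        return sum(1 for _ in handle)


def oversized_files(
    root: Path,
    suffixes: tuple[str, ...],
    max_lines: int = MAX_LINES,
) -> list[tuple[Path, int]]:
    """Return (path, line_count) for files above the budget, largest first."""
    hits = []
    for path in root.rglob("*"):
        # Hypothetical filter: only look at the given suffixes, skip dependencies.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        if "node_modules" in path.parts:
            continue
        lines = count_lines(path)
        if lines > max_lines:
            hits.append((path, lines))
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

Calling something like `oversized_files(Path("."), (".js", ".jsx", ".css"))` from the project root gives you a list of files that are probably worth splitting before the next session.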
Things I liked
- I will keep GitHub Copilot just because I absolutely love how fast it is at helping me auto-correct code. This is such a lifesaver for me. Plus, it seems to scan the code to work out which hints to give.
- Great at extracting inline styles into separate classes. This is such a chore to do by hand that outsourcing it to an agent/LLM makes so much sense.
- When I am choosing a tech stack, it does a good job of explaining the strengths and weaknesses of each option, provided you know ahead of time what the options are about.
- Great for searching across the project and finding individual issues.
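To make the inline-styles chore from the list above concrete, here is a rough sketch of what that extraction amounts to. It is not what any of the models actually run. It assumes plain HTML with double-quoted `style` attributes (not JSX) and elements that do not already carry a `class` attribute, and the function name and the `extracted-N` class names are hypothetical.

```python
import re

# Matches a double-quoted inline style attribute, e.g. style="color: red"
STYLE_ATTR = re.compile(r'\sstyle="([^"]*)"')


def extract_inline_styles(html: str, prefix: str = "extracted") -> tuple[str, str]:
    """Swap style="..." attributes for generated classes; return (html, css)."""
    class_for: dict[str, str] = {}  # declarations -> generated class name

    def swap(match: re.Match) -> str:
        # Normalize so "color: red;" and "color: red" share one class.
        declarations = match.group(1).strip().rstrip(";")
        if declarations not in class_for:
            class_for[declarations] = f"{prefix}-{len(class_for) + 1}"
        # Assumption: the element has no class attribute of its own yet.
        return f' class="{class_for[declarations]}"'

    new_html = STYLE_ATTR.sub(swap, html)
    css = "\n".join(
        f".{name} {{ {declarations}; }}" for declarations, name in class_for.items()
    )
    return new_html, css
```

Identical declarations collapse into a single class, which is most of the win. A real codebase needs proper parsing and merging with existing classes, and that is exactly the tedious part I am happy to hand over.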
Things I hated or still hate
- It's absolutely shit at remembering what it did, mostly because there is no memory. Each analysis is more a gamble on weights than an actual analysis.
- It mimics thought, but it does not think. It just approximates intent/cause based on learned patterns.
- It's not reliable at outputting production code unsupervised, even with all the skills and agent improvements in the world.
- When it refactors, 99% of the time it introduces new bugs, loses context and duplicates features. A human developer would extract information by hand and then piece it all together. LLMs are in a continuous rush to finish. Similar to a junior.
- It's shit at thoroughness. Whatever it does, it does at surface level. The belief that the models behave differently based on who is using them is naive. They are all pretty much the same.
- Some days Claude feels like a genius that huffed jet fuel; on others, like the dumbest kid on the block. This goes back to the earlier point about how it works on a fundamental level (weight-based approximation).
- What is most frustrating is how much you need to write and describe bugs in order to ensure the AI fixes what is needed.
Planning
Most models do a pretty good job of planning, but by far the best plans seem to come from Claude Opus. At least as far as I noticed.
One small note, though. I personally prefer Google's own IDE, Antigravity, which has implemented a really natural way of planning: it lets you leave comments and notes for the LLM to ingest after the planning phase.
This feels natural and makes more sense than having to stop, write, stop and review the plan several times over. That is just too tedious.
As with coding, the LLMs have a natural tendency to jump into implementation, overeagerly trying to start coding. I hate this part. They do ask if you have follow-up questions or if you want to auto-accept and start, but I feel it is made way too easy and alluring to jump straight into implementation.
Planning mode makes a lot of sense whatever you're doing, because it lets you mull over the architecture ahead of time. Ask a lot of follow-up questions and add notes, making sure you give as much context as possible before the implementation phase begins.
Execution
In terms of code quality? I am not a certified expert, since I am still a UX designer at heart. But I can tell spaghetti code when I see it.
For the most part, all LLMs LOVE to write spaghetti code. They live for that shit! If you're not careful, you'll end up with a huge pile of discombobulated classes and functions that you need to comb through before they make sense to any human.
Sometimes it does a good job. Like yesterday, when I asked it to refactor my current code and port my crazy Figma plugin from vanilla JavaScript to React with Vite and Rollup.
But even then, it managed to lose SO MUCH CONTEXT that it took me a whole day to go through all of the bugs, handholding it and telling it where it fucked up and how to fix it based on the old sources.
So I ended up with a mixed codebase just to make sure the core functionality was ported over from the original thing. And it was incredibly taxing.
After a round of vibe coding, where you are actually present and engaging with the LLM, you'll feel drained after having had to handhold it and nanny it to make sure you still have something to return to in the end.
Results
I for one can't say that LLMs are entirely useless in this case. Au contraire!
But they need a huge amount of guidance, handholding and information to pull off a good end result JUST from LLM magic.
However, if you do happen to know a bit of code, OOP and some programming principles, you should fare much, much better than the average individual.
In my own case:
- I built 2 functional digital products from scratch during Christmas time
- Started work on 2 others (which I parked for now)
- A mini-SaaS companion app is in the works
- A project management tool is also being worked on
The future
I don't think humans will be the bottleneck. The token-based lotto system most AI companies have implemented will be. I feel this is by design: it creates some level of addiction and embeds the tools into organizations so that they become irreplaceable.
Developer recruitment will likely resume by the end of this year, once most companies sober up.