MCPs and questionable vibes

I spent some time last week seeing how far I can push LLMs to generate some basic application functionality while working on Lazerbunny. I wanted to stick closely to the plan, so I started with the timer application, as it is the simplest one. The scope is small and well defined, and it would not take much to simply write the roughly 200 lines of code by hand, so it seemed like a good candidate to let an LLM handle the work.

I am doing this exercise every six months to see how far I can push an LLM to actually be useful. "Usefulness" is in my opinion only the beginning of the discussion about whether we should let agents run free and pump out code. Ethics (training data acquisition), law (copyright, data sharing) and environmental impact (water and energy consumption) all need to be talked about and taken into consideration too, and are from my perspective conveniently ignored by LLM advocates. But the basis for having these conversations is - in my opinion - to establish whether the technology can actually be useful in the first place.

All of the tests were done using Zed or OpenCode. I ran the same task with three different models: Gemini, Qwen3-coder-next and Qwen2.5-coder. To summarize the task: Build an application to hold multiple timers in memory. Do not allow duplicated timer names. Timers have a duration. Create a web interface. Create an API. Add Model Context Protocol (MCP) support. For Zed I gave instructions step by step. For OpenCode I let it run the planning phase, refined it until the plan looked good and then issued the build command. I specified the technology too: Go, vanilla CSS and vanilla JavaScript, via Agents.md. This is only a rough summary; the actual steps were obviously a bit more involved and elaborate.

The Good

All models and ways to create the app led to something that kind of worked. Yay. Now you might ask why I accept "kind of" as good: it did save me some time I would have otherwise spent throwing together boilerplate and writing frontend JavaScript and CSS. This is the part that worked really well.

Guiding the LLM through the implementation using Zed led to significantly better code, as well as a better code layout and architecture, but it also involved more work and required me to be present in front of the computer.

The Bad

Not all code actually worked. Sometimes I needed to manually test the app and tell the LLM what to fix. Sometimes it figured it out and sometimes it spiraled so badly it deleted everything, rewrote it and made the same mistake again.

Refactoring was particularly bad. Qwen3-coder-next embedded the whole web interface in a constant. Instructing it to move the web interface to a separate HTML file made it run in circles for 40 minutes before giving up. Changing a function across the web UI and API was a major struggle for Gemini.

The Ugly

It was very easy to notice when the training material used for a model is older than the current version of the library in use. This is true for all three models. They especially struggled with the MCP part using the MCP Go SDK, and with some parts of Pydantic AI where the API clearly changed between training and now. This usually led to completely broken code and no indication of what was wrong. All models tried to resolve the issue and only wasted time getting nowhere.

Test cases are fun. Qwen3-coder-next wrote the most elaborate test suite I have seen in a long time. So many tests, in fact, that you could barely change a message without half of them failing.

Overall, the code produced by all three LLMs was not even close to something I would publish, let alone ship to production. Too much code, too many unnecessary indirections, random function calls doing nothing, everything over-architected or just crammed into one file.

Vibes are still off

The way I currently use LLMs is fairly easy to describe: research. I kick off a few questions about how to do x, y and z and come back some time later to a summary and links to relevant material. I quickly read up on it to see if the provided code is sound, then copy and paste it or rewrite parts of it. This works surprisingly well for me and saves me some time throughout the week.

If I were to adopt an agentic workflow like the one I tested this week, the bad and the ugly would need to improve drastically - especially if I were using a paid model and could not run infinite tokens locally. My usual comparison is an intern or part-time junior dev. Depending on where you are located or where you are hiring, you can get far higher quality output from an actual human for a comparable price.

With all the shortcomings I have observed, I do not see enough value in changing the way I currently use LLMs, and I do not see the benefit of adopting a fully agentic workflow - especially because I will not compromise on the quality of the work I deliver. For one-off experiments with nothing on the line (like a service handling a few timers) it is okay and saves me some time.

As a side note: I ran a few tests with more obscure problems, like implementing a custom SPS (UDP) protocol, and all I can say is that if you ever want to see an LLM fail spectacularly... give it a task that cannot be found on GitHub.

MCP

I have implemented the MCP server in Go for the timer service and a client in Python using Pydantic AI. Things are working okay most of the time, but there have certainly been a few learnings. Keep in mind this is one of the simplest setups possible: process the prompt, call a tool that is literally named after what is in the prompt, and let a sub-agent handle the MCP calls.
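
To make that concrete, here is a minimal sketch of the client side using Pydantic AI's MCP support. The command, model string and prompts are placeholders, and - as I complain about below - the exact class and parameter names have shifted between releases, so treat it as illustrative rather than as the actual Lazerbunny code.

```python
import asyncio

from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStdio

# Hypothetical: the Go timer service exposed as an MCP server over stdio.
timer_mcp = MCPServerStdio('./timerservice', args=['--mcp'])

agent = Agent(
    'openai:gpt-4o',          # placeholder model string, not what I actually run
    mcp_servers=[timer_mcp],  # newer releases call this parameter `toolsets`
    system_prompt='You manage timers. Use the MCP tools to set and stop them.',
)

async def main() -> None:
    # Older releases need this context manager to open the MCP connection;
    # newer ones use `async with agent:` instead.
    async with agent.run_mcp_servers():
        result = await agent.run('Set a timer called "tea" for 180 seconds.')
        print(result.output)  # `.data` on older releases

if __name__ == '__main__':
    asyncio.run(main())
```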

You had better specify the input and output schema if you do not want to have a really bad time. You might think the MCP layer can infer the types. Do not trust it to do that. Type the whole schema. Even with a typed schema, about one in 50 calls still tried to pass the duration as a string instead of an integer.
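
To show what I mean by "type the whole schema", here is a sketch of the idea. The real server is written in Go with the MCP Go SDK, but the principle is the same in Python: spell out every field and its type instead of hoping it gets inferred. The `set_timer` tool and its fields are made up for illustration.

```python
from pydantic import BaseModel
from pydantic_ai import Agent

class TimerCreated(BaseModel):
    """Explicit output shape, so nothing about the result is guessed either."""
    name: str
    duration_seconds: int

agent = Agent('openai:gpt-4o')  # placeholder model string

@agent.tool_plain
def set_timer(
    name: str,              # unique timer name, duplicates are rejected by the service
    duration_seconds: int,  # whole seconds as an integer, never a string
) -> TimerCreated:
    """Create a timer with the given name and duration."""
    # The real implementation would call the timer service; stubbed out here.
    return TimerCreated(name=name, duration_seconds=duration_seconds)
```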

Thinking models like to "think". And they "think" a lot. So much, in fact, that they sometimes get really confused and loop their tool calls. I could prompt around calling the tool multiple times, but then the agent stopped at a successful tools/list call and never set the timer. If the stop condition was not well defined enough, one timer ended up as three timers.
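
One guard rail worth mentioning, as a sketch based on my reading of the Pydantic AI docs rather than something I relied on here: a hard cap on request round trips at least turns an endless tool-calling loop into an error you can handle.

```python
from pydantic_ai import Agent
from pydantic_ai.usage import UsageLimits

agent = Agent('openai:gpt-4o')  # placeholder model string

# A looping model raises UsageLimitExceeded after five round trips
# instead of calling tools/list forever.
result = agent.run_sync(
    'Set a timer called tea for three minutes.',
    usage_limits=UsageLimits(request_limit=5),
)
print(result.output)
```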

Llama 3.1 seems more robust at tool calling than LFM2.5-thinking. Also, Llama 3.1 - no matter the prompt - has the "personality" of a potato. LFM2.5, on the other hand, behaves as if it were trained on TikTok. Seriously, how many emojis do you need to tell me a timer was set?!

Adding "do not say a task was completed if you do not find a tool to do it" does mean the LLM will lie to you and tell you a timer is set without ever calling a tool.

Multi-agent patterns in Pydantic AI feel kind of clunky. I really do not like the syntax of defining a separate tool on the delegating agent just to call a sub-agent; there has to be a far better way, even if it is as simple as making the setup for sub-agents the same as for tools. That said, the sub-agent pattern is really handy when running different models / LLM providers in the same app.
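
For anyone who has not seen it, this is roughly the shape I am complaining about - a sketch of the delegation pattern with made-up names and placeholder model strings:

```python
from pydantic_ai import Agent, RunContext

# Sub-agent that owns the timer tools (MCP wiring omitted for brevity).
timer_agent = Agent(
    'openai:gpt-4o',  # placeholder; this is where a different provider could go
    system_prompt='Handle timer requests via the available tools.',
)

# Delegating agent that talks to the user.
main_agent = Agent(
    'openai:gpt-4o',  # placeholder
    system_prompt='Answer the user and delegate anything timer-related.',
)

@main_agent.tool
async def manage_timers(ctx: RunContext[None], request: str) -> str:
    """Hand a timer-related request over to the sub-agent."""
    # Passing ctx.usage keeps token accounting shared across both agents.
    result = await timer_agent.run(request, usage=ctx.usage)
    return result.output  # `.data` on older releases
```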

Pydantic AI being able to generate a simple chat interface with two lines of code is immensely helpful for starting to test your agent and iterating on it. It also shows how much an agent can struggle to call a single tool.
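
The two lines in question look roughly like this, assuming I remember the method name correctly (there is also a small web UI variant in the docs):

```python
from pydantic_ai import Agent

agent = Agent('openai:gpt-4o', system_prompt='You set and stop timers.')  # placeholder model
agent.to_cli_sync()  # interactive chat loop in the terminal
```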

Progress

Well, I can now set timers via the chat interface... most of the time. So far Siri is still a bit more reliable, so I have not met the minimum bar I am aiming for. But I can set a timer on my phone and stop it on my computer, so I got that going for Endirillia. Most of the plumbing is done and OpenCode is slowly but surely failing to build a Home Assistant integration in the background, but I can start focusing more on building out the LLM piece and then adding more services.

I'm still working on the 3D avatar too. Earlier in the week we had a close friend over who studied art, including 3D modeling, and runs her own studio for concept art, storytelling and character design. She told me I have a good starting point and do not seem hopelessly incompetent, which has to count for something, right? After all, I am one third of the way through the first episode of the 28-part tutorial. But that episode also covers creating the head, which seems to be the most complex and by far the longest session. And that in only 20 hours!

Here is the thing: I never expected to be fast or good at this, especially not on the first try. Art was never one of my favorite subjects and I literally once handed in a piece of paper with black paint on it titled "eagle at night with no stars or moon". So considering my history with artistic endeavors I am very pleased with the progress. Especially after she showed me how subsurface something something magically makes my vertices look like a face! Can only get better from here.

