Less sophisticated, yet better
Do you know the feeling when you look at a system and how it behaves and think "this cannot be it"? That is how I felt working on LazerBunny this week. Somehow, for some reason, the MCP approach struck me as the less favourable one, and after some experimenting I completely dropped MCP from the existing timer service. I also had a bit more fun with voice cloning and am slowly homing in on the final solution.
The usual argument against MCPs is that they fill up your context with 500 functions you will never need, just to accomplish one small thing. While this is true, it is not really a problem for me: all services are built from scratch and therefore only add the functionality that is actually needed. Since everything runs locally, the usual security concerns are not that big of a deal either.
When you create an MCP server and client, you have to agree on what data to exchange. Calling an MCP function is fairly straightforward. The return value can also be defined via a schema both systems need to know about. So far so good. You obviously have to parse the response and shape it into something your agent understands, which is especially important if you want a stop signal that ends a request / task.
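The response-shaping step can be sketched roughly like this. To be clear, the field names ("content", "done") and the ToolResult shape are my own assumptions for illustration, not the actual schema the timer service uses:

```python
from dataclasses import dataclass

@dataclass
class ToolResult:
    text: str        # the content handed back to the model
    is_final: bool   # the stop signal that ends the request / task

def shape_response(raw: dict) -> ToolResult:
    # Parse the agreed-upon schema and map an explicit flag onto the
    # agent's stop signal instead of letting the model infer it.
    return ToolResult(
        text=str(raw.get("content", "")),
        is_final=bool(raw.get("done", False)),
    )
```

Making the stop signal an explicit field rather than something inferred from the text is what makes "end this task now" reliable.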
Using an MCP from an agent usually involves two steps: first a list call to fetch all available tools, then potentially a call to invoke one. As this is happening locally it is pretty fast… for one or two MCPs. Things get a little more interesting when you have ten. Still not too bad, but noticeably slower.
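The two-step flow looks roughly like this. This is a toy stand-in, not the real MCP SDK; the method names just mirror the protocol's list-then-call steps:

```python
class MCPSessionSketch:
    """Toy stand-in for an MCP client session (not the real SDK)."""

    def __init__(self, tools):
        self._tools = tools  # name -> callable

    def list_tools(self):
        # Step 1: ask the server which tools it offers.
        return sorted(self._tools)

    def call_tool(self, name, arguments):
        # Step 2: invoke one of the listed tools by name.
        return self._tools[name](**arguments)

session = MCPSessionSketch({"set_timer": lambda seconds: f"timer set for {seconds}s"})
available = session.list_tools()                           # step 1
result = session.call_tool("set_timer", {"seconds": 300})  # step 2
```

Every connected server adds its own list round trip before any tool can run, which is where the slowdown with ten MCPs comes from.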
What I found is that tool calling is simply more reliable if the tool is directly available to the agent instead of going via MCP. I have not figured out why. It should not be the case, at least in theory.
But once I dropped MCP as a protocol, added a tool to the agent and called the timer service's API directly, things became a lot more reliable. Still not 100% deterministic, but that is expected when dealing with a little hallucination box.
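A direct tool is not much more than a schema plus a plain function. The layout below is a hedged sketch of that idea; the spec format and function are illustrative, and in the real system the function would hit the timer service's HTTP API instead of echoing its payload:

```python
import json

# Illustrative tool spec the agent sees instead of an MCP tools/list result.
TIMER_TOOL_SPEC = {
    "name": "set_timer",
    "description": "Start a countdown timer on the local timer service.",
    "parameters": {
        "type": "object",
        "properties": {
            "seconds": {"type": "integer"},
            "label": {"type": "string"},
        },
        "required": ["seconds"],
    },
}

def set_timer(seconds: int, label: str = "") -> str:
    # In the real system this would POST to the timer service's API;
    # echoing the payload keeps the sketch self-contained and runnable.
    return json.dumps({"status": "scheduled", "seconds": seconds, "label": label})
```

One fewer protocol layer between the model and the service, and one fewer round trip before anything happens.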
MCP as a protocol makes a lot of sense when you ship a binary or a dedicated app that provides functionality to an existing agent or LLM. Since I built the whole thing from scratch and do not care about reusability, I will take the easier way, which also happens to work better. No more MCP for now.
Routing agents
The theory seems simple enough: have an agent with a general-purpose prompt route requests to specialised agents. A very clean design, easy to reuse or swap models as needed, different prompts, it all sounds so good… until you run it.
Now, for a system that is not hardware constrained this is actually a very good approach. But I am operating on a budget and only a few systems running LLMs. (Technically one in a perfect world but that will not happen anytime soon.)
I already mentioned that lfm2.5-thinking did not really work out as a routing LLM. So right now the system is running Qwen 3.5 9b, which takes up the whole GPU (3060 Ti). When the routing agent delegates a task to a sub-agent, the whole prompt, context and so on has to be processed again, which takes some time on slower hardware. Consolidating as many tools as possible on the routing agent makes things noticeably faster.
There are exceptions, though. An agent for research or coding that really benefits from a highly specialised prompt is worth a bit of overhead. There is also a good chance these will get a separate LLM on different hardware: the coding agent will likely use my Mac Studio running Qwen 3.5 122b, and the research agent (as there will be a lot of work) gets the 4090 during work hours.
But anything else handled in the system will now be a tool on the "main agent". This is the first time I am building a system of this complexity with local models and limited hardware only, so previous learnings do not always apply.
Fine-tuning
Last time I did not get to finish the fine-tuning of Qwen3-asr. That happened over the week, and I am thoroughly disappointed. It does not get close enough to the original voice to be worth the effort. Well, the voice itself is not too bad, but the intonation is off.
Fish Speech or GPT-SoVITS would likely solve this a lot better, but both are a bit too heavy for the Mac mini I am planning to run them on. So the question is what to do from here. I could accept voice not being near real-time and use one of the two other models - which would likely annoy me. Or live with the voice being inconsistent - which would also annoy me.
Or I use the Qwen3-asr voice designer and persist a "persona" the model can reliably reproduce. It will not be even close to as good as a VA or a fine-tuned version of one of the larger models, but it will fit on the box it is supposed to run on, and it will be consistent.
Progress
I am getting to the point where I am starting to put things together in a way that resembles the final design. Which also means slowly deploying everything to its final environment and starting to daily-drive it. This will be a good test of how far off I am from what I hoped this would be.
The avatar's body is also making progress, as in: there is a body now. I am getting better at Blender and can slowly see how to change things on the fly. I will still follow the tutorial for now, even if it ends with a stick and two melons slapped on it. There will either be a v2 as a direct follow-up or some fixes during post-editing.
posted on March 22, 2026, 7:45 p.m. in AI, lazerbunny