Mission Impossible - using a tool

This week's work on LazerBunny was mostly focused on building out two agents and wiring everything up to an MCP server. Turns out an LLM, by virtue of being a little box full of hallucinations, is not the most reliable at following very simple orders. And it shows when working with structured output.

It was a rather slow week. I had to deal with a concussion and there were some unexpected fires at one of my clients to take care of, so private projects had to take a step back.

To be able to interact with agents as I build them, I spun up Pydantic AI's web interface. So far there are two functions built out. One is countdowns / timers, a Go application exposing its functionality via MCP. The other is a simple agent using a single tool to get the weather from Deutscher Wetterdienst. Both faced essentially the same problem: dealing with tool outputs. I am using Pydantic AI and opted for structured output to make it easier to pass data around. The number of times I have seen

Error: Exceeded maximum retries (1) for output validation

or a similar error (that literally means the exact same thing) is too many to count. Print statements? Why would they ever work. Helpful error messages? Not a chance. I even got to the point of joking on Mastodon that the only reason Pydantic AI exists is to sell Logfire, to make this thing remotely debuggable. You can obviously work your way around it. But I have not used a single language or framework in the past 20 years that was so unhelpful during development.

I usually accept whatever a language or framework throws at me and work with it. But there are points early in a project where a question such as "is this the right tooling for what I want to do" can and has to be asked, especially when running into issues with basic functionality - such as debugging why an output does not validate properly, despite everything being typed to exactly the same values.

"Fun" fact: Llama sometimes just decides to add markdown to strings or to treat integers as strings. Because why not. Qwen 2.5 and LFM 2.5 are far better in that regard, yet Llama is considerably faster at calling tools.
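The integers-as-strings thing is subtle, because plain Pydantic in its default lax mode will quietly coerce a numeric string, while strict validation rejects it. A tiny sketch (the field name is made up for illustration) of the difference:

```python
from pydantic import BaseModel, ValidationError


class Reading(BaseModel):
    humidity: int  # hypothetical field for illustration


# Lax mode (Pydantic's default) quietly coerces numeric strings...
assert Reading.model_validate({"humidity": "70"}).humidity == 70

# ...but strict mode rejects the same payload outright, which is the
# behaviour you hit when types must match exactly.
try:
    Reading.model_validate({"humidity": "70"}, strict=True)
except ValidationError:
    print("strict validation rejects the string '70'")
```

So whether Llama's string-typed numbers blow up depends entirely on which mode the validation runs in.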

There are not many frameworks that can compete with Pydantic AI from a functionality perspective, and having dealt with LangChain and others in the past, I do not think things would be significantly better after switching frameworks or stacks. But it is also a bit disheartening, to be honest. It means I am basically stuck with it and have to start building tooling around it.

There is actually an argument to be made for only falling back to an agentic workflow when a regex for easy commands (such as "set a timer for 20 minutes") does not match. I rarely change how I give simple instructions to an AI, so this would at least be a mitigation. But it does not solve the underlying problem.

Calling tools

I am still not sure whether an LLM is considerably worse at calling a function than I expected, or far smarter than I would hope for. Assume a regular tool that returns the weather at my home location. def weather():. Nothing fancy. But do you think the LLM would just call the tool when I ask for the current weather? No, of course not. Especially not when explicitly instructed to do nothing but call the appropriate tool.

I obviously need to specify the location for which I want the weather forecast. Despite the tool neither accepting nor mentioning latitude / longitude arguments. Sometimes the tool was called with a random lat / lon, usually Shanghai. Please do not ask me why Shanghai. I have no idea.

Sometimes - thanks to a debugger I can say this for sure - the call was successful, but there was simply no data returned.

The current weather at your location (approximately 51.0, 6.9 latitude and longitude) is as follows:
• Temperature: [Temperature]
• Conditions: [Conditions]
• Humidity: [Humidity]%
Please replace the placeholders with the actual data from the API response. It's currently [Weather Description], with a temperature of [Temperature] degrees Celsius and humidity at [Humidity]%.
Would you like more details or information about a specific area?

Yep. Why use the values returned by the weather API when you can do that. Makes sense. Ask the same request a second time and behold: Weather!

Based on the latest data, the current weather at your location is:
• Temperature: 15°C
• Conditions: Partly Cloudy
• Humidity: 70%
It's currently partly cloudy with a temperature of 15 degrees Celsius and humidity at 70%.

Another fun issue was a tool being called multiple times because exit conditions were not honoured. Which exit condition, you may ask. The one written out in plain English, because why should I be able to have a proper return signal that something was successful. So either a timer was set properly. Or ten times. Besides being more explicit with structured output and instructions, I limited the number of calls to the MCP server to two. Kind of works, but feels ugly. Two is actually the right amount, because an agent will first make a call to list all tools and then a call to invoke one.
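The cap itself is trivially expressible as a counter that the dispatch path has to go through - again a plain-Python sketch of the idea, not the framework's own usage-limit mechanism:

```python
class CallBudget:
    """Hard cap on MCP calls per request: one to list tools, one to
    invoke a tool - the limit of two described above."""

    def __init__(self, limit: int = 2):
        self.limit = limit
        self.used = 0

    def spend(self, what: str) -> None:
        """Consume one call from the budget, or refuse loudly."""
        if self.used >= self.limit:
            raise RuntimeError(f"call budget exhausted, refusing {what!r}")
        self.used += 1


budget = CallBudget(limit=2)
budget.spend("list_tools")
budget.spend("set_timer")
try:
    budget.spend("set_timer")  # the runaway retry would land here
except RuntimeError as err:
    print(err)
```

It is a blunt instrument - a legitimate multi-tool request would also hit the cap - but it reliably stops the "set a timer ten times" failure mode.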

Progress

I had successful runs with "set timer, list timer, stop timer". And I got a weather forecast. So I am mostly at feature parity with what Siri can reliably do. But the process of getting there was not pleasant. And I have to admit I currently do not trust this to turn into a mostly reliable system. I am saying mostly because there is still a little hallucination box in the middle that could make things up at any point in time.

There was no progress on the avatar. As mentioned above, more work than expected and a concussion are not very helpful when learning 3D modelling. Hopefully I can return to Blender next week.

posted on March 1, 2026, 4:54 p.m. in AI, lazerbunny, software engineering

I am perpetually a little bit annoyed by the state of software - projects constantly changing, being abandoned or adding features that make no sense for my use case - so I started writing small tools for myself which I use on a daily basis. And it has not only been fun, but also useful. For the rest of the year I will focus on a project I have been thinking about for a few years: Building a useful, personal AI assistant.