The web is a horrible data source

As I was working on new tools for my personal assistant this week, I ended up adding web search, and with it fetching the content of a website, to the toolbox. Automating anything web-related in 2026 is a pretty annoying experience, and with good reason. It is understandable that website owners add scraping safeguards, considering the state of the web. Luckily, when you build tools to behave well, things are manageable.

There are two parts to adding "the web" as a data source to an LLM, whether it powers a personal assistant or a coding agent. First you need to find the information you are looking for, then you need to get the data into a format your LLM likes.

I was considering using Kagi's search API, now that it is available. I might actually start using it when I put my research agent together, but for now $12 for 1000 requests is a bit steep, especially as long as I do not know how wild my assistant and agent go when presented with the opportunity to slurp in infinite information. Exa was highly praised by two good friends of mine, and their pricing seems reasonable. But as with Kagi, I might hold off on this until I have a bit more time for in-depth research into how my tools behave when presented with the opportunity.

Luckily I was already hosting SearXNG. And while it is not on par with Kagi search for obscure search results, it’s still pretty good. All I had to do was enable the JSON endpoint, and we are good to go with a simple HTTP GET request.

searchURL := fmt.Sprintf("%s/search?q=%s&format=json", s.baseURL, url.QueryEscape(si.Query))
resp, err := s.client.Get(searchURL)
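As a rough sketch of what happens with that response, here is the same idea in Python: build the search URL, then pull the result URLs out of the JSON body. It assumes the response carries a top-level "results" list whose items have a "url" field; the function names and the limit parameter are made up for illustration and are not part of LazerBunny.

```python
import json
from urllib.parse import urlencode


def build_search_url(base_url: str, query: str) -> str:
    """Build the SearXNG search URL that asks for JSON output."""
    params = urlencode({"q": query, "format": "json"})
    return f"{base_url.rstrip('/')}/search?{params}"


def extract_urls(body: str, limit: int = 5) -> list[str]:
    """Return up to `limit` unique result URLs from a JSON response body."""
    payload = json.loads(body)
    urls: list[str] = []
    for item in payload.get("results", []):
        url = item.get("url")
        # Skip entries without a URL and duplicates of earlier results.
        if url and url not in urls:
            urls.append(url)
        if len(urls) >= limit:
            break
    return urls
```

Capping the number of URLs is one cheap way to keep an eager assistant from fetching every hit a query returns.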

Not too bad, but this only gives us the URLs we might be interested in. How do we get to the content? If your first suggestion is "just do another s.client.Get", I can only respond with "oh, my sweet summer child", because those times are long gone. Between Cloudflare's "are you human" checks, fox girls and single page applications, it is not that easy anymore.

Luckily, we can repurpose the tools we already use to test the web applications we build, such as Playwright. I know, I know, some might now raise their pitchforks and scream “Selenium!!” at the top of their lungs in support of W3C web standards. Fair. But I refuse to write a single sleep() statement and hope for the best ever again.

Playwright and a headless browser are a pretty neat and robust solution to get HTML from a website. But feeding the whole HTML page to an LLM burns through tokens and adds a decent amount of noise. So when we are already using one tool from Microsoft, why not a second? MarkItDown does an excellent job converting various input formats to markdown. Incidentally, a format LLMs really like!

With these two libraries in place, putting them together behind a FastAPI endpoint is all we need. The first version looked like this.

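import io
import os

from fastapi import FastAPI, HTTPException
from markitdown import MarkItDown
from playwright.async_api import async_playwright
from pydantic import BaseModel

app = FastAPI()

# Model definitions are assumed here, inferred from how the handler uses them.
class URLRequest(BaseModel):
    urls: list[str]

class Data(BaseModel):
    url: str
    markdown: str
    error: str

class Response(BaseModel):
    results: list[Data]

@app.post("/markdown")
async def get_markdown(request: URLRequest):
    browser_ws_url = os.environ.get("BROWSER_WS_URL")
    if not browser_ws_url:
        raise HTTPException(status_code=500, detail="BROWSER_WS_URL not set")

    response = Response(results=[])
    md = MarkItDown()

    async with async_playwright() as p:
        # Attach to an already running Chromium over the DevTools protocol.
        browser = await p.chromium.connect_over_cdp(browser_ws_url)

        try:
            for url in request.urls:
                res = Data(url=url, markdown="", error="")

                page = await browser.new_page()
                try:
                    await page.goto(url)
                    # Rendered HTML after navigation, converted to markdown.
                    content = await page.content()
                    iob = io.BytesIO(content.encode("utf-8"))
                    result = md.convert_stream(iob)
                    res.markdown = result.text_content
                except Exception as e:
                    # Record the failure per URL so one bad page does not fail the batch.
                    res.error = f"Error processing {url}: {str(e)}"
                finally:
                    await page.close()
                    response.results.append(res)
        finally:
            await browser.close()

    # Return the model and let FastAPI serialize it; response.json() would
    # hand FastAPI a string and yield doubly encoded JSON.
    return response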

This is also the first Python code I use in LazerBunny. I do not feel like messing around with less tested libraries for this problem as I consider it an annoying chore to take care of, not a core feature I am actually interested in building out. All of this to allow Endirillia to look up when the next G2 game at MSI is.

Since the MVP, I have started adding a few API endpoints and persistent browser sessions so my coding agent can leverage the same, already deployed infrastructure to test changes made to web interfaces. Pretty neat two-for-one.
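For completeness, a caller of the /markdown endpoint could look roughly like the sketch below. The field names (urls, results, url, markdown, error) are taken from the handler shown earlier; the function names, the service URL and the timeout are assumptions for illustration. Depending on how the endpoint serializes its model, the body may be a JSON object or a JSON string containing one, so the parser accepts both.

```python
import json
import urllib.request


def build_markdown_request(service_url: str, urls: list[str]) -> urllib.request.Request:
    """Build the POST request for the /markdown endpoint."""
    body = json.dumps({"urls": urls}).encode("utf-8")
    return urllib.request.Request(
        f"{service_url.rstrip('/')}/markdown",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_markdown_response(raw: bytes) -> dict[str, str]:
    """Map each successfully converted URL to its markdown."""
    payload = json.loads(raw)
    # Tolerate a body that is a JSON string wrapping the actual object.
    if isinstance(payload, str):
        payload = json.loads(payload)
    pages: dict[str, str] = {}
    for item in payload.get("results", []):
        if item.get("error"):
            continue  # Skip pages the service could not process.
        pages[item["url"]] = item.get("markdown", "")
    return pages


def fetch_markdown(service_url: str, urls: list[str], timeout: float = 60.0) -> dict[str, str]:
    """Send the URLs to the service and return the converted pages."""
    request = build_markdown_request(service_url, urls)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return parse_markdown_response(resp.read())
```

Dropping failed pages here keeps error strings out of the LLM context; a real tool might prefer to report them back instead.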

Day to day tools

There are two more tools which I can tell you will be used on a daily basis. One is to send Wake-on-LAN packets. I have my workstation and gaming / ML system mounted in a rack in the basement. To turn them on I usually went to Home Assistant and fired off a WoL packet. Now I can tell Endirillia "wake up ml" and we are good to go. From a pure mouse click to result perspective this is quicker than opening HA. Once my voice agent is deployed it will be even faster.
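A Wake-on-LAN magic packet is small enough to build by hand: six 0xFF bytes followed by the target MAC address repeated sixteen times, usually sent as a UDP broadcast. The sketch below shows this in Python; the function names, the broadcast address and port 9 are assumptions (port 9 is a common convention), and this is not the actual tool from LazerBunny.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet for a MAC like 'aa:bb:cc:dd:ee:ff'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"invalid MAC address: {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # Raises ValueError on non-hex input.
    # Six 0xFF bytes, then the MAC repeated sixteen times: 102 bytes in total.
    return b"\xff" * 6 + mac_bytes * 16


def send_wol(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet over UDP on the local network."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))
```

A command like "wake up ml" then only needs a lookup from the host name to its MAC address before calling send_wol.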

The second tool is a Home Assistant integration. I have two Elgato Key Lights I use when on video conferences that need to be turned on. I also turn the ceiling light on or off depending on the current ambient light or time of day. I have scenes set up in HA, but for Endi I opted to do simple toggles for the device entities. This will be useful in the long run when I use one command to not just set the scene but also mute Endi and toggle a few other automations.
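Toggling an entity maps nicely onto Home Assistant's REST API, which exposes services under /api/services/<domain>/<service> and authenticates with a long-lived access token sent as a bearer token. A minimal sketch, assuming the entity's domain (such as light or switch) offers a toggle service; the base URL, token and entity ID are placeholders, and the function names are made up for illustration.

```python
import json
import urllib.request


def build_toggle_request(base_url: str, token: str, entity_id: str) -> urllib.request.Request:
    """Build the request that toggles a single entity, e.g. 'light.key_light_left'."""
    # The service domain is the part of the entity ID before the first dot.
    domain = entity_id.split(".", 1)[0]
    body = json.dumps({"entity_id": entity_id}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/services/{domain}/toggle",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def toggle(base_url: str, token: str, entity_id: str, timeout: float = 10.0) -> int:
    """Send the toggle request and return the HTTP status code."""
    request = build_toggle_request(base_url, token, entity_id)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```

Working on individual entities instead of scenes keeps each call trivial, and a later "video call" command can simply chain several toggles.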

When you think about giving your personal AI a slightly sassy personality, think twice. It sounds charming, but even in models without guardrails removed this thing will roast you if you get the system prompt right.

Progress

I am very much satisfied with the progress I made this week. I got some work on the coding assistant and its web UI in, but was mostly focused on starting to make the actual backend for Endi useful to a point where it is worth deploying the whole thing and starting to use it on a daily basis. There were some bug fixes for the calendar server and I am slowly getting to the point where Apple Calendar not only just works, but also does not constantly throw a tantrum about the smallest things that do not impact actual functionality.

For next week I mainly plan to finish the calendar server. Testing email invites is still on the todo list, but I also want to add the option to authenticate via OIDC. And most likely some rate limiting when login attempts fail. But once that is wrapped up I think it will be a good time for a v1 release.

posted on July 5, 2026, 8:37 p.m. in AI, lazerbunny, web

I am perpetually a little bit annoyed by the state of software - projects constantly changing, being abandoned or adding features that make no sense for my use case - so I started writing small tools for myself which I use on a daily basis. And it has not only been fun, but also useful. For the rest of the year I will focus on a project I have been thinking about for a few years: Building a useful, personal AI assistant.