TTS adventures

Another week working on LazerBunny, and things are progressing. Not as expected, but that is fine. I spent some time with text-to-speech this week, generating voices and trying to clone one. It has been a rollercoaster, and not just thanks to the tooling.

Before we jump into details, let me say: the amount of software, scripts and examples published by AI providers and companies that is outright broken is hilarious. It ranges from simple things, like colliding requirements preventing uv sync from producing any form of workable environment, to plainly broken code you have to fix before it runs. Even things any linter would find, like variables that are used but never declared.

But enough with the complaining. I ran three separate tests this week. I actually hoped to give you the results of all three, but since my 4090 is busy rendering Resident Evil Requiem this will have to wait till next week.

I am using Qwen3-TTS. (Something that according to Qwen 3.5 does not exist. Guess when the training data cut-off for 3.5 was.) Right now I am only using the 1.7B model Qwen3-TTS-12Hz-1.7B. I assume the 0.6B will be sufficient when I actually start using it, but I want to see what the big model can do first and directly compare it to the smaller one.

The first attempt was fairly straightforward: one-shot voice cloning with a five-second clip. That was a complete failure. Intonation was barely there, the voice was off and it did not sound at all like the reference clip. Even generating the exact same sentence as the reference did not improve things.

Moving on to four clips of 20 seconds each, things looked a lot better. I borrowed some voice lines from a video game character my wife and I both know and generated some clips. When I played them she immediately said "it sounds like this character". Well, spot on. Still not as good as I had hoped, but a significant improvement.

Right now I have a fine-tune running on my workstation with about 10 minutes of voice recordings. This is on the lower end of what should work, but it should be sufficient to give me an idea of how well this will go.
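As a small, hypothetical sanity check before kicking off a fine-tune (the helper and the file name below are my own illustration, not part of the Qwen3-TTS tooling), Python's stdlib wave module is enough to verify how much training audio you actually have:

```python
# Sum the playback length of a set of WAV recordings before fine-tuning.
import wave

def total_duration_seconds(paths):
    """Return the combined duration of the given WAV files in seconds."""
    total = 0.0
    for path in paths:
        with wave.open(path, "rb") as wav:
            total += wav.getnframes() / wav.getframerate()
    return total

# Demo with a synthetic one-second 16 kHz mono clip instead of real recordings.
with wave.open("demo_clip.wav", "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)                    # 16-bit samples
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)   # one second of silence

minutes = total_duration_seconds(["demo_clip.wav"]) / 60
print(f"{minutes:.2f} minutes of audio")
```

With real recordings you would pass the list of actual WAV files and check that the total lands somewhere near the 10-minute mark before spending GPU time.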

Overall, using a CPU instead of a GPU was fine for everything except the fine-tuning process. Inference certainly would not be too bad running on a Ryzen 5900X.

The final verdict is pending on how good the fine-tune will be. Freely generating a voice via prompt and saving the prompts for future reuse, so the voice does not drift, is "okay" and good enough if you just want something that sounds better than, say, the built-in voices on a Mac.
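A minimal sketch of what "saving the prompts for future reuse" can look like: a small JSON registry keyed by voice name, so the exact description and sampling parameters can be replayed later instead of being re-typed (and subtly changed) each time. The field names and parameters are my own illustration, not anything Qwen3-TTS prescribes:

```python
# Keep voice prompts and generation parameters in a JSON registry so a
# prompt-generated voice stays reproducible across sessions.
import json
from pathlib import Path

REGISTRY = Path("voices.json")

def save_voice(name, prompt, **params):
    """Store the prompt and generation parameters under a voice name."""
    voices = json.loads(REGISTRY.read_text()) if REGISTRY.exists() else {}
    voices[name] = {"prompt": prompt, "params": params}
    REGISTRY.write_text(json.dumps(voices, indent=2))

def load_voice(name):
    """Fetch a previously saved voice definition by name."""
    return json.loads(REGISTRY.read_text())[name]

save_voice("endirillia", "calm, slightly amused female voice",
           temperature=0.7, seed=1234)
print(load_voice("endirillia"))
```

At generation time you would look the voice up and feed the stored prompt and parameters back into whatever TTS call you use, which is what keeps the voice from drifting between runs.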

I am planning to compare this with a few commercial models to see if I am missing out on something, but based on what I saw when I last looked a few months ago, I do not expect the difference to be groundbreaking.

Voice Actor

For the final steps of the project I am actually considering hiring a proper voice actor (VA) to record at least the usual phrases: acknowledging that something was understood, reacting to a wake word and so on. As I mentioned in the original post, this is a passion project and an homage to a character I spent a lot of time with.

From a first look, voice actors come in every price range, as you would expect for a creative profession. Some are professional, some have amazing references and some seem to be doing it as a hobby. Curiously enough, the price per word does not seem to be closely related to past projects, experience or references.

The part that will be a bit sensitive to talk about is that I hope to fine-tune Qwen on these recordings to cover the sentences that were not recorded. With a voice assistant there is no way to pre-record everything needed. The training would happen in-house; neither the recordings nor the model would be used for commercial purposes or to produce any output I would share, except maybe a small demo when everything is done.

But with the current state of the world and the way creative professionals are treated, I fully expect people to not be very receptive to this idea. And I cannot and will not hold that against anyone. I will also not run a fine-tune like this without their permission. I strongly believe that even if I pay for their work - I pay for the recording and for the right to use the recording - this is not a free ticket to do whatever I want with it.

I am also very much aware that if you asked me whether running Linux on a PlayStation or macOS on an iPad should be at my discretion, my answer about what I should be allowed to do with what I purchased would be very different. The big difference here is that I am not buying an electronic device. I am paying for something a human creates with a very personal part of themselves: their voice.

Is it absolutely necessary to hire a VA? No, of course not. But even with the best models I have heard, the gap to an actual human who understands the emotions and the situation and puts them into a vocal performance is so big that it is most likely - pending final price negotiations - worth it to me for this specific project.

A small step for a competent 3D artist…

… a big leap for a creatively inexperienced software engineer. It has been roughly 30 days since I started working on the avatar for Endirillia. I would estimate I put in an hour or so a day. There were three or four days when I did not work on the project, but I also did a bit more work over the weekends. So I would estimate about 30 hours of work for what was a four-hour tutorial.

head!

There are some small adjustments I think I want to make, but I will leave them for the final revision when the rest of the body is also done. I was told it makes more sense to do these when seeing the greater picture. Hair and eye color are also not yet done. Once I break the mirror I have set up to make working on the face easier, I might go for heterochromia.

Next up is the base for the body, and I think this will be tricky, as I will have to move further away from the tutorial than I did while working on the base for the head.

While Blood Elves in World of Warcraft are already a thin line in the scenery, the tutorial I am following, which is supposed to be more oriented towards a Pixar style for characters, kind of drifts to an even thinner line, but one that looks like money was spent on silicone - not really the vibe I’m going for. I might not hit all the edits I want on the first try and might need some help getting to the final shape later on, but progress will look... "interesting".

Progress

Between finishing the head for the avatar, fixing scripts in the Qwen3-TTS repository to fine-tune models, and looking for potential voice actresses with a portfolio that suggests they might match the vibe and voice of Endirillia as I imagine her, there was not much other progress to be made.

What became very clear once more is that I need to take a look at the architecture again. I ran some benchmarks, and my assumptions from last week - that the 16GB Mac mini might not be too happy doing everything at once, and that I will need a larger model than expected - seem to be true. So it is back to the drawing board on how I will split up the services.

posted on March 15, 2026, 10:58 p.m. in AI, lazerbunny, project

I am perpetually a little bit annoyed by the state of software - projects constantly changing, being abandoned or adding features that make no sense for my use case - so I started writing small tools for myself which I use on a daily basis. And it has not only been fun, but also useful. For the rest of the year I will focus on a project I have been thinking about for a few years: Building a useful, personal AI assistant.