Claude Code transcript — Voice Evals Ideation

Context: Up until this point in the session, I was working on reader.n0tls.com — building and refining the podcast/reader backend. This is the moment where I pivoted to planning the voice-evals experiment. Claude couldn't find any prior notes on the idea (they weren't written down), so we rebuilt the scope from scratch in conversation before handing off to a new agent session.

3 prompts · excerpt from 2026-04-05

#69

I was curious about all the open-weight models we could pull levers on, to see what qualities we can change in the voice models. I used the word "lever" for sure, so maybe grep for that.

5 bash

Not there either. It wasn't documented anywhere in the repos or memory files — it was likely just a conversation that didn't make it into a file.

Let's just start fresh. Tell me what you remember about the scope — were you thinking about:

  • Piper specifically (tuning length_scale, noise_scale, noise_w)
  • Comparing multiple models (Piper, Kokoro, Coqui, Orpheus, etc.)
  • Both — a grid of models × parameter settings with audio samples output for each cell?
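The "both" option — a grid of models × parameter settings, one audio sample per cell — can be sketched as a product over each model's configuration space. A minimal sketch; the model names and candidate values below are illustrative placeholders, not the project's actual choices:

```python
import itertools

# Illustrative grid: each model with a few candidate values per lever.
# Models and value ranges here are placeholders, not final choices.
GRID = {
    "piper": {"length_scale": [0.8, 1.0, 1.2], "noise_scale": [0.333, 0.667]},
    "kokoro": {"speed": [0.9, 1.0, 1.1]},
}

def grid_cells(grid):
    """Yield (model, settings) pairs for every cell of the model x parameter grid."""
    for model, params in grid.items():
        names = sorted(params)
        for values in itertools.product(*(params[n] for n in names)):
            yield model, dict(zip(names, values))

cells = list(grid_cells(GRID))
# Each cell would get one synthesized sample, named after its settings,
# e.g. piper_length_scale-0.8_noise_scale-0.333.wav
```

Each cell maps to one generated audio file, which is what the later "compare matrix" view would render.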

#70

Yes, I was thinking of maybe making a little web-app page that can demo different models, with sliders to change settings, and then re-encode the same phrase through the different levers we can pull on the models.

That's a fun evaluation tool. Let me make sure I understand the shape of it before we build anything:

The idea as I understand it:

  • Single web page, local dev tool (not production)
  • Fixed sample phrase (or user-entered text)
  • Per-model panel with sliders for that model's parameters
  • Hit "Generate" → calls a backend → plays the audio back in the browser
  • Bonus: a "compare" matrix view that shows all combinations at once

Models + levers I know about:

| Model | Levers |
| --- | --- |
| Piper (already installed) | `length_scale` (speed), `noise_scale` (phoneme variation), `noise_w` (prosody timing), voice model selection |
| Kokoro-82M (ONNX, Apache 2.0) | voice/style selection, speed |
| Coqui XTTS | speaker embedding, language, speed |
| Orpheus | emotion tags, voice |

The stack question: these TTS models are all Python-based, so the backend would naturally be a small FastAPI app that shells out to each model. The frontend is plain HTML/JS with sliders. Totally standalone.
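The synthesis path of that backend could look roughly like this. A sketch, assuming the `piper` CLI's flag names and defaults (`length_scale` 1.0, `noise_scale` 0.667, `noise_w` 0.8); the FastAPI wiring is shown only as comments, and the endpoint/request shapes are assumptions:

```python
import pathlib
import subprocess
import tempfile

def piper_command(model: str, out_path: str, length_scale: float = 1.0,
                  noise_scale: float = 0.667, noise_w: float = 0.8) -> list[str]:
    """Build the piper CLI invocation (flag names assumed from the piper binary)."""
    return [
        "piper",
        "--model", model,
        "--length_scale", str(length_scale),
        "--noise_scale", str(noise_scale),
        "--noise_w", str(noise_w),
        "--output_file", out_path,
    ]

def synthesize(text: str, model: str, **levers) -> bytes:
    """Shell out to piper (text arrives on stdin) and return the WAV bytes."""
    with tempfile.TemporaryDirectory() as td:
        out = str(pathlib.Path(td) / "out.wav")
        subprocess.run(piper_command(model, out, **levers),
                       input=text.encode("utf-8"), check=True)
        return pathlib.Path(out).read_bytes()

# A hypothetical FastAPI /generate endpoint would be a thin wrapper, roughly:
#   @app.post("/generate")
#   def generate(req: GenerateRequest):
#       wav = synthesize(req.text, req.model, **req.levers)
#       return Response(content=wav, media_type="audio/wav")
```

Keeping the command builder separate from the subprocess call makes the lever-to-flag mapping easy to unit-test without a model installed.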

A few things to nail down before I start:

  1. Where does this run? Your Mac locally, or a remote server?
  2. Which models do you actually have installed / want to include first? Piper is ready. Do you have Kokoro or others set up?
  3. Matrix vs interactive? Do you want the "run all combinations at once" grid view, or just one model/config at a time?

#71
  1. I want to run these locally and remotely, but mostly remotely for this first pass as I don't want to overwork my tiny laptop.
  2. Just piper on the remote server
  3. Interactive for now, matrix when we have final outputs

Can you create a new private repo for this project under my GitHub username, at redacted/open-voice-model-evals? Once that's done, drop our notes about this specific endeavour in there so a new Claude session can read them.

6 bash · 1 write

Done. Private GitHub repo created — URL omitted

CLAUDE.md has everything a new session needs to pick this up:

  • Goal and phase 1 scope
  • Remote server details + where Piper lives [server connection info redacted]
  • All four Piper levers with CLI flags, defaults, and ranges
  • Example subprocess invocation
  • API endpoints to build (/, /models, /generate)
  • Frontend behaviour spec
  • Future models table
  • Decision log

When you're ready to build, just open a new session from the project directory and it'll have full context.