Claude Code transcript — Voice Evals Implementation

Context: This is the implementation session for experiments.n0tls.com/voice-evals/. It picks up directly after the ideation session where the project scope was defined and a CLAUDE.md was written. A new Claude session read that CLAUDE.md and built the whole tool from scratch in one sitting. Server IP addresses and other deployment details have been redacted.

40 prompts · 8 pages

#1

This is a project to do some open voice model evals, can you take a look at the files in place and let me know what you think is the next step

1 glob

The repo only has CLAUDE.md — no code yet. Based on the CLAUDE.md spec, the next step is scaffolding the backend:

  1. main.py — FastAPI app with /, /models, and /generate endpoints
  2. requirements.txt — FastAPI, uvicorn, python-multipart
  3. Static HTML frontend — served from /

Want me to build all three now? I can write main.py with the full backend logic (piper subprocess → ffmpeg → MP3) and the single-page frontend, then you can deploy to Hetzner.
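For reference, the piper → ffmpeg chain boils down to two command lines. A sketch of the argv builders (the helper names are mine, not the actual main.py; flag spellings follow the settings doc later in this session):

```python
def piper_args(model_path: str, length_scale: float,
               noise_scale: float, noise_w: float) -> list[str]:
    # piper reads text on stdin; "--output-file -" sends the raw WAV to stdout.
    return [
        "piper", "--model", model_path,
        "--length-scale", str(length_scale),
        "--noise-scale", str(noise_scale),
        "--noise-w", str(noise_w),
        "--output-file", "-",
    ]

def ffmpeg_args() -> list[str]:
    # ffmpeg reads the WAV from stdin (pipe:0) and writes MP3 to stdout (pipe:1).
    return ["ffmpeg", "-f", "wav", "-i", "pipe:0",
            "-f", "mp3", "pipe:1", "-loglevel", "error"]
```

The backend would run these as two subprocesses and stream the first's stdout into the second's stdin.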

#2

go ahead split it amongst a team of agents where it makes sense

3 agent

Three agents running in parallel:

  • Backend agent → main.py (FastAPI with /, /models, /generate endpoints, async piper→ffmpeg pipeline)
  • Frontend agent → static/index.html (dark-theme UI with sliders, model picker, audio player)
  • Infra agent → requirements.txt + deploy.sh (rsync + nohup deploy to Hetzner)

I'll let you know when they're all done.

#3

[Agent completed: Wrote main.py — FastAPI backend with /models (scans server for .onnx files) and /generate (chains piper → ffmpeg async, returns MP3 bytes). Listens on port 8765.]

#4

[Agent completed: Wrote requirements.txt (three pinned dependencies) and deploy.sh — rsync to the server, then SSH in to install deps, kill any running uvicorn, and start it with nohup.]

#5

[Agent completed: Wrote static/index.html — single-file dark-theme UI with model dropdown, three sliders (speed/phoneme/prosody) with live readouts, generate button with loading state, inline audio player, and error handling.]

All three agents done. Here's what's in place:

| File | What it does |
|---|---|
| `main.py` | FastAPI app — /models scans for .onnx files, /generate chains piper→ffmpeg async |
| `static/index.html` | Dark-theme UI with sliders, model picker, audio player, latency display |
| `requirements.txt` | fastapi, uvicorn[standard], python-multipart |
| `deploy.sh` | rsync to Hetzner + SSH to install deps and start uvicorn on port 8765 |

To deploy: bash deploy.sh — it'll rsync everything and start the server at http://[server-ip]:8765.

One thing to verify on the server first: the actual sample rate of your piper models. The code hardcodes 22050 Hz (correct for lessac-high) but other .onnx files may differ — check their .onnx.json config files if audio sounds warped.

#6

Get this set up so I can see it at experiments.n0tls.com — an index on that page, with the actual tool hosted at a path. Go ahead and use hcloud and ssh to do the deployment.

Let me know what I need to add to DNS to get that working

2 read · 2 edit · 1 agent

Everything is deployed and running. Here's what you need to do for DNS:

Add one DNS record:

A    experiments    [server-ip]

(i.e. experiments.n0tls.com → [server-ip])

Once that propagates, the app will be live at http://experiments.n0tls.com/voice-evals/ with the index at http://experiments.n0tls.com/. One Piper model was found: en_US-lessac-high.

If you want HTTPS, let me know — I can add a Let's Encrypt cert with certbot.

#7

As I generate these, I want them to be saved to the local FS so we can retrieve them later. I would like to know what parameters were used for each file, so we might need to store metadata somehow, either on the file or in a small db; you choose

4 edit · 2 read · 1 write · 1 bash

Deployed and running. Here's what changed:

Backend (main.py):

  • On startup, creates /opt/voice-evals/audio/ dir and initializes generations.db (SQLite)
  • /generate now saves each MP3 as audio/{uuid}.mp3 and inserts a row with all params + generation time. Returns an X-Generation-Id header.
  • GET /generations — returns last 200 generations newest-first
  • GET /audio/{id} — streams a saved MP3 by ID

Frontend:

  • History panel below the generate button, loaded on page open
  • Each new generation is prepended instantly (no reload needed)
  • Each history card shows: timestamp, model name, speed/phoneme/prosody values, generation time, and an inline audio player

The DB schema:

id, created_at, text, model, length_scale, noise_scale, noise_w, generation_time, file_path
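A minimal sketch of that schema as DDL (illustrative; the real main.py may name or type columns differently):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS generations (
    id              TEXT PRIMARY KEY,   -- uuid, also the MP3 filename stem
    created_at      TEXT NOT NULL,      -- ISO-8601 timestamp
    text            TEXT NOT NULL,
    model           TEXT NOT NULL,
    length_scale    REAL NOT NULL,
    noise_scale     REAL NOT NULL,
    noise_w         REAL NOT NULL,
    generation_time REAL NOT NULL,      -- seconds spent in piper+ffmpeg
    file_path       TEXT NOT NULL
)
"""

def init_db(path: str = "generations.db") -> sqlite3.Connection:
    """Open (or create) the SQLite db and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn
```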
#8

Add a field for each generation so i can rate it on my own preference scale while we try everything out

5 edit · 1 read · 1 bash
#9

Can you download the open weight voice models that are popular, pick the top 5

#10

Can you download the open weight voice models to the server so we can test them, pick the top 5 popular ones

2 edit · 1 agent · 1 bash

All 6 models showing in the dropdown. Here's what's now available:

| Model | Voice | Size | Notes |
|---|---|---|---|
| en_US-lessac-high | US male | 109 MB | Was already there |
| en_US-ryan-high | US male | 116 MB | High quality, community favourite |
| en_US-libritts_r-medium | US neutral | 75 MB | Trained on LibriTTS, very clean |
| en_US-amy-medium | US female | 61 MB | |
| en_GB-alan-medium | British male | 61 MB | |
| en_US-joe-medium | US male | 61 MB | |

Also fixed the hardcoded 22050 Hz — the backend now reads the correct sample rate from each model's .onnx.json config automatically.
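A sketch of that config lookup (the function name is mine; Piper ships each voice as `model.onnx` plus a `model.onnx.json` sidecar whose `audio.sample_rate` field holds the rate):

```python
import json
from pathlib import Path

def model_sample_rate(onnx_path: str, default: int = 22050) -> int:
    """Read the sample rate from a Piper model's .onnx.json sidecar config.

    Falls back to `default` (22050 Hz, correct for lessac-high) if the
    config is missing or malformed.
    """
    config = Path(onnx_path + ".json")
    try:
        with config.open() as f:
            return int(json.load(f)["audio"]["sample_rate"])
    except (OSError, KeyError, ValueError):
        return default
```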

#11

the slider for speed is unintuitive, moving sliders to higher values made it slower

14 edit · 4 grep · 3 read · 1 bash

Deployed. Two changes in one:

Speed slider fix — the slider now works as a speed multiplier (0.5x–2.0x). Right = faster. Under the hood it sends length_scale = 1 / speed to Piper, so the math is correct. The value displays with an x suffix (e.g. 1.00x).
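The mapping is trivial but easy to get backwards; a sketch with hypothetical helper names (the app's actual code lives in the frontend JS):

```python
def slider_to_length_scale(speed: float) -> float:
    """Map the UI speed multiplier (right = faster) to Piper's length_scale.

    Piper's length_scale stretches phoneme durations, so larger values
    are *slower*; inverting the slider value keeps the control intuitive.
    """
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed slider range is 0.5x-2.0x")
    return 1.0 / speed

def format_speed(speed: float) -> str:
    """Render the slider readout, e.g. 1.0 -> '1.00x'."""
    return f"{speed:.2f}x"
```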

Info icons — each slider label has a small i button. Click it to see a popover explaining what the parameter does and what to watch out for at extremes. Clicking again or anywhere else dismisses it.

#12

are there other models we can add? If something gets rated 4-5 stars i want us to add that to the list of things we show on a comparison matrix

7 edit · 1 read · 1 bash

Two things to address — let me answer the models question first, then implement the matrix.

More models: Beyond the 6 Piper voices already installed, the realistic additions for a CPU server are:

  • More Piper voices — there are ~30 English voices total (different speakers, accents, quality tiers). Easy drop-in.
  • Kokoro-82M — the most popular non-Piper open-weight TTS right now. ONNX, Apache 2.0, noticeably better quality than Piper. Runs ~0.5x realtime on CPU (~30s for a typical sentence). Would need a small backend addition.
  • Coqui XTTS v2 / Orpheus — good quality but much slower on CPU, probably not worth it for evals.

I'd suggest we add Kokoro after you've rated the Piper voices. Now let me build the matrix feature.

Plan: auto-add to matrix when rated ≥4, remove when rating drops below. New /matrix endpoint. Matrix section in the UI shows a scrollable grid of cards side by side.

Deployed. Here's what's new:

Comparison Matrix — lives above the history list. Any generation you rate 4 or 5 stars instantly appears there as a card with the model name, text snippet, params, and its own audio player. Dropping the rating below 4 removes it. Cards with 5 stars get a blue border to distinguish them.

On more models — once you've rated the Piper voices, the next worthwhile addition is Kokoro-82M. It needs about 10 lines of new backend code (different invocation from Piper — Python API rather than subprocess) and ~300MB download. Want me to add it now or after you've done a round of Piper evals?

#13

Can you delete all the files we've generated, I want to start fresh now that we have the UX I'll need to make a decision

#14

yes

3 bash
#15

Have a little icon that loads the settings of a voice in the comparison matrix into the evaluator slider settings, it should let me rapidly experiment on tuning the best ones so far.

2 edit · 1 read · 1 bash
#16

Is there a way for us to generate the file on our server, but the user downloads in browser to play so we can immediately remove the file once downloaded? I'd like to release this as a small little tool for folks wanting to learn what the switches do, but don't want to get abused. Also add a limit of 3x amount of words in this example

The Dimensionality Problem

There is an apparent contradiction at the heart of expertise. Expert judgement is learnable, in the sense that people demonstrably acquire it over time. It is also non-transmissible, in the sense that no expert can transfer their judgement to another person through explanation. If it was once learnable, why can it not be taught?

Anything else we should do to be able to release this for public consumption

3 bash · 2 write

Good instincts. Here's everything worth doing before public release:

  1. No disk persistence — already stream bytes in memory, just remove the save (done)
  2. Word limit — the sample text is ~57 words, so cap at 175 (roughly 3x)
  3. Concurrency semaphore — CPU-bound, only 1 TTS at a time; return 503 if busy rather than queuing
  4. Generation timeout — kill runaway piper/ffmpeg after 60s
  5. Input sanitization — strip control characters
  6. HTTPS — Let's Encrypt via certbot
  7. nginx rate limiting — 5 req/min per IP at the proxy level
  8. Strip the history/matrix UI — not meaningful for public users, and the DB adds unnecessary state
  9. "No data stored" notice — trust signal for users
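Item 3 plus item 4 can be sketched together. A minimal version, with hypothetical names (`busy_lock`, `generate_guarded`) rather than the project's actual identifiers:

```python
import asyncio

busy_lock = asyncio.Semaphore(1)   # CPU-bound: one TTS job at a time
TIMEOUT_S = 60

class ServerBusy(Exception):
    """Maps to an HTTP 503 in the real endpoint."""

async def generate_guarded(job):
    # Reject immediately instead of queuing, so callers get fast feedback.
    if busy_lock.locked():
        raise ServerBusy
    async with busy_lock:
        # Kill runaway piper/ffmpeg jobs after TIMEOUT_S.
        return await asyncio.wait_for(job(), timeout=TIMEOUT_S)
```

In FastAPI the `ServerBusy` case would be caught and returned as a 503 with a friendly message.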

Let me implement all of this:

Good news — your new message changes the architecture for the better. Yes, absolutely move storage client-side. Since there's no auth and every visitor is a stranger, per-browser storage makes much more sense, and the server stays as pure compute (just /models + /generate):

  • Each user gets their own private history and matrix — nothing shared, nothing to leak
  • Backend stays stateless — just /models and /generate, no DB to maintain or back up
  • Scales for free — server only does CPU work, not storage
  • Better privacy story — genuinely "nothing stored on our servers"

The right tool is IndexedDB (not localStorage): it can store audio Blobs directly, has effectively unlimited capacity per-browser, and the audio never has to hit the server again for playback. localStorage can hold the rating/settings metadata as a lightweight index.

Let me write the new frontend with all of this built in:

All deployed. Here's the full picture:

What changed:

  • No server-side storage — backend is now stateless. Files are never written to disk. DB is gone.
  • Client-side IndexedDB — history, ratings, audio blobs, and the comparison matrix all live in the user's browser. Each visitor has their own isolated state. Audio stays playable across page reloads without re-generating.
  • Word limit — 175 words, enforced both client-side (live counter, red border) and server-side
  • Concurrency guard — only one TTS job runs at a time; returns a friendly "Server is busy" notice instead of queuing
  • 60s timeout — runaway jobs are killed
  • nginx rate limit — 5 requests/minute per IP, burst of 3
  • X-Robots-Tag: noindex — search engines won't index it
  • Input sanitization — control characters stripped server-side
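The word limit and sanitization guards are a few lines; a sketch of the server-side half (hypothetical function name, and the real handler returns HTTP 400 rather than raising):

```python
import re

MAX_WORDS = 175  # 3x the sample text

def clean_and_check(text: str) -> str:
    """Strip control characters (keeping tab/newline) and enforce the word cap."""
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", text).strip()
    if len(text.split()) > MAX_WORDS:
        raise ValueError(f"text exceeds {MAX_WORDS} words")
    return text
```

The client mirrors this with a live counter so most users never hit the server-side rejection.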

To finish HTTPS — once your DNS A record for experiments.n0tls.com is live, run this on the server:

! ssh root@[server-ip] "apt-get install -y certbot python3-certbot-nginx && certbot --nginx -d experiments.n0tls.com"

Certbot will rewrite the nginx config automatically to add the SSL block and redirect.

#17

check dns yourself it should be there, and run the command for me when it works

2 bash
#18

Can you see old db values that we had for what I previously liked? I forgot to write it down

1 bash

Good news — the DB was still there (we cleared it of files but not the table rows). Here are your rated entries, cleaned up. Note: length_scale stored is the raw Piper value, so speed = 1/length_scale.

5-star ⭐⭐⭐⭐⭐

| Model | Speed | Phon | Prosody |
|---|---|---|---|
| en_GB-alan-medium | 1.25x | 0.70 | 1.20 |
| en_US-amy-medium | 1.25x | 0.15 | 0.55 |
| en_US-libritts_r-medium | 1.00x | 0.80 | 0.90 |
| en_US-libritts_r-medium | 1.00x | 0.75 | 0.90 |
| en_US-libritts_r-medium | 1.00x | 0.75 | 0.05 |

4-star ⭐⭐⭐⭐

| Model | Speed | Phon | Prosody |
|---|---|---|---|
| en_GB-alan-medium | 1.25x | 0.05 | 1.15 |
| en_GB-alan-medium | 1.25x | 0.70 | 0.15 |
| en_US-amy-medium | 1.25x | 0.15 | 1.20 |
| en_US-amy-medium | 1.25x | 0.15 | 0.30 |

Pattern: you consistently preferred 1.25x speed on Alan and Amy, and normal speed on libritts. libritts appeared three times in your 5-stars, all at phoneme 0.75–0.80 — looks like a clear winner worth revisiting.

#19

Can you write the settings into a markdown document i can paste into another agent

For the following:

en_US-amy-medium settings under 5 stars en_US-libritts_r-medium

#20

Can you write the settings into a markdown document i can paste into another agent

For the following:

en_US-amy-medium settings under 5 stars en_US-libritts_r-medium All three of these settings

## Top-Rated Piper Voice Settings

### en_US-amy-medium — ⭐⭐⭐⭐⭐

| Parameter | CLI flag | Value |
|---|---|---|
| Model | `--model` | `en_US-amy-medium.onnx` |
| Speed | `--length-scale` | `0.8` |
| Phoneme variation | `--noise-scale` | `0.15` |
| Prosody variation | `--noise-w` | `0.55` |

---

### en_US-libritts_r-medium — ⭐⭐⭐⭐⭐ (3 variants)

**Variant A**

| Parameter | CLI flag | Value |
|---|---|---|
| Model | `--model` | `en_US-libritts_r-medium.onnx` |
| Speed | `--length-scale` | `1.0` |
| Phoneme variation | `--noise-scale` | `0.80` |
| Prosody variation | `--noise-w` | `0.90` |

**Variant B**

| Parameter | CLI flag | Value |
|---|---|---|
| Model | `--model` | `en_US-libritts_r-medium.onnx` |
| Speed | `--length-scale` | `1.0` |
| Phoneme variation | `--noise-scale` | `0.75` |
| Prosody variation | `--noise-w` | `0.90` |

**Variant C**

| Parameter | CLI flag | Value |
|---|---|---|
| Model | `--model` | `en_US-libritts_r-medium.onnx` |
| Speed | `--length-scale` | `1.0` |
| Phoneme variation | `--noise-scale` | `0.75` |
| Prosody variation | `--noise-w` | `0.05` |

---

> `length-scale` is the raw Piper value (lower = faster). The UI displayed speed as `1/length-scale` (e.g. `0.8` → `1.25x`).
#21

[Image #1] slight bug with how the load button loaded here, it has a border outside the bigger box, fix that

#22

[User attached screenshot]

1 edit · 1 bash
#23

still looks incorrect, do we need to restart services?

2 edit · 1 bash

The static file is read fresh each request so no restart needed for that — but let me look more carefully at the actual bug. The mc-load-btn is a <button>, not a <label>, so my last fix didn't address it. The real issue is the card has border-radius but no overflow: hidden, so the button's own border peeks out at the corner.

#24

Add an exception to nginx for the IP im currently going to the internet out

2 bash
#25

the load button now looks like its behind the background, it really just needs to shift a little to the left and it'll be fine

1 edit · 1 bash
#26

[Image #2] still there, not sure whats happening attached an image

#27

[User attached screenshot]

2 edit · 1 bash

The button is getting clipped by overflow: hidden on the card. The simplest real fix: remove overflow: hidden from the card and round the corners of the inner elements directly, so nothing needs to be clipped at all.

#28

[Image #3] it still exceeds the outer border slightly

#29

[User attached screenshot]

1 edit · 1 bash
#30

Can you make the arrow a tad bigger, it's tiny

1 edit · 1 bash
#31

Are these models ever able to be run in the browser? Or is this something that generally happens on a server, or a decent machine rather than in browser?

TTS models can run in the browser, but with real trade-offs:

What works in-browser today:

  • Piper — there's a WebAssembly port (piper-wasm) that runs entirely client-side. Quality is identical, but inference is 3–5x slower than native since WASM doesn't fully utilize SIMD. On a modern laptop it's usable; on mobile it's rough.
  • Kokoro — runs via ONNX Runtime Web (WebAssembly + WebGPU backend). With WebGPU on a decent GPU-equipped machine it's actually fast. Without GPU it's slow but functional.
  • Small VITS models — similar story, ONNX Runtime Web handles them fine.

What doesn't work well in-browser:

  • Anything over ~200MB — model download before first use is a non-starter for casual users
  • Models requiring custom CUDA kernels or complex pre/post-processing
  • Mobile devices — even medium models are painful

The practical split:

| Scenario | Best approach |
|---|---|
| Personal tool / power users | Server (what you have now) |
| Public demo, desktop users | Either — WASM/WebGPU is viable |
| Public demo, mobile users | Server — no question |
| Offline / privacy-first app | In-browser WASM |

For your use case specifically — since this is a public eval tool on a shared CPU server, server-side is the right call. The main risk in-browser would be that a phone user waits 30 seconds for a 150MB model to download and then another 30s to generate. Not a great first impression.

If you wanted to go in-browser for a future version, Kokoro via ONNX Runtime Web + WebGPU would be the one worth trying — it's the best quality-to-size ratio and the ONNX ecosystem has solid browser support.

#32

Does anyone have a tool like this that you know of?

A few exist but none quite like what you've built:

Closest things:

  • Hugging Face Spaces — individual model demos exist (e.g. for Kokoro, Parler-TTS), but they're one model at a time, no parameter sliders, no side-by-side comparison, no rating system. You have to bounce between tabs to compare.
  • TTS Arena (by Hugging Face) — blind A/B listening test to rank models, but it's fixed samples, no parameter control, and you're voting on quality, not tuning settings.
  • Coqui Studio (now defunct) — had a decent UI but was commercial, closed source, and is gone.
  • elevenlabs.io / play.ht — commercial, closed models, no parameter exposure.

What you have that none of them do:

  • Sliders for the actual synthesis parameters, not just voice selection
  • Rating system tied to specific parameter combos
  • Comparison matrix of your personal top picks
  • History with exact settings so you can reproduce anything
  • Self-hosted, so you control the models and compute

The gap in the market is specifically the parameter exploration angle — most tools treat TTS as a black box and just let you pick a voice. The fact that yours exposes noise_scale, noise_w, and length_scale with instant feedback is genuinely useful for anyone trying to understand what these knobs actually do, which is the stated goal.

If you open sourced it with a good README it'd likely get traction in the self-hosting / local AI communities (r/selfhosted, Hacker News, Hugging Face etc).

#33

Can you put some lorem ipsum for the sample text

1 edit · 1 bash
#34

Er actually, it needs to be in english otherwise it might be gibberish

1 edit · 1 bash