That's a fun evaluation tool. Let me make sure I understand the shape of it before we build anything:
The idea as I understand it:
- Single web page, local dev tool (not production)
- Fixed sample phrase (or user-entered text)
- Per-model panel with sliders for that model's parameters
- Hit "Generate" → calls a backend → plays the audio back in the browser
- Bonus: a "compare" matrix view that shows all combinations at once
Models + levers I know about:
| Model | Levers |
| --- | --- |
| Piper (already installed) | `length_scale` (speed), `noise_scale` (phoneme variation), `noise_w` (prosody timing), voice model selection |
| Kokoro-82M (ONNX, Apache 2.0) | voice/style selection, speed |
| Coqui XTTS | speaker embedding, language, speed |
| Orpheus | emotion tags, voice |
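One way to make that table machine-readable is a per-model slider schema the frontend renders from. The parameter names below come from the table; the ranges and defaults are placeholder assumptions to be tuned once we check each model's docs.

```python
# Hypothetical slider schema for the frontend. Parameter names are from the
# table above; min/max/default values are illustrative guesses, not verified
# model limits.
MODEL_PARAMS = {
    "piper": {
        "length_scale": {"min": 0.5, "max": 2.0, "default": 1.0},   # speed
        "noise_scale": {"min": 0.0, "max": 1.0, "default": 0.667},  # phoneme variation
        "noise_w": {"min": 0.0, "max": 1.0, "default": 0.8},        # prosody timing
    },
    "kokoro": {
        "speed": {"min": 0.5, "max": 2.0, "default": 1.0},
    },
    "xtts": {
        "speed": {"min": 0.5, "max": 2.0, "default": 1.0},
    },
}

def defaults(model: str) -> dict:
    """Return the default value for each slider of a model."""
    return {name: spec["default"] for name, spec in MODEL_PARAMS[model].items()}
```

The backend can reuse the same schema to validate incoming slider values, so frontend and backend never disagree on what a model accepts.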
The stack question: these TTS models all run in Python, so the backend would naturally be a small FastAPI app that shells out to (or imports) each model behind a common synthesis endpoint. The frontend is plain HTML/JS with sliders. Fully standalone, no build step.
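For Piper, "shelling out" could look like the sketch below: a helper that turns slider values into a CLI invocation, which the FastAPI endpoint would run via `subprocess` and return as a WAV. The flag names are my assumption about Piper's CLI and should be checked against `piper --help` before wiring this up.

```python
import subprocess
from pathlib import Path

def piper_cmd(model_path: str, out_wav: str,
              length_scale: float = 1.0,
              noise_scale: float = 0.667,
              noise_w: float = 0.8) -> list[str]:
    """Build a Piper CLI invocation (flag names assumed; verify locally).

    Text is fed via stdin, audio is written to `out_wav`.
    """
    return [
        "piper",
        "--model", model_path,
        "--output_file", out_wav,
        "--length_scale", str(length_scale),
        "--noise_scale", str(noise_scale),
        "--noise_w", str(noise_w),
    ]

def synthesize(text: str, model_path: str, out_wav: str, **params) -> Path:
    """Run Piper on `text` and return the path of the generated WAV."""
    subprocess.run(piper_cmd(model_path, out_wav, **params),
                   input=text.encode(), check=True)
    return Path(out_wav)
```

A FastAPI route would then call `synthesize(...)` into a temp file and hand it back with a `FileResponse(media_type="audio/wav")`, which the browser's `<audio>` element can play directly.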
A few things to nail down before I start:
- Where does this run? Your Mac locally, or a remote server?
- Which models do you actually have installed / want to include first? Piper is ready. Do you have Kokoro or others set up?
- Matrix vs interactive? Do you want the "run all combinations at once" grid view, or just one model/config at a time?