Speech Studio
Open-source desktop app for local voice cloning and multi-speaker dialog generation. Drop a voice sample, clone it, write a scene, and synthesize, all on your laptop. No API keys, no cloud, no per-character pricing.
A 30-second blind test: a real voice, the same voice cloned locally by Speech Studio, and the same voice cloned by ElevenLabs in the cloud. Can you tell which is which?
What it does
- Voice cloning from a short reference — drop in a few seconds of speech, clone the voice locally.
- Multi-speaker dialog generation — write a scene with multiple speakers, synthesize all of them in one pass.
- Runs entirely on your machine — VoxCPM2 (via MLX on macOS, via LiteRT on Windows and Linux), DeepFilterNet3 for noise suppression, no network required once the models have been downloaded on first run.
- Open source under Apache 2.0 — fork it, embed it, build on it.
Requirements
- macOS 15+, Windows 10+, or Linux
- Apple Silicon on Mac; any modern 64-bit (x64) CPU on Windows/Linux
- 8 GB RAM minimum (16 GB recommended)
- ~3–5 GB disk for the speech models (downloaded on first run)
Install
Download the build for your platform from GitHub Releases (macOS .dmg, Windows .msi/.exe, or Linux .deb/.AppImage), then launch it.
The builds are unsigned: on macOS open via right-click → Open (or System Settings → Privacy & Security → Open anyway); on Windows choose More info → Run anyway in SmartScreen. First launch downloads the VoxCPM2 speech model (~2.75 GB on macOS, ~4.6 GB on Windows/Linux) and caches it; later launches reuse the cache.
The same voice cloning pipeline ships in the speech CLI. Install it with brew install speech, then run speech speak --engine voxcpm2 --voxcpm2-ref-audio reference.wav -o cloned.wav "Hello, this is my cloned voice." The CLI is useful for scripting or pre-rendering batches. See the voice cloning guide for the full flow.
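For batch pre-rendering, the CLI call above can be wrapped in a small script. The following is a minimal Python sketch, not part of Speech Studio or the speech CLI. It uses only the flags shown in the command above; the helper names (build_speak_command, render_batch) and the line_001.wav output naming are hypothetical. By default it only builds the commands. Passing dry_run=False actually invokes speech, which assumes the CLI is installed and on your PATH.

```python
import shlex
import subprocess
from pathlib import Path


def build_speak_command(text, ref_audio, out_path):
    """Return the argv list for one `speech speak` voice-cloning call.

    Only the flags from the documented example command are used.
    """
    return [
        "speech", "speak",
        "--engine", "voxcpm2",
        "--voxcpm2-ref-audio", str(ref_audio),
        "-o", str(out_path),
        text,
    ]


def render_batch(lines, ref_audio, out_dir, dry_run=True):
    """Build one command per line of text and optionally run it.

    Output files are named line_001.wav, line_002.wav, ... (a hypothetical
    naming scheme chosen for this sketch).
    """
    out_dir = Path(out_dir)
    commands = []
    for index, text in enumerate(lines, start=1):
        cmd = build_speak_command(text, ref_audio, out_dir / f"line_{index:03d}.wav")
        commands.append(cmd)
        if not dry_run:
            # Assumes the `speech` CLI is installed and on PATH.
            out_dir.mkdir(parents=True, exist_ok=True)
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    script = [
        "Hello, this is my cloned voice.",
        "And this is a second line.",
    ]
    # Dry run: print the shell-quoted commands without executing them.
    for cmd in render_batch(script, "reference.wav", "out"):
        print(shlex.join(cmd))
```

Run as a script, this prints one shell-quoted command per line of text, so the batch can be reviewed before anything is rendered.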
Speech Studio is in active preview (v0.0.4), with installers for macOS, Windows, and Linux — macOS clones via MLX, Windows and Linux via speech-core's LiteRT VoxCPM2 engine. The source repo at github.com/soniqo/speech-studio tracks the GUI app; star/watch it for release notifications.
What it's built on
Speech Studio is a thin GUI on top of speech-swift, the open-source Swift library that ships every model used in the demo:
- VoxCPM2 — the voice cloning model (zero-shot, short reference)
- DeepFilterNet3 — denoises the reference and the cloned output
- Qwen3-ASR — aligns speech to text (used in the demo's blind-test build pipeline)
- Forced Alignment — word-level timestamps for editing
- Voice Cloning guide — full overview of the pipeline
Roadmap
- Today: macOS, Windows, and Linux.
- Next: signed & notarized builds (no Gatekeeper/SmartScreen prompts).
- After that: deeper editing surface, plugin support for swappable cloning models.
Feedback
Open an issue at github.com/soniqo/speech-studio/issues — every one gets read.