We cloned one FLEURS reference speaker per language, generated the same short benchmark sentence in each language, and scored each output on speaker similarity, ASR recovery, and UTMOS-predicted quality. Every row below has the reference clip and the generated output next to the numbers.
FLEURS test/en_us/1042003289011443756.wav
| Model | Precision | Cosine | ASR | UTMOS | Audio | RTF |
|---|---|---|---|---|---|---|
| OmniVoice | int8 | 0.701 | 0.0% WER | 3.96 | 3.80 s | 0.48 |
| Chatterbox Multilingual | fp16 | 0.625 | 0.0% WER | 4.00 | 3.80 s | 0.94 |
| VoxCPM2 | bf16 | 0.624 | 0.0% WER | 3.46 | 3.68 s | 1.84 |
| Fish Audio S2 Pro | fp16 | 0.590 | 0.0% WER | 4.20 | 3.44 s | 3.15 |
| Qwen3-TTS Base | 1.7B bf16 ICL | 0.159 | 0.0% WER | 1.93 | 4.55 s | 1.82 |
FLEURS test/de_de/10342213717361642954.wav
| Model | Precision | Cosine | ASR | UTMOS | Audio | RTF |
|---|---|---|---|---|---|---|
| OmniVoice | int8 | 0.837 | 0.0% WER | 3.11 | 3.47 s | 0.50 |
| Qwen3-TTS Base | 1.7B bf16 ICL | 0.765 | 9.1% WER | 3.53 | 3.66 s | 1.78 |
| Fish Audio S2 Pro | fp16 | 0.749 | 0.0% WER | 3.46 | 3.76 s | 3.08 |
| VoxCPM2 | bf16 | 0.733 | 9.1% WER | 2.92 | 4.16 s | 1.75 |
| Chatterbox Multilingual | fp16 | 0.727 | 9.1% WER | 3.56 | 3.92 s | 0.87 |
FLEURS test/ar_eg/10863341459609935739.wav
| Model | Precision | Cosine | ASR | UTMOS | Audio | RTF |
|---|---|---|---|---|---|---|
| VoxCPM2 | bf16 | 0.757 | 0.0% WER | 2.60 | 3.20 s | 1.82 |
| Fish Audio S2 Pro | fp16 | 0.686 | 14.3% WER | 3.31 | 3.48 s | 3.13 |
| Chatterbox Multilingual | fp16 | 0.674 | 0.0% WER | 3.37 | 4.40 s | 0.88 |
| OmniVoice | int8 | 0.621 | 0.0% WER | 2.99 | 3.97 s | 0.47 |
FLEURS test/es_419/16388069031423373053.wav
| Model | Precision | Cosine | ASR | UTMOS | Audio | RTF |
|---|---|---|---|---|---|---|
| OmniVoice | int8 | 0.684 | 0.0% WER | 2.73 | 5.26 s | 0.36 |
| Chatterbox Multilingual | fp16 | 0.670 | 0.0% WER | 3.44 | 4.02 s | 0.92 |
| VoxCPM2 | bf16 | 0.658 | 0.0% WER | 2.64 | 3.36 s | 1.80 |
| Fish Audio S2 Pro | fp16 | 0.584 | 0.0% WER | 2.68 | 3.53 s | 3.15 |
| Qwen3-TTS Base | 1.7B bf16 ICL | 0.493 | 0.0% WER | 2.24 | 3.98 s | 1.80 |
FLEURS test/cmn_hans_cn/5479411876618006152.wav
| Model | Precision | Cosine | ASR | UTMOS | Audio | RTF |
|---|---|---|---|---|---|---|
| OmniVoice | int8 | 0.690 | 0.0% CER | 3.11 | 4.00 s | 0.42 |
| VoxCPM2 | bf16 | 0.658 | 0.0% CER | 3.10 | 2.88 s | 1.90 |
| Qwen3-TTS Base | 1.7B bf16 ICL | 0.655 | 0.0% CER | 3.64 | 3.42 s | 1.80 |
| Fish Audio S2 Pro | fp16 | 0.598 | 0.0% CER | 3.90 | 3.11 s | 3.20 |
Higher speaker cosine means the generated clip is closer to the FLEURS reference speaker embedding. Lower WER/CER means Qwen3-ASR recovered the requested text more cleanly. Higher UTMOS is a no-reference predicted naturalness score on a 1-5 scale. Lower RTF is faster. These are engineering regression metrics, not a human MOS panel. Precision is listed per row: VoxCPM2’s public full-precision Swift path is bf16, while OmniVoice is shown with the published int8 bundle because the fp16 backbone was not used for this published run. Quantized rows can change both quality and speed, so the table only includes the actual path measured for each row.
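As a rough illustration of the two purely arithmetic metrics above, the sketch below computes a cosine similarity between two embedding vectors and a real-time factor. It is not the benchmark's own code: the function names and the toy vectors are made up for illustration, and the RTF definition (synthesis time divided by generated audio duration) is the conventional one, assumed here because the page only states that lower is faster.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (range -1 to 1)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Assumed definition: wall-clock synthesis time / generated audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


# Toy 3-dimensional vectors, purely illustrative (not real speaker embeddings).
reference = [0.2, 0.7, 0.1]
generated = [0.25, 0.6, 0.2]
print(f"cosine = {cosine_similarity(reference, generated):.3f}")

# Under the assumed definition, RTF 0.48 on a 3.80 s clip (the OmniVoice
# English row) would correspond to roughly 1.82 s of synthesis time.
print(f"RTF = {real_time_factor(1.824, 3.80):.2f}")
```

An RTF below 1.0 under this definition means the clip was generated faster than it takes to play back.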
| Source | Role | Rate |
|---|---|---|
| FLEURS references | Input reference | 16 kHz |
| OmniVoice / Chatterbox | Generated output | 24 kHz |
| Qwen3-TTS Base ICL | Generated output | 24 kHz |
| Fish Audio S2 Pro | Generated output | 44.1 kHz |
| VoxCPM2 | Generated output | 48 kHz |
The references are 16 kHz, so they only contain observable evidence up to roughly 8 kHz. Higher-rate outputs can sound more open because they preserve or synthesize more high-band breath, sibilance, and room detail, but 44.1 or 48 kHz output does not automatically mean a better clone. Most speaker identity and intelligibility cues live well below that high band, while excessive high-frequency energy can make a voice feel sharp even when WER and speaker cosine look good.
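The 8 kHz figure is the Nyquist limit, half the sample rate. The short sketch below applies that to the rates in the table and shows how much of each output band lies above anything a 16 kHz reference can contain. The helper names are illustrative, not part of any tool mentioned here.

```python
REFERENCE_RATE_HZ = 16_000

# Output sample rates as listed in the table above.
OUTPUT_RATES_HZ = {
    "OmniVoice / Chatterbox": 24_000,
    "Qwen3-TTS Base ICL": 24_000,
    "Fish Audio S2 Pro": 44_100,
    "VoxCPM2": 48_000,
}


def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency a signal sampled at this rate can represent."""
    return sample_rate_hz / 2.0


def band_above_reference_hz(output_rate_hz: float,
                            reference_rate_hz: float = REFERENCE_RATE_HZ) -> float:
    """Width of the output band that the reference clip cannot provide evidence for."""
    return max(0.0, nyquist_hz(output_rate_hz) - nyquist_hz(reference_rate_hz))


for name, rate in OUTPUT_RATES_HZ.items():
    top_khz = nyquist_hz(rate) / 1000
    extra_khz = band_above_reference_hz(rate) / 1000
    print(f"{name}: band up to {top_khz:.2f} kHz, "
          f"{extra_khz:.2f} kHz of it above the reference limit")
```

Anything in that extra band is necessarily synthesized by the model rather than copied from the reference, which is why a higher output rate alone says little about clone fidelity.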
FLEURS test/en_us/1042003289011443756.wav
Reference transcript: The Internet combines elements of both mass and interpersonal communication.
Generated text: This is a short voice cloning benchmark for on-device speech.
FLEURS test/de_de/10342213717361642954.wav
Reference transcript: Es ist also möglich, dass der Vermerk einfach als Kennzeichnung hinzugefügt wurde.
Generated text: Dies ist ein kurzer Benchmark für lokale Sprachklonung auf dem Gerät.
FLEURS test/ar_eg/10863341459609935739.wav
Reference transcript: لا تشوه الموقع بوضع علامات أو الكتابات الخادشة على الجدران في المباني.
Generated text: هذا اختبار قصير لاستنساخ الصوت على الجهاز.
FLEURS test/es_419/16388069031423373053.wav
Reference transcript: Internet une y mezcla componentes propios de la comunicación masiva y entre personas.
Generated text: Esta es una breve prueba de clonación de voz en el dispositivo.
FLEURS test/cmn_hans_cn/5479411876618006152.wav
Reference transcript: 互联网结合了大众传播和人际传播的要素。
Generated text: 这是一个简短的本地语音克隆测试。
References are single clips from the Google FLEURS test split: English, German, Arabic, Spanish, and Mandarin Chinese. For engines that accept a reference transcript, the exact FLEURS transcript was passed with the audio prompt.
The score shape mirrors the objective side of VoxCPM-style voice cloning evaluation: intelligibility via WER/CER, and cloning via speaker-embedding cosine similarity. The speaker encoder here is Soniqo’s `speech embed-speaker --engine mlx`, so compare rows inside this table, not against paper SIM percentages directly.
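For readers who want to see how the WER/CER percentages arise, here is a minimal sketch using a plain Levenshtein distance over words (WER) or characters (CER). The normalization (lowercasing and stripping punctuation) is an assumption, since the page does not describe the exact text normalization used, and the misrecognized German hypothesis is hypothetical. It does show that a single word error in the 11-word German prompt gives 9.1%, which matches the value in the table; likewise one error in the 7-word Arabic prompt would give 14.3%.

```python
import re
from typing import Sequence


def normalize(text: str) -> str:
    """Assumed normalization: lowercase and drop punctuation."""
    return re.sub(r"[^\w\s]", "", text.lower()).strip()


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    return edit_distance(ref, hyp) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, used for Chinese where word boundaries are unmarked."""
    ref = list(re.sub(r"\s+", "", normalize(reference)))
    hyp = list(re.sub(r"\s+", "", normalize(hypothesis)))
    return edit_distance(ref, hyp) / len(ref)


german_ref = "Dies ist ein kurzer Benchmark für lokale Sprachklonung auf dem Gerät."
# Hypothetical ASR output with one wrong word (not the actual transcript).
german_hyp = "Dies ist ein kurzer Benchmark für lokale Sprachklonen auf dem Gerät"
print(f"WER = {100 * wer(german_ref, german_hyp):.1f}%")

chinese_ref = "这是一个简短的本地语音克隆测试。"
print(f"CER = {100 * cer(chinese_ref, chinese_ref):.1f}%")
```

With prompts this short, each error moves the score by a large step, so a 9.1% row and a 0.0% row differ by exactly one word.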
UTMOS is computed with `utmos22_strong` from SpeechMOS after resampling generated clips to 16 kHz. It gives a no-reference naturalness signal that WER and speaker cosine miss, but it is still a model prediction rather than a human listening study.
Qwen3-TTS Base is measured with the public speech-swift ICL API (`Qwen3TTSModel.fromPretrainedWithEncoder` + `synthesizeWithVoiceCloneICL`) using the same FLEURS reference audio and transcript. It is omitted for Arabic because the current Qwen3-TTS language set exposed here does not include Arabic.
Chatterbox’s upstream language list includes Chinese, but the current Swift frontend only supports the direct tokenizer path for `en`, `ar`, `hi`, `de`, `es`, `fr`, `it`, and `pt`; the Chinese row is intentionally omitted until that frontend lands.
```bash
# Public speech-swift CLI example for one generated row.
speech speak "$TEXT" \
  --engine voxcpm2 \
  --voxcpm2-variant bf16 \
  --voxcpm2-ref-audio reference.wav \
  --language arabic \
  --output generated.wav
speech embed-speaker reference.wav --engine mlx --json
speech embed-speaker generated.wav --engine mlx --json
speech transcribe generated.wav --engine qwen3 --model 0.6B --language arabic
```

A clone can sound like the speaker but say the wrong text, or say the text clearly while missing the speaker. It can also match the text and speaker while carrying audible artifacts. Speaker cosine, ASR error, and UTMOS catch different failure modes, so all three need to be visible.
Good when you want a cloned speaker with simple delivery guidance, such as a calmer, younger, lower-pitched, or whispered read. This benchmark used a neutral delivery.
Useful when you want the same speaker to sound more restrained or more animated without writing emotion tags into the text. This benchmark kept expressiveness neutral.
Strong fit when you want to describe the target voice or delivery in natural language while still cloning from a reference clip. This benchmark used the reference clip only.
Best when the script needs explicit moments like laughing, whispering, excitement, or sadness. This benchmark used plain text with no acting cues.
Uses the FLEURS clip and transcript as in-context conditioning. This is the comparable Qwen3-TTS path for sample-based cloning in the table.
The benchmark intentionally leaves these controls neutral. That keeps speaker similarity tied to the FLEURS reference instead of rewarding a model for adding extra emotion, whispering, shouting, or laughter.
Treat this as a compact engineering check, not a leaderboard. One reference clip per language is useful for finding obvious regressions, but it is too small to crown a universal winner. OmniVoice is fast and text-stable in this run, but UTMOS shows that quality varies by language instead of following WER or cosine cleanly.
Fish Audio and Chatterbox often score better on UTMOS even when another model has a higher speaker cosine. Qwen3-TTS Base ICL is included for English, German, Spanish, and Chinese; it has exact ASR recovery on three of four rows here, but speaker similarity and quality vary sharply by language. The right reading is per-language tradeoffs, not a single winner.