Voice cloning benchmarks
July 2, 2026

Voice cloning models, measured across five languages.

We cloned one dataset reference speaker per language, generated a comparable short benchmark sentence in each language, then scored speaker similarity, ASR recovery, and UTMOS predicted quality. Every row below has the reference clip and the generated output next to the numbers.

English

FLEURS test/en_us/1042003289011443756.wav

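| Model | Precision | Cosine | ASR | UTMOS | Audio | RTF |
| --- | --- | --- | --- | --- | --- | --- |
| OmniVoice | int8 | 0.701 | 0.0% WER | 3.96 | 3.80 s | 0.48 |
| VoxCPM2 | bf16 | 0.624 | 0.0% WER | 3.46 | 3.68 s | 1.84 |
| Chatterbox Multilingual | fp16 | 0.625 | 0.0% WER | 4.00 | 3.80 s | 0.94 |
| Fish Audio S2 Pro | fp16 | 0.590 | 0.0% WER | 4.20 | 3.44 s | 3.15 |
| Qwen3-TTS Base | 1.7B bf16 ICL | 0.159 | 0.0% WER | 1.93 | 4.55 s | 1.82 |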

German

FLEURS test/de_de/10342213717361642954.wav

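| Model | Precision | Cosine | ASR | UTMOS | Audio | RTF |
| --- | --- | --- | --- | --- | --- | --- |
| OmniVoice | int8 | 0.837 | 0.0% WER | 3.11 | 3.47 s | 0.50 |
| Qwen3-TTS Base | 1.7B bf16 ICL | 0.765 | 9.1% WER | 3.53 | 3.66 s | 1.78 |
| Fish Audio S2 Pro | fp16 | 0.749 | 0.0% WER | 3.46 | 3.76 s | 3.08 |
| VoxCPM2 | bf16 | 0.733 | 9.1% WER | 2.92 | 4.16 s | 1.75 |
| Chatterbox Multilingual | fp16 | 0.727 | 9.1% WER | 3.56 | 3.92 s | 0.87 |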

Modern Standard Arabic

FLEURS test/ar_eg/10863341459609935739.wav

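| Model | Precision | Cosine | ASR | UTMOS | Audio | RTF |
| --- | --- | --- | --- | --- | --- | --- |
| VoxCPM2 | bf16 | 0.757 | 0.0% WER | 2.60 | 3.20 s | 1.82 |
| Fish Audio S2 Pro | fp16 | 0.686 | 14.3% WER | 3.31 | 3.48 s | 3.13 |
| Chatterbox Multilingual | fp16 | 0.674 | 0.0% WER | 3.37 | 4.40 s | 0.88 |
| OmniVoice | int8 | 0.621 | 0.0% WER | 2.99 | 3.97 s | 0.47 |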

Spanish

FLEURS test/es_419/16388069031423373053.wav

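| Model | Precision | Cosine | ASR | UTMOS | Audio | RTF |
| --- | --- | --- | --- | --- | --- | --- |
| OmniVoice | int8 | 0.684 | 0.0% WER | 2.73 | 5.26 s | 0.36 |
| Chatterbox Multilingual | fp16 | 0.670 | 0.0% WER | 3.44 | 4.02 s | 0.92 |
| VoxCPM2 | bf16 | 0.658 | 0.0% WER | 2.64 | 3.36 s | 1.80 |
| Fish Audio S2 Pro | fp16 | 0.584 | 0.0% WER | 2.68 | 3.53 s | 3.15 |
| Qwen3-TTS Base | 1.7B bf16 ICL | 0.493 | 0.0% WER | 2.24 | 3.98 s | 1.80 |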

Chinese

FLEURS test/cmn_hans_cn/5479411876618006152.wav

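| Model | Precision | Cosine | ASR | UTMOS | Audio | RTF |
| --- | --- | --- | --- | --- | --- | --- |
| OmniVoice | int8 | 0.690 | 0.0% CER | 3.11 | 4.00 s | 0.42 |
| VoxCPM2 | bf16 | 0.658 | 0.0% CER | 3.10 | 2.88 s | 1.90 |
| Qwen3-TTS Base | 1.7B bf16 ICL | 0.655 | 0.0% CER | 3.64 | 3.42 s | 1.80 |
| Fish Audio S2 Pro | fp16 | 0.598 | 0.0% CER | 3.90 | 3.11 s | 3.20 |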

Higher speaker cosine means the generated clip is closer to the FLEURS reference speaker embedding. Lower WER/CER means Qwen3-ASR recovered the requested text more cleanly. UTMOS is a no-reference predicted naturalness score on a 1-5 scale; higher is better. Lower RTF is faster. These are engineering regression metrics, not a human MOS panel.

Precision is listed per row. VoxCPM2’s public full-precision Swift path is bf16, while OmniVoice is shown with the published int8 bundle because the fp16 backbone was not used for this published run. Quantization can change both quality and speed, so each row lists only the path that was actually measured.
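The post does not spell out how RTF is computed. A common definition, assumed here, is wall-clock generation time divided by the duration of the generated audio, so values below 1.0 are faster than real time. Under that assumption, the Audio and RTF columns imply an approximate generation time for each row. The function names are illustrative and are not part of any tool used in the benchmark.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF under the assumed definition: generation time / generated audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return generation_seconds / audio_seconds


def implied_generation_seconds(rtf: float, audio_seconds: float) -> float:
    """Invert the assumed definition to estimate how long a row took to generate."""
    return rtf * audio_seconds


# Two English rows from the table above: (model, audio seconds, RTF).
rows = [
    ("OmniVoice", 3.80, 0.48),
    ("Fish Audio S2 Pro", 3.44, 3.15),
]

for model, audio_s, rtf in rows:
    estimate = implied_generation_seconds(rtf, audio_s)
    pace = "faster than real time" if rtf < 1.0 else "slower than real time"
    print(f"{model}: about {estimate:.2f} s to generate {audio_s:.2f} s of audio ({pace})")
```

Because the estimate multiplies two rounded table values, it is only good to a couple of significant digits.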

Input and output sample rates

| Source | Role | Rate |
| --- | --- | --- |
| FLEURS references | Input reference | 16 kHz |
| OmniVoice / Chatterbox | Generated output | 24 kHz |
| Qwen3-TTS Base ICL | Generated output | 24 kHz |
| Fish Audio S2 Pro | Generated output | 44.1 kHz |
| VoxCPM2 | Generated output | 48 kHz |

The references are 16 kHz, so they only contain observable evidence up to roughly 8 kHz. Higher-rate outputs can sound more open because they preserve or synthesize more high-band breath, sibilance, and room detail, but 44.1 or 48 kHz output does not automatically mean a better clone. Most speaker identity and intelligibility cues live well below that high band, while excessive high-frequency energy can make a voice feel sharp even when WER and speaker cosine look good.
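The "roughly 8 kHz" figure is the Nyquist limit, half the sample rate. The short sketch below applies that rule to the rates in the table and reports how much of each output band lies above anything the 16 kHz reference could have contained. The helper name is illustrative.

```python
def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency a signal sampled at this rate can represent."""
    return sample_rate_hz / 2.0


# Sample rates from the table above, in Hz.
rates_hz = {
    "FLEURS references": 16_000,
    "OmniVoice / Chatterbox": 24_000,
    "Qwen3-TTS Base ICL": 24_000,
    "Fish Audio S2 Pro": 44_100,
    "VoxCPM2": 48_000,
}

reference_limit_hz = nyquist_hz(rates_hz["FLEURS references"])

for source, rate in rates_hz.items():
    limit_hz = nyquist_hz(rate)
    # Band the model has to synthesize without any evidence from the reference clip.
    unobserved_hz = max(0.0, limit_hz - reference_limit_hz)
    print(
        f"{source}: up to {limit_hz / 1000:.2f} kHz, "
        f"{unobserved_hz / 1000:.2f} kHz above the reference band"
    )
```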

Reference audio and generated clones

English reference

English

FLEURS test/en_us/1042003289011443756.wav

Reference transcript: The Internet combines elements of both mass and interpersonal communication.

Generated text: This is a short voice cloning benchmark for on-device speech.

OmniVoice (int8 clone from the English reference): Cosine 0.701, ASR 0.0% WER, UTMOS 3.96, RTF 0.48

VoxCPM2 (bf16 clone from the English reference): Cosine 0.624, ASR 0.0% WER, UTMOS 3.46, RTF 1.84

Chatterbox Multilingual (fp16 clone from the English reference): Cosine 0.625, ASR 0.0% WER, UTMOS 4.00, RTF 0.94

Fish Audio S2 Pro (fp16 clone from the English reference): Cosine 0.590, ASR 0.0% WER, UTMOS 4.20, RTF 3.15

Qwen3-TTS Base (1.7B bf16 ICL clone from the English reference): Cosine 0.159, ASR 0.0% WER, UTMOS 1.93, RTF 1.82

German reference

German

FLEURS test/de_de/10342213717361642954.wav

Reference transcript: Es ist also möglich, dass der Vermerk einfach als Kennzeichnung hinzugefügt wurde.

Generated text: Dies ist ein kurzer Benchmark für lokale Sprachklonung auf dem Gerät.

OmniVoice (int8 clone from the German reference): Cosine 0.837, ASR 0.0% WER, UTMOS 3.11, RTF 0.50

Qwen3-TTS Base (1.7B bf16 ICL clone from the German reference): Cosine 0.765, ASR 9.1% WER, UTMOS 3.53, RTF 1.78

Fish Audio S2 Pro (fp16 clone from the German reference): Cosine 0.749, ASR 0.0% WER, UTMOS 3.46, RTF 3.08

VoxCPM2 (bf16 clone from the German reference): Cosine 0.733, ASR 9.1% WER, UTMOS 2.92, RTF 1.75

Chatterbox Multilingual (fp16 clone from the German reference): Cosine 0.727, ASR 9.1% WER, UTMOS 3.56, RTF 0.87

Modern Standard Arabic reference

Modern Standard Arabic

FLEURS test/ar_eg/10863341459609935739.wav

Reference transcript: لا تشوه الموقع بوضع علامات أو الكتابات الخادشة على الجدران في المباني.

Generated text: هذا اختبار قصير لاستنساخ الصوت على الجهاز.

VoxCPM2 (bf16 clone from the Modern Standard Arabic reference): Cosine 0.757, ASR 0.0% WER, UTMOS 2.60, RTF 1.82

Fish Audio S2 Pro (fp16 clone from the Modern Standard Arabic reference): Cosine 0.686, ASR 14.3% WER, UTMOS 3.31, RTF 3.13

Chatterbox Multilingual (fp16 clone from the Modern Standard Arabic reference): Cosine 0.674, ASR 0.0% WER, UTMOS 3.37, RTF 0.88

OmniVoice (int8 clone from the Modern Standard Arabic reference): Cosine 0.621, ASR 0.0% WER, UTMOS 2.99, RTF 0.47

Spanish reference

Spanish

FLEURS test/es_419/16388069031423373053.wav

Reference transcript: Internet une y mezcla componentes propios de la comunicación masiva y entre personas.

Generated text: Esta es una breve prueba de clonación de voz en el dispositivo.

OmniVoice (int8 clone from the Spanish reference): Cosine 0.684, ASR 0.0% WER, UTMOS 2.73, RTF 0.36

Chatterbox Multilingual (fp16 clone from the Spanish reference): Cosine 0.670, ASR 0.0% WER, UTMOS 3.44, RTF 0.92

VoxCPM2 (bf16 clone from the Spanish reference): Cosine 0.658, ASR 0.0% WER, UTMOS 2.64, RTF 1.80

Fish Audio S2 Pro (fp16 clone from the Spanish reference): Cosine 0.584, ASR 0.0% WER, UTMOS 2.68, RTF 3.15

Qwen3-TTS Base (1.7B bf16 ICL clone from the Spanish reference): Cosine 0.493, ASR 0.0% WER, UTMOS 2.24, RTF 1.80

Chinese reference

Chinese

FLEURS test/cmn_hans_cn/5479411876618006152.wav

Reference transcript: 互联网结合了大众传播和人际传播的要素。

Generated text: 这是一个简短的本地语音克隆测试。

OmniVoice (int8 clone from the Chinese reference): Cosine 0.690, ASR 0.0% CER, UTMOS 3.11, RTF 0.42

VoxCPM2 (bf16 clone from the Chinese reference): Cosine 0.658, ASR 0.0% CER, UTMOS 3.10, RTF 1.90

Qwen3-TTS Base (1.7B bf16 ICL clone from the Chinese reference): Cosine 0.655, ASR 0.0% CER, UTMOS 3.64, RTF 1.80

Fish Audio S2 Pro (fp16 clone from the Chinese reference): Cosine 0.598, ASR 0.0% CER, UTMOS 3.90, RTF 3.20

Method

Dataset references, not hand-picked demos

References are single clips from the Google FLEURS test split: English, German, Arabic, Spanish, and Mandarin Chinese. For engines that accept a reference transcript, the exact FLEURS transcript was passed with the audio prompt.

The score shape mirrors the objective side of VoxCPM-style voice cloning evaluation: intelligibility via WER/CER, and cloning via speaker-embedding cosine similarity. The speaker encoder here is Soniqo’s `speech embed-speaker --engine mlx`, so compare rows inside this table, not against paper SIM percentages directly.
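The post does not say which scorer or text normalization produced the WER/CER numbers. As a minimal sketch, assume plain Levenshtein edit distance over whitespace-split words for WER and over characters for CER, after lowercasing and stripping ASCII punctuation. With those assumptions, a single wrong word in the 11-word German benchmark sentence gives the 9.1% that appears in the German table. The recognized string below is hypothetical, not an actual transcript from the run.

```python
import string


def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalize(text: str) -> str:
    """Assumed normalization: lowercase and drop ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = normalize(reference).split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, normalize(hypothesis).split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring spaces; the usual choice for Chinese."""
    ref_chars = list(normalize(reference).replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(normalize(hypothesis).replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


requested = "Dies ist ein kurzer Benchmark für lokale Sprachklonung auf dem Gerät."
# Hypothetical ASR output with one substituted word (illustration only).
recognized = "Dies ist ein kurzer Benchmark für lokale Sprachklonierung auf dem Gerät."

print(f"{wer(requested, recognized):.1%}")  # prints 9.1%
```

The same arithmetic explains the 14.3% Arabic row: one word error in a 7-word sentence. With sentences this short, WER moves in coarse steps, which is another reason to read it as a regression check rather than a ranking.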

UTMOS is computed with `utmos22_strong` from SpeechMOS after resampling generated clips to 16 kHz. It gives a no-reference naturalness signal that WER and speaker cosine miss, but it is still a model prediction rather than a human listening study.

Qwen3-TTS Base is measured with the public speech-swift ICL API (`Qwen3TTSModel.fromPretrainedWithEncoder` + `synthesizeWithVoiceCloneICL`) using the same FLEURS reference audio and transcript. It is omitted for Arabic because the current Qwen3-TTS language set exposed here does not include Arabic.

Chatterbox’s upstream language list includes Chinese, but the current Swift frontend only supports the direct tokenizer path for `en`, `ar`, `hi`, `de`, `es`, `fr`, `it`, and `pt`; the Chinese row is intentionally omitted until that frontend lands.

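# Public speech-swift CLI example for one generated row (VoxCPM2 bf16, Arabic).
# TEXT holds the benchmark sentence to synthesize.
speech speak "$TEXT" \
  --engine voxcpm2 \
  --voxcpm2-variant bf16 \
  --voxcpm2-ref-audio reference.wav \
  --language arabic \
  --output generated.wav

# Embed the reference and the generated clip with the same speaker encoder.
speech embed-speaker reference.wav --engine mlx --json
speech embed-speaker generated.wav --engine mlx --json
# Transcribe the generated clip so it can be scored against the requested text.
speech transcribe generated.wav --engine qwen3 --model 0.6B --language arabic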
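The two `embed-speaker` calls each yield one speaker embedding, and the Cosine column is the cosine similarity between them. The JSON layout of the CLI output is not shown in this post, so the sketch below makes no assumption about it and takes two plain lists of floats. The vectors are toy values standing in for real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy 4-dimensional vectors; real speaker embeddings are much larger.
reference_embedding = [0.2, 0.1, -0.4, 0.7]
generated_embedding = [0.1, 0.2, -0.3, 0.6]

print(f"{cosine_similarity(reference_embedding, generated_embedding):.3f}")
```

Cosine similarity ignores vector length, so it compares the direction of the two embeddings only. The absolute values depend on the encoder, which is why scores from a different speaker encoder are not directly comparable.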

Why multiple scores?

A clone can sound like the speaker but say the wrong text, or say the text clearly while missing the speaker. It can also match the text and speaker while carrying audible artifacts. Speaker cosine, ASR error, and UTMOS catch different failure modes, so all three need to be visible.

Emotion and style attribution

OmniVoice
Broad style hints

Good when you want a cloned speaker with simple delivery guidance, such as a calmer, younger, lower-pitched, or whispered read. This benchmark used a neutral delivery.

Chatterbox Multilingual
Expressiveness strength

Useful when you want the same speaker to sound more restrained or more animated without writing emotion tags into the text. This benchmark kept expressiveness neutral.

VoxCPM2
Voice direction in plain words

Strong fit when you want to describe the target voice or delivery in natural language while still cloning from a reference clip. This benchmark used the reference clip only.

Fish Audio S2 Pro
Acted delivery cues

Best when the script needs explicit moments like laughing, whispering, excitement, or sadness. This benchmark used plain text with no acting cues.

Qwen3-TTS Base
Reference-audio ICL

Uses the FLEURS clip and transcript as in-context conditioning. This is the comparable Qwen3-TTS path for sample-based cloning in the table.

The benchmark intentionally leaves these controls neutral. That keeps speaker similarity tied to the FLEURS reference instead of rewarding a model for adding extra emotion, whispering, shouting, or laughter.

Reading the result

Treat this as a compact engineering check, not a leaderboard. One reference clip per language is useful for finding obvious regressions, but it is too small to crown a universal winner. OmniVoice is fast and text-stable in this run, but UTMOS shows that quality varies by language instead of following WER or cosine cleanly.

Fish Audio and Chatterbox often score better on UTMOS even when another model has a higher speaker cosine. Qwen3-TTS Base ICL is included for English, German, Spanish, and Chinese; it has exact ASR recovery on three of four rows here, but speaker similarity and quality vary sharply by language. The right reading is per-language tradeoffs, not a single winner.
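The per-language reading can be checked directly against the tables. The sketch below copies the Cosine and UTMOS columns and reports which model leads each metric in each language. The data structure and helper name are illustrative; only the numbers come from this post.

```python
from collections import defaultdict

# (language, model, speaker cosine, UTMOS) copied from the tables above.
ROWS = [
    ("English", "OmniVoice", 0.701, 3.96),
    ("English", "VoxCPM2", 0.624, 3.46),
    ("English", "Chatterbox Multilingual", 0.625, 4.00),
    ("English", "Fish Audio S2 Pro", 0.590, 4.20),
    ("English", "Qwen3-TTS Base", 0.159, 1.93),
    ("German", "OmniVoice", 0.837, 3.11),
    ("German", "Qwen3-TTS Base", 0.765, 3.53),
    ("German", "Fish Audio S2 Pro", 0.749, 3.46),
    ("German", "VoxCPM2", 0.733, 2.92),
    ("German", "Chatterbox Multilingual", 0.727, 3.56),
    ("Arabic", "VoxCPM2", 0.757, 2.60),
    ("Arabic", "Fish Audio S2 Pro", 0.686, 3.31),
    ("Arabic", "Chatterbox Multilingual", 0.674, 3.37),
    ("Arabic", "OmniVoice", 0.621, 2.99),
    ("Spanish", "OmniVoice", 0.684, 2.73),
    ("Spanish", "Chatterbox Multilingual", 0.670, 3.44),
    ("Spanish", "VoxCPM2", 0.658, 2.64),
    ("Spanish", "Fish Audio S2 Pro", 0.584, 2.68),
    ("Spanish", "Qwen3-TTS Base", 0.493, 2.24),
    ("Chinese", "OmniVoice", 0.690, 3.11),
    ("Chinese", "VoxCPM2", 0.658, 3.10),
    ("Chinese", "Qwen3-TTS Base", 0.655, 3.64),
    ("Chinese", "Fish Audio S2 Pro", 0.598, 3.90),
]

by_language = defaultdict(list)
for language, model, cosine, utmos in ROWS:
    by_language[language].append((model, cosine, utmos))


def leaders(rows):
    """Return (model with highest cosine, model with highest UTMOS)."""
    best_cosine = max(rows, key=lambda row: row[1])[0]
    best_utmos = max(rows, key=lambda row: row[2])[0]
    return best_cosine, best_utmos


for language, rows in by_language.items():
    best_cosine, best_utmos = leaders(rows)
    print(f"{language}: top cosine = {best_cosine}, top UTMOS = {best_utmos}")
```

In this run the cosine leader and the UTMOS leader are different models in all five languages, which is the tradeoff the paragraph above describes. With one clip per language, that pattern is an observation about this run, not a general ranking.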