Voice cloning benchmarks

July 2, 2026

Voice cloning models, measured across five languages.

We cloned one dataset reference speaker per language, generated the same short benchmark sentence in each language, then scored speaker similarity, ASR recovery, and UTMOS predicted quality. Every row below has the reference clip and the generated output next to the numbers.

English

FLEURS test/en_us/1042003289011443756.wav

Model	Precision	Cosine	ASR	UTMOS	Audio	RTF
OmniVoice	int8	0.701	0.0% WER	3.96	3.80 s	0.48
VoxCPM2	bf16	0.624	0.0% WER	3.46	3.68 s	1.84
Chatterbox Multilingual	fp16	0.625	0.0% WER	4.00	3.80 s	0.94
Fish Audio S2 Pro	fp16	0.590	0.0% WER	4.20	3.44 s	3.15
Qwen3-TTS Base	1.7B bf16 ICL	0.159	0.0% WER	1.93	4.55 s	1.82

German

FLEURS test/de_de/10342213717361642954.wav

Model	Precision	Cosine	ASR	UTMOS	Audio	RTF
OmniVoice	int8	0.837	0.0% WER	3.11	3.47 s	0.50
Qwen3-TTS Base	1.7B bf16 ICL	0.765	9.1% WER	3.53	3.66 s	1.78
Fish Audio S2 Pro	fp16	0.749	0.0% WER	3.46	3.76 s	3.08
VoxCPM2	bf16	0.733	9.1% WER	2.92	4.16 s	1.75
Chatterbox Multilingual	fp16	0.727	9.1% WER	3.56	3.92 s	0.87

Modern Standard Arabic

FLEURS test/ar_eg/10863341459609935739.wav

Model	Precision	Cosine	ASR	UTMOS	Audio	RTF
VoxCPM2	bf16	0.757	0.0% WER	2.60	3.20 s	1.82
Fish Audio S2 Pro	fp16	0.686	14.3% WER	3.31	3.48 s	3.13
Chatterbox Multilingual	fp16	0.674	0.0% WER	3.37	4.40 s	0.88
OmniVoice	int8	0.621	0.0% WER	2.99	3.97 s	0.47

Spanish

FLEURS test/es_419/16388069031423373053.wav

Model	Precision	Cosine	ASR	UTMOS	Audio	RTF
OmniVoice	int8	0.684	0.0% WER	2.73	5.26 s	0.36
Chatterbox Multilingual	fp16	0.670	0.0% WER	3.44	4.02 s	0.92
VoxCPM2	bf16	0.658	0.0% WER	2.64	3.36 s	1.80
Fish Audio S2 Pro	fp16	0.584	0.0% WER	2.68	3.53 s	3.15
Qwen3-TTS Base	1.7B bf16 ICL	0.493	0.0% WER	2.24	3.98 s	1.80

Chinese

FLEURS test/cmn_hans_cn/5479411876618006152.wav

Model	Precision	Cosine	ASR	UTMOS	Audio	RTF
OmniVoice	int8	0.690	0.0% CER	3.11	4.00 s	0.42
VoxCPM2	bf16	0.658	0.0% CER	3.10	2.88 s	1.90
Qwen3-TTS Base	1.7B bf16 ICL	0.655	0.0% CER	3.64	3.42 s	1.80
Fish Audio S2 Pro	fp16	0.598	0.0% CER	3.90	3.11 s	3.20

Higher speaker cosine means the generated clip's speaker embedding is closer to that of the FLEURS reference. Lower WER/CER means Qwen3-ASR recovered more of the requested text. UTMOS is a no-reference predicted naturalness score on a 1-5 scale; higher is better. Audio is the length of the generated clip, and RTF (real-time factor) is synthesis time divided by that length, so lower is faster. These are engineering regression metrics, not a human MOS panel.

Precision is listed per row: VoxCPM2’s public full-precision Swift path is bf16, while OmniVoice is shown with the published int8 bundle because the fp16 backbone was not used for this run. Quantization can change both quality and speed, so the table reports only the path actually measured for each row.
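
As a rough illustration of how the Cosine and RTF columns can be read, the sketch below computes a cosine similarity between two vectors and a real-time factor. The toy four-dimensional vectors are made up for the example and are not real speaker embeddings, the function names are hypothetical, and the RTF definition (synthesis time over audio duration) is the usual one, assumed here rather than taken from the benchmark harness.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-length embedding")
    return dot / (norm_a * norm_b)

def real_time_factor(synthesis_seconds, audio_seconds):
    """Wall-clock synthesis time divided by generated audio duration."""
    return synthesis_seconds / audio_seconds

# Toy vectors standing in for real speaker embeddings (illustration only).
reference = [0.2, 0.1, -0.4, 0.7]
generated = [0.25, 0.05, -0.3, 0.6]
similarity = cosine_similarity(reference, generated)

# Under the assumed definition, the OmniVoice English row (RTF 0.48 on a
# 3.80 s clip) would imply roughly this many seconds of synthesis time.
implied_seconds = 0.48 * 3.80
print(f"cosine={similarity:.3f}, implied synthesis time={implied_seconds:.2f} s")
```

Read this way, an RTF below 1.0 means the clip was generated faster than it takes to play back.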

Input and output sample rates

Source	Role	Rate
FLEURS references	Input reference	16 kHz
OmniVoice / Chatterbox	Generated output	24 kHz
Qwen3-TTS Base ICL	Generated output	24 kHz
Fish Audio S2 Pro	Generated output	44.1 kHz
VoxCPM2	Generated output	48 kHz

The references are 16 kHz, so they only contain observable evidence up to roughly 8 kHz. Higher-rate outputs can sound more open because they preserve or synthesize more high-band breath, sibilance, and room detail, but 44.1 or 48 kHz output does not automatically mean a better clone. Most speaker identity and intelligibility cues live well below that high band, while excessive high-frequency energy can make a voice feel sharp even when WER and speaker cosine look good.
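
The bandwidth argument above is Nyquist arithmetic: a signal sampled at a given rate can only represent content up to half that rate. A minimal sketch using the sample rates from the table (the helper name `nyquist_hz` is ours):

```python
def nyquist_hz(sample_rate_hz):
    """Highest frequency representable at a given sample rate."""
    return sample_rate_hz / 2

REFERENCE_RATE_HZ = 16_000
OUTPUT_RATES_HZ = {
    "OmniVoice / Chatterbox": 24_000,
    "Qwen3-TTS Base ICL": 24_000,
    "Fish Audio S2 Pro": 44_100,
    "VoxCPM2": 48_000,
}

reference_band = nyquist_hz(REFERENCE_RATE_HZ)
for source, rate in OUTPUT_RATES_HZ.items():
    output_band = nyquist_hz(rate)
    # Part of the output band that the 16 kHz reference cannot evidence.
    unobserved = output_band - reference_band
    print(f"{source}: up to {output_band / 1000:.2f} kHz, "
          f"{unobserved / 1000:.2f} kHz of it above the reference band")
```

Anything a model emits above the reference band is synthesized rather than copied from the reference clip.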

Reference audio and generated clones

English reference

FLEURS test/en_us/1042003289011443756.wav

Reference transcript: The Internet combines elements of both mass and interpersonal communication.

Generated text: This is a short voice cloning benchmark for on-device speech.

OmniVoice (int8 clone from the English reference): Cosine 0.701, ASR 0.0% WER, UTMOS 3.96, RTF 0.48
VoxCPM2 (bf16 clone from the English reference): Cosine 0.624, ASR 0.0% WER, UTMOS 3.46, RTF 1.84
Chatterbox Multilingual (fp16 clone from the English reference): Cosine 0.625, ASR 0.0% WER, UTMOS 4.00, RTF 0.94
Fish Audio S2 Pro (fp16 clone from the English reference): Cosine 0.590, ASR 0.0% WER, UTMOS 4.20, RTF 3.15
Qwen3-TTS Base (1.7B bf16 ICL clone from the English reference): Cosine 0.159, ASR 0.0% WER, UTMOS 1.93, RTF 1.82

German reference

FLEURS test/de_de/10342213717361642954.wav

Reference transcript: Es ist also möglich, dass der Vermerk einfach als Kennzeichnung hinzugefügt wurde.

Generated text: Dies ist ein kurzer Benchmark für lokale Sprachklonung auf dem Gerät.

OmniVoice (int8 clone from the German reference): Cosine 0.837, ASR 0.0% WER, UTMOS 3.11, RTF 0.50
Qwen3-TTS Base (1.7B bf16 ICL clone from the German reference): Cosine 0.765, ASR 9.1% WER, UTMOS 3.53, RTF 1.78
Fish Audio S2 Pro (fp16 clone from the German reference): Cosine 0.749, ASR 0.0% WER, UTMOS 3.46, RTF 3.08
VoxCPM2 (bf16 clone from the German reference): Cosine 0.733, ASR 9.1% WER, UTMOS 2.92, RTF 1.75
Chatterbox Multilingual (fp16 clone from the German reference): Cosine 0.727, ASR 9.1% WER, UTMOS 3.56, RTF 0.87

Modern Standard Arabic reference

FLEURS test/ar_eg/10863341459609935739.wav

Reference transcript: لا تشوه الموقع بوضع علامات أو الكتابات الخادشة على الجدران في المباني.

Generated text: هذا اختبار قصير لاستنساخ الصوت على الجهاز.

VoxCPM2 (bf16 clone from the Modern Standard Arabic reference): Cosine 0.757, ASR 0.0% WER, UTMOS 2.60, RTF 1.82
Fish Audio S2 Pro (fp16 clone from the Modern Standard Arabic reference): Cosine 0.686, ASR 14.3% WER, UTMOS 3.31, RTF 3.13
Chatterbox Multilingual (fp16 clone from the Modern Standard Arabic reference): Cosine 0.674, ASR 0.0% WER, UTMOS 3.37, RTF 0.88
OmniVoice (int8 clone from the Modern Standard Arabic reference): Cosine 0.621, ASR 0.0% WER, UTMOS 2.99, RTF 0.47

Spanish reference

FLEURS test/es_419/16388069031423373053.wav

Reference transcript: Internet une y mezcla componentes propios de la comunicación masiva y entre personas.

Generated text: Esta es una breve prueba de clonación de voz en el dispositivo.

OmniVoice (int8 clone from the Spanish reference): Cosine 0.684, ASR 0.0% WER, UTMOS 2.73, RTF 0.36
Chatterbox Multilingual (fp16 clone from the Spanish reference): Cosine 0.670, ASR 0.0% WER, UTMOS 3.44, RTF 0.92
VoxCPM2 (bf16 clone from the Spanish reference): Cosine 0.658, ASR 0.0% WER, UTMOS 2.64, RTF 1.80
Fish Audio S2 Pro (fp16 clone from the Spanish reference): Cosine 0.584, ASR 0.0% WER, UTMOS 2.68, RTF 3.15
Qwen3-TTS Base (1.7B bf16 ICL clone from the Spanish reference): Cosine 0.493, ASR 0.0% WER, UTMOS 2.24, RTF 1.80

Chinese reference

FLEURS test/cmn_hans_cn/5479411876618006152.wav

Reference transcript: 互联网结合了大众传播和人际传播的要素。

Generated text: 这是一个简短的本地语音克隆测试。

OmniVoice (int8 clone from the Chinese reference): Cosine 0.690, ASR 0.0% CER, UTMOS 3.11, RTF 0.42
VoxCPM2 (bf16 clone from the Chinese reference): Cosine 0.658, ASR 0.0% CER, UTMOS 3.10, RTF 1.90
Qwen3-TTS Base (1.7B bf16 ICL clone from the Chinese reference): Cosine 0.655, ASR 0.0% CER, UTMOS 3.64, RTF 1.80
Fish Audio S2 Pro (fp16 clone from the Chinese reference): Cosine 0.598, ASR 0.0% CER, UTMOS 3.90, RTF 3.20

Method

Dataset references, not hand-picked demos

References are single clips from the Google FLEURS test split: English, German, Arabic, Spanish, and Mandarin Chinese. For engines that accept a reference transcript, the exact FLEURS transcript was passed with the audio prompt.

The metrics mirror the objective half of VoxCPM-style voice cloning evaluation: intelligibility via WER/CER, and cloning fidelity via speaker-embedding cosine similarity. The speaker encoder here is Soniqo’s `speech embed-speaker --engine mlx`, so compare rows within this page's tables rather than against the SIM percentages reported in papers, which use different encoders.
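
For readers unfamiliar with the error-rate columns, here is a minimal sketch of WER and CER as edit distance over words or characters. It applies no text normalization (case, punctuation), which a real scoring pipeline may do differently, and the "heard" string is a hypothetical ASR output invented for the example, not a transcript from this run.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        current = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            cost = 0 if ref_token == hyp_token else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def word_error_rate(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def char_error_rate(reference, hypothesis):
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)

target = "Dies ist ein kurzer Benchmark für lokale Sprachklonung auf dem Gerät."
# Hypothetical ASR output with one substituted word (illustration only).
heard = "Dies ist ein kurzer Benchmark für lokale Sprachklonierung auf dem Gerät."
print(f"{word_error_rate(target, heard):.1%}")
```

With an 11-word sentence, a single word error gives 1/11, or about 9.1%, which is consistent with the 9.1% German rows; likewise one error in the seven-word Arabic sentence would give about 14.3%. On sentences this short, WER moves in coarse steps.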

UTMOS is computed with `utmos22_strong` from SpeechMOS after resampling generated clips to 16 kHz. It gives a no-reference naturalness signal that WER and speaker cosine miss, but it is still a model prediction rather than a human listening study.

Qwen3-TTS Base is measured with the public speech-swift ICL API (`Qwen3TTSModel.fromPretrainedWithEncoder` + `synthesizeWithVoiceCloneICL`) using the same FLEURS reference audio and transcript. It is omitted for Arabic because the current Qwen3-TTS language set exposed here does not include Arabic.

Chatterbox’s upstream language list includes Chinese, but the current Swift frontend only supports the direct tokenizer path for `en`, `ar`, `hi`, `de`, `es`, `fr`, `it`, and `pt`; the Chinese row is intentionally omitted until that frontend lands.

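# Public speech-swift CLI example for one generated row (VoxCPM2, Arabic).
TEXT="هذا اختبار قصير لاستنساخ الصوت على الجهاز."

# Clone the reference speaker and synthesize the benchmark sentence.
speech speak "$TEXT" \
  --engine voxcpm2 \
  --voxcpm2-variant bf16 \
  --voxcpm2-ref-audio reference.wav \
  --language arabic \
  --output generated.wav

# Embed both clips; speaker cosine is computed from these two embeddings.
speech embed-speaker reference.wav --engine mlx --json
speech embed-speaker generated.wav --engine mlx --json

# Transcribe the generated clip so it can be scored against $TEXT.
speech transcribe generated.wav --engine qwen3 --model 0.6B --language arabic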

Why multiple scores?

A clone can sound like the speaker but say the wrong text, or say the text clearly while missing the speaker. It can also match the text and speaker while carrying audible artifacts. Speaker cosine, ASR error, and UTMOS catch different failure modes, so all three need to be visible.
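
One way to make those failure modes concrete is to flag each row against per-metric thresholds. The sketch below does that for three rows copied from the English and German tables. The thresholds are hypothetical values chosen for illustration; the benchmark itself defines no pass/fail cutoffs.

```python
from dataclasses import dataclass

@dataclass
class Row:
    model: str
    cosine: float
    error_rate: float  # WER or CER as a fraction
    utmos: float

# Hypothetical thresholds for illustration only.
MIN_COSINE = 0.5
MAX_ERROR_RATE = 0.05
MIN_UTMOS = 3.0

def failure_modes(row):
    """Return the list of failure modes a row trips, if any."""
    flags = []
    if row.cosine < MIN_COSINE:
        flags.append("misses the speaker")
    if row.error_rate > MAX_ERROR_RATE:
        flags.append("says the wrong text")
    if row.utmos < MIN_UTMOS:
        flags.append("low predicted naturalness")
    return flags

# Values copied from the English and German tables.
rows = [
    Row("Qwen3-TTS Base (English)", 0.159, 0.0, 1.93),
    Row("Chatterbox Multilingual (English)", 0.625, 0.0, 4.00),
    Row("VoxCPM2 (German)", 0.733, 0.091, 2.92),
]
for row in rows:
    print(row.model, "->", failure_modes(row) or ["no flags"])
```

The Qwen3-TTS Base English row shows why a single score is not enough: its text is recovered exactly, yet it trips both the speaker and naturalness checks.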

Emotion and style attribution

OmniVoice

Broad style hints

Good when you want a cloned speaker with simple delivery guidance, such as a calmer, younger, lower-pitched, or whispered read. This benchmark used a neutral delivery.

Chatterbox Multilingual

Expressiveness strength

Useful when you want the same speaker to sound more restrained or more animated without writing emotion tags into the text. This benchmark kept expressiveness neutral.

VoxCPM2

Voice direction in plain words

Strong fit when you want to describe the target voice or delivery in natural language while still cloning from a reference clip. This benchmark used the reference clip only.

Fish Audio S2 Pro

Acted delivery cues

Best when the script needs explicit moments like laughing, whispering, excitement, or sadness. This benchmark used plain text with no acting cues.

Qwen3-TTS Base

Reference-audio ICL

Uses the FLEURS clip and transcript as in-context conditioning. This is the comparable Qwen3-TTS path for sample-based cloning in the table.

The benchmark intentionally leaves these controls neutral. That keeps speaker similarity tied to the FLEURS reference instead of rewarding a model for adding extra emotion, whispering, shouting, or laughter.

Reading the result

Treat this as a compact engineering check, not a leaderboard. One reference clip per language is useful for finding obvious regressions, but it is too small to crown a universal winner. OmniVoice is fast and text-stable in this run, but UTMOS shows that quality varies by language instead of following WER or cosine cleanly.

Fish Audio and Chatterbox often score better on UTMOS even when another model has a higher speaker cosine. Qwen3-TTS Base ICL is included for English, German, Spanish, and Chinese; it has exact ASR recovery on three of four rows here, but speaker similarity and quality vary sharply by language. The right reading is per-language tradeoffs, not a single winner.
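
The per-language tradeoff can be checked directly from the tables. This small sketch picks the leading model per metric for the English and German rows; the `leaders` helper is ours, and the numbers are copied from the tables above.

```python
# (model, cosine, UTMOS, RTF) copied from the English and German tables.
TABLES = {
    "English": [
        ("OmniVoice", 0.701, 3.96, 0.48),
        ("VoxCPM2", 0.624, 3.46, 1.84),
        ("Chatterbox Multilingual", 0.625, 4.00, 0.94),
        ("Fish Audio S2 Pro", 0.590, 4.20, 3.15),
        ("Qwen3-TTS Base", 0.159, 1.93, 1.82),
    ],
    "German": [
        ("OmniVoice", 0.837, 3.11, 0.50),
        ("Qwen3-TTS Base", 0.765, 3.53, 1.78),
        ("Fish Audio S2 Pro", 0.749, 3.46, 3.08),
        ("VoxCPM2", 0.733, 2.92, 1.75),
        ("Chatterbox Multilingual", 0.727, 3.56, 0.87),
    ],
}

def leaders(rows):
    """Best model per metric: highest cosine, highest UTMOS, lowest RTF."""
    return {
        "cosine": max(rows, key=lambda r: r[1])[0],
        "utmos": max(rows, key=lambda r: r[2])[0],
        "rtf": min(rows, key=lambda r: r[3])[0],
    }

for language, rows in TABLES.items():
    print(language, leaders(rows))
```

In both languages OmniVoice leads on cosine and RTF, but the UTMOS leader differs (Fish Audio S2 Pro for English, Chatterbox Multilingual for German), which is the kind of split the paragraph above describes.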

Try the stack

Speech Studio

Local desktop app for cloning voices and rendering multi-speaker scripts on your machine.

Open Speech Studio

Soniqo Cloud

Hosted endpoint for testing the same speech stack before wiring it into a product.

Open cloud.soniqo.audio

Voice cloning docs

VoxCPM evaluation paper