Hibiki Zero-3B Speech Translation (FR / ES / PT / DE → EN)

Hibiki Zero-3B is Kyutai's streaming speech-to-speech translation model — input is a 24 kHz audio stream in French, Spanish, Portuguese, or German; output is a 24 kHz English audio stream plus a parallel English text transcript at the same 12.5 Hz frame rate. Built on the Moshi/Mimi multistream architecture: a single decoder-only transformer jointly models the source-audio codec stream and the target text+audio streams, so there's no separate ASR + MT + TTS pipeline. The Soniqo build runs as quantized MLX safetensors (INT4 default, INT8 available) entirely on Apple Silicon. CC-BY-4.0.

When to reach for Hibiki vs. ASR + MADLAD

Piping ASR into MADLAD (speech transcribe | speech translate) covers 400+ languages but adds the round-trip latency of three models (ASR, MT, and a TTS stage for spoken output). Hibiki is a single end-to-end model and preserves prosody. Pick it when you need live speech in the target language rather than just text.

Quick Start

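import Foundation
import HibikiTranslate
import AudioCommon

// Example paths; replace with your own files.
let input = URL(fileURLWithPath: "input_fr.wav")
let output = URL(fileURLWithPath: "out_en.wav")

// Load the default (INT4) model.
let model = try await HibikiTranslateModel.fromPretrained()

// Load the source clip as 24 kHz PCM, the rate the model expects.
let pcm = try AudioFileLoader.load(url: input, targetSampleRate: 24000)
let (englishAudio, textTokens) = model.translate(
    sourceAudio: pcm,
    sourceLanguage: .fr     // .fr, .es, .pt, or .de; auto-detected, so this is metadata only
)
// englishAudio is 24 kHz English PCM; textTokens are the text-stream token IDs.
try WAVWriter.write(samples: englishAudio, sampleRate: 24000, to: output)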

CLI

speech audio-translate input_fr.wav -o out_en.wav --source-lang fr
speech audio-translate input_es.wav -o out_en.wav --source-lang es --quantization 8bit
speech audio-translate input_pt.wav -o out_en.wav --source-lang pt --verbose

# Deterministic mode (used by the CI regression canaries)
HIBIKI_GREEDY=1 speech audio-translate input_fr.wav -o out_en.wav --source-lang fr

# Inner-monologue text token IDs (raw — SPM decode is a follow-up)
speech audio-translate input.wav -o out.wav --transcript

Architecture

Hibiki Zero-3B is a 3.1B-parameter decoder-only multistream transformer. The model jointly attends over 33 streams per frame: one text stream, 16 target-audio codebooks (the agent's output), and 16 source-audio codebooks (the user's input). At each 80 ms frame the model samples one text token plus 16 audio codes via a small 6-layer depformer that runs 16 sub-steps per frame, one per target codebook, with a 9-slice scheduled MultiLinear projection.
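The frame and stream arithmetic above can be written out as a small sketch. The constants come from the text; the Stream type, its name, and the index ordering (text first, then target audio, then source audio) are assumptions made for illustration, not the port's actual layout.

```swift
// Constants taken from the text; all names here are illustrative.
let sampleRate = 24_000          // Hz, audio rate in and out
let frameRate = 12.5             // Hz, one transformer step per frame
let targetCodebooks = 16
let sourceCodebooks = 16

let frameMilliseconds = 1_000.0 / frameRate                  // 80 ms per frame
let samplesPerFrame = Int(Double(sampleRate) / frameRate)    // PCM samples per frame
let streamsPerFrame = 1 + targetCodebooks + sourceCodebooks  // text + target + source

/// Hypothetical stream index mapping.
/// Assumed ordering: index 0 is text, 1...16 target audio, 17...32 source audio.
enum Stream: Equatable {
    case text
    case target(codebook: Int)
    case source(codebook: Int)

    init?(index: Int) {
        switch index {
        case 0: self = .text
        case 1...16: self = .target(codebook: index - 1)
        case 17...32: self = .source(codebook: index - 17)
        default: return nil
        }
    }
}
```

Under these numbers each frame covers 1920 PCM samples, which is also the granularity at which output duration can change.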

The audio codec is Mimi at 12.5 Hz / 16 codebooks. Source audio is encoded into the 16 source-stream codebooks (delay [0, 2, 2, …, 2]); generated target audio fills the 16 target-stream codebooks (same delay pattern); per-codebook un-shift is applied before Mimi decodes the target back to 24 kHz English PCM. The temporal backbone is 28 GQA layers (dim = 2048, 16 query heads, 8 KV heads, kv_repeat = 2, split-half RoPE rope_concat, no conditioner — Zero is the unconditional variant).
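To make the delay pattern concrete, here is a minimal sketch of shifting and un-shifting codebooks with delays [0, 2, 2, …, 2]. The function names, the pad value, and the trailing padding are assumptions for illustration; the actual port may handle the edges differently.

```swift
/// Delay pattern from the text: codebook 0 is undelayed, the other 15 lag by 2 frames.
let delays = [0] + Array(repeating: 2, count: 15)

/// Shift each codebook right by its delay, filling gaps with `pad`.
/// `codes[k][t]` is codebook k at frame t; all rows must have equal length.
func applyDelays(_ codes: [[Int]], delays: [Int], pad: Int) -> [[Int]] {
    let maxDelay = delays.max() ?? 0
    return zip(codes, delays).map { (row, delay) -> [Int] in
        let head = [Int](repeating: pad, count: delay)
        let tail = [Int](repeating: pad, count: maxDelay - delay)
        return head + row + tail
    }
}

/// Inverse (the "un-shift"): drop each codebook's own delay so that frame t
/// lines up across all codebooks again before the codec decodes them.
/// Rows are expected to be at least `maxDelay` long.
func removeDelays(_ delayed: [[Int]], delays: [Int]) -> [[Int]] {
    let maxDelay = delays.max() ?? 0
    return zip(delayed, delays).map { (row, delay) -> [Int] in
        Array(row[delay..<(row.count - (maxDelay - delay))])
    }
}
```

Round-tripping a code matrix through applyDelays and removeDelays returns the original rows, which is a convenient sanity check when porting the un-shift step.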

Decode Loop

Hibiki emits SPM padding tokens (id 3) while it accumulates enough source context to translate, then content text tokens with matching target audio, and finally text-EOS (id 2). The Swift driver runs until EOS is sampled past the source window, capped at max(tSrc × 5/2, tSrc + 20) steps as a safety bound, where tSrc is the source length in frames. Output runs roughly 1.0–1.6× the input duration on FLEURS-style clips; callers should not assume output_duration == input_duration.
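The termination rule and safety cap can be sketched as follows. This is a skeleton under stated assumptions, not the driver's code: sampleText is a hypothetical stand-in for one model step, "past the source window" is read as step ≥ tSrc, and integer division is used because the text does not say how the cap is rounded.

```swift
let textEOS = 2   // text end-of-sequence id, from the text

/// Safety bound from the text: max(tSrc × 5/2, tSrc + 20) steps.
/// Integer division rounds down here; the real rounding is not specified.
func maxSteps(sourceFrames tSrc: Int) -> Int {
    max(tSrc * 5 / 2, tSrc + 20)
}

/// Hypothetical loop skeleton: `sampleText(step)` stands in for one model step
/// and returns the sampled text token id for that step.
func decodeTextTokens(sourceFrames tSrc: Int, sampleText: (Int) -> Int) -> [Int] {
    var tokens: [Int] = []
    for step in 0..<maxSteps(sourceFrames: tSrc) {
        let token = sampleText(step)
        tokens.append(token)
        // An EOS inside the source window is ignored; only a later one stops the loop.
        if token == textEOS && step >= tSrc { break }
    }
    return tokens
}
```

For short clips the additive term dominates (tSrc + 20), while for longer clips the 5/2 multiplier does, so the cap always leaves headroom above the 1.6× output ratio quoted above.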

The autoregressive feedback path is non-obvious. At a given step, the transformer reads tokens at cache index step (the same index for all 33 streams, with init-token substitution when step ≤ delays[k]), and the sampled text token plus 16 target codes are written at index step + 1. This mirrors upstream Moshi lm.py, where state.offsets += 1 happens before the cache scatter. The text_emb row for EOS (id 2) is aliased to row 3 (PAD) at weight-load time, mirroring the "implicitly replace early EOS with PAD" patch at Kyutai's loaders.py:312. As a result, an EOS sampled during the audio-streaming window is harmless, and only a post-source EOS terminates the loop.
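A toy model of the indexing described above may help when reading the driver. Everything here is a simplified assumption: TokenCache, initToken, and aliasEOSToPAD are hypothetical names, the cache is a plain array rather than an MLX tensor, and falling back to the init token for unwritten slots is a choice made for this sketch.

```swift
/// Toy token cache: slots[k][i] is stream k at cache index i, nil if unwritten.
struct TokenCache {
    var slots: [[Int?]]
    let delays: [Int]
    let initToken: Int   // hypothetical placeholder id

    /// Input for stream k at `step`: read cache index `step`, substituting the
    /// init token while step ≤ delays[k], as the text describes.
    func input(stream k: Int, step: Int) -> Int {
        if step <= delays[k] { return initToken }
        return slots[k][step] ?? initToken
    }

    /// Sampled tokens for the generated streams (sampled[k] belongs to stream k)
    /// land one slot ahead of the index that was just read, at step + 1.
    mutating func write(sampled: [Int], step: Int) {
        for (k, token) in sampled.enumerated() where step + 1 < slots[k].count {
            slots[k][step + 1] = token
        }
    }
}

/// Alias the EOS embedding row to the PAD row at load time, so feeding back an
/// early EOS looks identical to feeding back PAD.
func aliasEOSToPAD(_ textEmbedding: inout [[Float]], eos: Int = 2, pad: Int = 3) {
    textEmbedding[eos] = textEmbedding[pad]
}
```

The point of the aliasing is visible in the last function: after it runs, the model cannot distinguish a fed-back EOS from a fed-back PAD, so only the loop's own post-source check gives EOS any effect.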

Model Variants

| Variant | Quantization | Size | Compute | HuggingFace |
| --- | --- | --- | --- | --- |
| Hibiki Zero-3B | INT4 | ~2.7 GB | Metal GPU (MLX) | aufklarer/Hibiki-Zero-3B-MLX-4bit |
| Hibiki Zero-3B | INT8 | ~3.9 GB | Metal GPU (MLX) | aufklarer/Hibiki-Zero-3B-MLX-8bit |

Language Coverage

Hibiki Zero-3B is trained on French, Spanish, Portuguese, and German → English. The Swift driver auto-detects the source language; the --source-lang flag is metadata only.

| Source | Status | Sample greedy output |
| --- | --- | --- |
| FR | Strict E2E canary | "so it's a ski route." (from "Pensez à l'itinéraire de ski…") |
| ES | Strict E2E canary | "gentlemen, the data is worrying." (Hibiki europarl sample) |
| PT | Warn-only (content-faithful, lower keyword recall) | "the fifth c is p of the martyr." (FLEURS PT) |
| DE | Warn-only (content-faithful, lower keyword recall) | "that didn't seem to me to be useful." (FLEURS DE) |

FLEURS Spanish is out-of-distribution

16 kHz human-recorded FLEURS Spanish clips trigger degenerate generation in both the Python upstream and the Swift port (Python emits 1643 steps / ~131 s of broken audio without sampling EOS). The Swift ES regression canary instead uses a 5 s excerpt trimmed from Kyutai's own samples space (kyutai/hibiki-zero-samples): 24 kHz TTS-generated audio, which matches the training distribution and produces clean English. If you're feeding Hibiki Spanish in production, pre-resample to 24 kHz and stick to longer clips (5 s+).
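The production advice above can be turned into a small preflight check. This is only a sketch: ClipIssue and preflight are hypothetical names, the thresholds are the 24 kHz and 5 s figures from the text, and the check cannot detect out-of-distribution audio as such; it only flags the two properties the text calls out.

```swift
/// Hypothetical issue list for a clip about to be sent to the translator.
enum ClipIssue: Equatable {
    case wrongSampleRate(Int)
    case tooShort(seconds: Double)
}

/// Flag clips that are not at the expected rate or are shorter than `minSeconds`.
func preflight(sampleCount: Int, sampleRate: Int,
               expectedRate: Int = 24_000, minSeconds: Double = 5.0) -> [ClipIssue] {
    var issues: [ClipIssue] = []
    if sampleRate != expectedRate {
        issues.append(.wrongSampleRate(sampleRate))
    }
    let seconds = Double(sampleCount) / Double(sampleRate)
    if seconds < minSeconds {
        issues.append(.tooShort(seconds: seconds))
    }
    return issues
}
```

A caller might log the returned issues and resample or reject the clip before running the model, rather than discovering a runaway generation afterwards.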

Environment Variables

| Variable | Effect |
| --- | --- |
| HIBIKI_GREEDY=1 | Force argmax decoding for both text and target audio. Reproducible; used by the strict CI canaries. |
| HIBIKI_E2E=1 | Enable the E2E test cases (requires the ~2.7 GB model download). |
| HIBIKI_STRICT_ALL=1 | Promote PT/DE tests from warn-only to strict. |
| HIBIKI_LENIENT=1 | Demote FR/ES tests from strict to warn-only (debugging only). |
| HIBIKI_MODEL_ID=<repo> | Override the default aufklarer/Hibiki-Zero-3B-MLX-4bit model id. |
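For readers wiring these variables into their own tooling, here is one way they could be read with Foundation's ProcessInfo. The HibikiEnvironment struct is hypothetical, and treating only the exact value "1" as enabled is an assumption; the port's own parsing may be more lenient.

```swift
import Foundation

/// Hypothetical snapshot of the Hibiki-related environment variables.
struct HibikiEnvironment {
    let greedy: Bool
    let runE2E: Bool
    let strictAll: Bool
    let lenient: Bool
    let modelID: String

    init(_ env: [String: String] = ProcessInfo.processInfo.environment) {
        // Assumption: a flag is on only when set to exactly "1".
        func flag(_ name: String) -> Bool { env[name] == "1" }
        greedy = flag("HIBIKI_GREEDY")
        runE2E = flag("HIBIKI_E2E")
        strictAll = flag("HIBIKI_STRICT_ALL")
        lenient = flag("HIBIKI_LENIENT")
        // Default model id as listed in the table above.
        modelID = env["HIBIKI_MODEL_ID"] ?? "aufklarer/Hibiki-Zero-3B-MLX-4bit"
    }
}
```

Passing the dictionary in as a parameter keeps the struct easy to test without mutating the real process environment.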

Performance (M2 Max, MLX 4-bit)

| Metric | Greedy | Sampled |
| --- | --- | --- |
| Per-step latency | ~75 ms | ~95 ms |
| Wall-clock for 3.54 s FR source | ~5 s | ~7 s |
| Output duration | 1.0–1.6× source | 1.0–1.6× source |
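The figures in this table can be related to each other with a little arithmetic. The sketch below assumes decode steps dominate wall-clock time and ignores model load and codec overhead; the function names are illustrative.

```swift
let frameRate = 12.5                           // decode steps per second of audio
let frameMilliseconds = 1_000.0 / frameRate    // 80 ms of audio per step

/// Step latency divided by audio produced per step.
/// Values above 1.0 mean decoding is slower than real time.
func realTimeFactor(stepLatencyMs: Double) -> Double {
    stepLatencyMs / frameMilliseconds
}

/// Rough decode-only estimate: number of output frames times per-step latency.
func estimatedDecodeSeconds(sourceSeconds: Double,
                            outputRatio: Double,
                            stepLatencyMs: Double) -> Double {
    let steps = (sourceSeconds * outputRatio * frameRate).rounded(.up)
    return steps * stepLatencyMs / 1_000.0
}

let greedyRTF = realTimeFactor(stepLatencyMs: 75)     // just under real time
let sampledRTF = realTimeFactor(stepLatencyMs: 95)    // somewhat over real time
let greedyWall = estimatedDecodeSeconds(sourceSeconds: 3.54, outputRatio: 1.6, stepLatencyMs: 75)
let sampledWall = estimatedDecodeSeconds(sourceSeconds: 3.54, outputRatio: 1.6, stepLatencyMs: 95)
```

At the upper end of the 1.0–1.6× output range, the 3.54 s clip needs 71 steps, which gives estimates close to the ~5 s and ~7 s wall-clock figures in the table.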

Known Limitations

References