Speaker Diarization
Identify who spoke when in a multi-speaker recording. Two diarization engines are available: a two-stage Pyannote pipeline (segmentation + activity-based speaker chaining, then post-hoc embedding) and an end-to-end Sortformer model (CoreML, Neural Engine).
Engines
Select the engine with --engine pyannote (default) or --engine sortformer.
Pyannote (default)
Two-stage pipeline: Pyannote segmentation processes overlapping windows with activity-based speaker chaining (Pearson correlation in overlap zones) to assign global speaker labels. Post-hoc WeSpeaker embedding extraction enables target speaker identification via enrollment audio.
Sortformer (CoreML)
NVIDIA's end-to-end neural diarization model. Directly predicts per-frame speaker activity for up to 4 speakers without separate embedding or clustering stages. Runs on Neural Engine via CoreML with streaming state buffers (FIFO + speaker cache).
Sortformer does not produce speaker embeddings. The --target-speaker and --embedding-engine flags are only available with the Pyannote engine.
Pyannote Pipeline
The default pipeline runs in two stages:
Stage 1: Segmentation + Speaker Chaining
Pyannote segmentation-3.0 processes 10-second sliding windows with 50% overlap, so adjacent windows share a 5-second overlap zone. A powerset decoder converts the 7-class output into per-speaker probabilities (up to 3 local speakers per window). Speaker identity is propagated across windows by computing the Pearson correlation between probability tracks in the overlap zone, then applying greedy exclusive matching to assign consistent global speaker IDs.
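The chaining step can be illustrated with a minimal sketch. It assumes each window exposes its overlap-zone probabilities as one [Float] track per local speaker. The names pearson, matchSpeakers, and the minCorrelation threshold are hypothetical and are not part of the Soniqo API.

```swift
import Foundation

/// Pearson correlation between two equal-length probability tracks.
/// Returns 0 when either track is constant (zero variance).
func pearson(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count && !a.isEmpty)
    let n = Float(a.count)
    let meanA = a.reduce(0, +) / n
    let meanB = b.reduce(0, +) / n
    var cov: Float = 0, varA: Float = 0, varB: Float = 0
    for i in 0..<a.count {
        let da = a[i] - meanA
        let db = b[i] - meanB
        cov += da * db
        varA += da * da
        varB += db * db
    }
    let denom = (varA * varB).squareRoot()
    return denom > 0 ? cov / denom : 0
}

/// Greedy exclusive matching: repeatedly take the highest-correlation
/// (previous, current) pair whose speakers are both still unmatched.
/// Returns a map from current-window local index to previous-window index.
/// `minCorrelation` is an assumed threshold, not a documented value.
func matchSpeakers(previous: [[Float]], current: [[Float]],
                   minCorrelation: Float = 0.5) -> [Int: Int] {
    var pairs: [(prev: Int, cur: Int, r: Float)] = []
    for (p, prevTrack) in previous.enumerated() {
        for (c, curTrack) in current.enumerated() {
            pairs.append((prev: p, cur: c, r: pearson(prevTrack, curTrack)))
        }
    }
    pairs.sort { $0.r > $1.r }

    var usedPrev = Set<Int>()
    var usedCur = Set<Int>()
    var mapping: [Int: Int] = [:]
    for pair in pairs where pair.r >= minCorrelation {
        if usedPrev.contains(pair.prev) || usedCur.contains(pair.cur) { continue }
        mapping[pair.cur] = pair.prev
        usedPrev.insert(pair.prev)
        usedCur.insert(pair.cur)
    }
    return mapping
}
```

In this sketch, a current-window speaker missing from the returned map would be treated as a new global speaker.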
Stage 2: Post-hoc Embedding
After diarization, WeSpeaker ResNet34-LM extracts a 256-dimensional centroid embedding per speaker. These embeddings enable target speaker extraction (--target-speaker) but do not drive the speaker assignment itself.
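One common way to form a per-speaker centroid is to average that speaker's segment embeddings and L2-normalize the mean. The sketch below assumes that approach. The source does not state how Soniqo aggregates embeddings, and centroid(of:) is a hypothetical helper.

```swift
import Foundation

/// Averages a speaker's segment embeddings and L2-normalizes the mean.
/// Assumes every embedding has the same dimension (256 for WeSpeaker ResNet34-LM).
func centroid(of embeddings: [[Float]]) -> [Float] {
    guard let dim = embeddings.first?.count else { return [] }
    var mean = [Float](repeating: 0, count: dim)
    for e in embeddings {
        precondition(e.count == dim, "embedding dimension mismatch")
        for i in 0..<dim { mean[i] += e[i] }
    }
    let n = Float(embeddings.count)
    for i in 0..<dim { mean[i] /= n }
    // L2-normalize so later cosine comparisons reduce to dot products.
    let sumSquares: Float = mean.reduce(0) { $0 + $1 * $1 }
    let norm = sumSquares.squareRoot()
    return norm > 0 ? mean.map { $0 / norm } : mean
}
```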
Migrating from pyannote.audio
If you are coming from the Python pyannote.audio library (for example, replacing a Pipeline subclass that sets pipeline.segmentation = ..., or moving away from a server hosting pyannote/speaker-diarization-3.1), note that Soniqo wraps the same Pyannote-Segmentation-3.0 model and runs it entirely on-device on Apple Silicon. It needs no Python runtime, no CUDA, and no Hugging Face token at inference time.
| pyannote.audio (Python) | Soniqo (Swift) |
|---|---|
| Pipeline.from_pretrained("pyannote/speaker-diarization-3.1") | DiarizationPipeline.fromPretrained() |
| pipeline(audio_file) | pipeline.diarize(audio: samples, sampleRate: 16000) |
| pipeline.segmentation = ... (custom subclass) | Fixed: Pyannote-Segmentation-3.0 (MLX or CoreML, auto-selected) |
| diarization.itertracks(yield_label=True) | for seg in result.segments { ... } |
| diarization.write_rttm(file) | CLI: --rttm |
| pyannote.metrics.diarization.DiarizationErrorRate | CLI: --score-against reference.rttm |
The Pyannote-Segmentation-3.0 weights are converted from the upstream Hugging Face checkpoint, so segmentation logits are numerically equivalent within float-precision tolerance. The post-segmentation chaining (Pearson correlation across overlapping windows plus greedy exclusive matching) and the post-hoc WeSpeaker embedding stage are reimplemented in Swift and produce RTTM output comparable to the reference Python pipeline.
There is no streaming OnlineSpeakerDiarization equivalent for the Pyannote engine. For real-time diarization use --engine sortformer instead, which runs the Sortformer model with FIFO and speaker-cache state buffers.
CLI Usage
# Basic diarization (pyannote, default)
.build/release/speech diarize meeting.wav
# End-to-end Sortformer (CoreML)
.build/release/speech diarize meeting.wav --engine sortformer
# RTTM output format (for evaluation)
.build/release/speech diarize meeting.wav --rttm
# JSON output
.build/release/speech diarize meeting.wav --json
Target Speaker Extraction
Provide enrollment audio of a known speaker to extract only their segments from a recording. The pipeline computes an embedding of the enrollment audio, compares it with each diarized speaker's centroid embedding, and selects the speaker with the highest cosine similarity.
# Extract segments for a specific speaker
.build/release/speech diarize meeting.wav --target-speaker enrollment.wav
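The selection by cosine similarity can be sketched as follows. This is an illustration under assumptions, not the Soniqo implementation: centroids are assumed to be keyed by an integer speaker ID, and cosineSimilarity and bestMatch are hypothetical names.

```swift
import Foundation

/// Cosine similarity between two equal-length vectors; 0 if either has zero norm.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count)
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in 0..<a.count {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    let denom = normA.squareRoot() * normB.squareRoot()
    return denom > 0 ? dot / denom : 0
}

/// Returns the speaker whose centroid is most similar to the enrollment
/// embedding, or nil when there are no speakers.
func bestMatch(target: [Float],
               centroids: [Int: [Float]]) -> (speakerId: Int, score: Float)? {
    let scored = centroids.map {
        (speakerId: $0.key, score: cosineSimilarity(target, $0.value))
    }
    return scored.max { $0.score < $1.score }
}
```

A practical system would likely also reject matches below a minimum similarity; whether Soniqo applies such a threshold is not stated here.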
DER Scoring
Evaluate diarization quality by scoring against a reference RTTM file. The pipeline computes the Diarization Error Rate (DER): the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total speech time in the reference.
# Score against reference RTTM
.build/release/speech diarize meeting.wav --score-against reference.rttm
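To make the metric concrete, here is a simplified frame-level sketch of DER. It assumes both annotations are already discretized into equal-length frame arrays with at most one speaker per frame, and that hypothesis labels are already mapped onto reference labels. Full scorers also search for the optimal label mapping and handle overlapping speech, which this sketch omits. diarizationErrorRate is a hypothetical helper.

```swift
import Foundation

/// Frame-level DER for single-speaker-per-frame labels (nil = non-speech).
/// Assumes hypothesis speaker labels are already mapped onto reference labels.
func diarizationErrorRate(reference: [Int?], hypothesis: [Int?]) -> Double {
    precondition(reference.count == hypothesis.count)
    var missed = 0, falseAlarm = 0, confusion = 0, totalSpeech = 0
    for (ref, hyp) in zip(reference, hypothesis) {
        if ref != nil { totalSpeech += 1 }
        switch (ref, hyp) {
        case (.some, .none): missed += 1          // speech scored as silence
        case (.none, .some): falseAlarm += 1      // silence scored as speech
        case let (.some(r), .some(h)) where r != h: confusion += 1  // wrong speaker
        default: break                            // correct frame
        }
    }
    guard totalSpeech > 0 else { return 0 }
    return Double(missed + falseAlarm + confusion) / Double(totalSpeech)
}
```

Because false alarms count in the numerator but not the denominator, DER can exceed 1.0.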
RTTM Output
The --rttm flag produces Rich Transcription Time Marked output, a standard format used for diarization evaluation. Each line follows the format:
SPEAKER filename 1 start_time duration <NA> <NA> speaker_id <NA> <NA>
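As an illustration of the line format, the following sketch formats one segment as an RTTM line. The helper rttmLine, the three-decimal precision, and the speaker label used in the test values are assumptions for illustration; the exact output of the CLI may differ.

```swift
import Foundation

/// Formats one segment as an RTTM SPEAKER line.
/// The channel field is fixed to 1 and unused fields are written as <NA>.
func rttmLine(file: String, start: Double, end: Double, speaker: String) -> String {
    let duration = max(0, end - start)  // RTTM stores duration, not end time
    let tbeg = String(format: "%.3f", start)
    let tdur = String(format: "%.3f", duration)
    return "SPEAKER \(file) 1 \(tbeg) \(tdur) <NA> <NA> \(speaker) <NA> <NA>"
}
```

Note that RTTM records a start time and a duration, so a segment's end time must be converted before writing.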
Options
| Option | Description |
|---|---|
| --target-speaker | Enrollment audio for target speaker extraction (pyannote only) |
| --embedding-engine | Speaker embedding engine: mlx or coreml (pyannote only) |
| --vad-filter | Pre-filter with Silero VAD (pyannote only) |
| --rttm | Output in RTTM format |
| --json | JSON output format |
| --score-against | Reference RTTM file for DER evaluation |
Diarization works best with recordings that have clear speaker turns. Highly overlapping speech may reduce accuracy. Speaker count is determined automatically.
Model Downloads
Models are downloaded automatically on first use:
| Component | Model | Size | HuggingFace |
|---|---|---|---|
| Segmentation | Pyannote-Segmentation-3.0 | ~5.7 MB | aufklarer/Pyannote-Segmentation-MLX |
| Speaker Embedding | WeSpeaker-ResNet34-LM (MLX) | ~25 MB | aufklarer/WeSpeaker-ResNet34-LM-MLX |
| Speaker Embedding | WeSpeaker-ResNet34-LM (CoreML) | ~25 MB | aufklarer/WeSpeaker-ResNet34-LM-CoreML |
| Sortformer | Sortformer Diarization (CoreML) | ~240 MB | aufklarer/Sortformer-Diarization-CoreML |
Swift API
import SpeechVAD
let pipeline = try await DiarizationPipeline.fromPretrained()
let result = pipeline.diarize(audio: samples, sampleRate: 16000)
for seg in result.segments {
    print("Speaker \(seg.speakerId): [\(seg.startTime)s - \(seg.endTime)s]")
}
// Target speaker extraction
let targetEmb = pipeline.embeddingModel.embed(audio: enrollmentAudio, sampleRate: 16000)
let segments = pipeline.extractSpeaker(
    audio: meetingAudio, sampleRate: 16000,
    targetEmbedding: targetEmb
)