Speech Restoration — Sidon

Restore noisy, reverberant, or band-limited speech to clean 48 kHz audio with Sidon — a single model that combines denoising, dereverberation, and bandwidth extension. It runs entirely on-device (CoreML on Apple Silicon, ONNX Runtime elsewhere). Because it reconstructs speech from learned representations rather than just masking noise, it is especially good at preparing a voice-cloning reference: it cleans the recording while preserving the speaker's identity.

When to use Sidon vs. DeepFilterNet3

Speech Enhancement (DeepFilterNet3) is a tiny, real-time noise suppressor. Sidon is a heavier generative restoration model: it also removes reverberation and rebuilds high-frequency detail to 48 kHz. Use DeepFilterNet3 for live noise removal, Sidon for offline cleanup of references and archival recordings.

Architecture

Sidon is a two-stage model behind a log-mel front-end: a self-supervised feature predictor cleanses the speech representation, and a neural vocoder resynthesizes a clean waveform from it.

Stage	Details
Front-end	w2v-BERT 2.0 SeamlessM4T log-mel features (16 kHz → 160-dim)
Predictor	w2v-BERT 2.0 (8 layers) with a LoRA-fine-tuned cleanse head → cleansed features
Vocoder	DAC decoder resynthesizes 48 kHz audio from the cleansed features

The data flow is 16 kHz audio → features → predictor → DAC decoder → 48 kHz audio. The model totals about 246M parameters (193.6M predictor + 52.4M vocoder).
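The data flow can be sketched with shape-only stubs. This is a minimal illustration, not the library's implementation: the function names are hypothetical, the 20 ms frame hop is an assumption (the source states only the 160-dim feature size and the 16 kHz and 48 kHz rates), and the stub predictor simply keeps the input shape because the real output dimension is not stated.

```swift
import Foundation

let featureDim = 160       // from the stage table
let hopSamples16k = 320    // ASSUMED: one feature frame per 20 ms of 16 kHz audio
let upsample = 3           // 48 kHz / 16 kHz

typealias Features = [[Float]]   // frames x feature dimension

// Hypothetical stub: reproduces only the output shape of the front-end.
func extractFeatures(_ audio16k: [Float]) -> Features {
    let frames = audio16k.count / hopSamples16k
    return Array(repeating: [Float](repeating: 0, count: featureDim), count: frames)
}

// Hypothetical stub: the real predictor maps noisy features to clean ones.
// Its output dimension is not stated in the source; the input shape is kept here.
func cleanse(_ noisy: Features) -> Features {
    noisy
}

// Hypothetical stub: the real vocoder synthesizes 48 kHz audio from the features.
func decode(_ clean: Features) -> [Float] {
    [Float](repeating: 0, count: clean.count * hopSamples16k * upsample)
}

// Front-end, predictor, and vocoder chained in order.
func restore(_ audio16k: [Float]) -> [Float] {
    decode(cleanse(extractFeatures(audio16k)))
}

let oneSecond = [Float](repeating: 0, count: 16_000)
let out = restore(oneSecond)
print(out.count) // 48000
```

The only property the sketch demonstrates is that one second of 16 kHz input becomes one second of 48 kHz output.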

Processing Pipeline

Feature extraction — Compute the w2v-BERT 2.0 log-mel features from the 16 kHz input (Accelerate/vDSP on Apple, C++ on other platforms)
Predictor — The LoRA-adapted w2v-BERT encoder maps noisy/reverberant features to clean ones
Vocoder — The DAC decoder reconstructs a clean 48 kHz waveform from the cleansed features
Chunking — Longer audio is processed in fixed windows (~10 s) and stitched on the 48 kHz timeline
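The chunking step above can be sketched as follows. This assumes non-overlapping 10 s windows and plain concatenation; the source does not say whether the real implementation overlaps or crossfades windows. `restoreChunk` is a hypothetical stand-in for the model call, not part of the library API.

```swift
import Foundation

let inputRate = 16_000
let outputRate = 48_000
let windowSeconds = 10     // the "~10 s" window, taken as exactly 10 s here

/// Splits `input` into fixed windows, restores each one, and writes the result
/// at the matching offset on the 48 kHz timeline.
/// `restoreChunk` (hypothetical) must return 3 output samples per input sample.
func restoreChunked(_ input: [Float],
                    restoreChunk: ([Float]) -> [Float]) -> [Float] {
    let window = windowSeconds * inputRate   // 160_000 input samples
    let ratio = outputRate / inputRate       // 3
    var output = [Float](repeating: 0, count: input.count * ratio)

    var start = 0
    while start < input.count {
        let end = min(start + window, input.count)
        let restored = restoreChunk(Array(input[start..<end]))
        precondition(restored.count == (end - start) * ratio,
                     "chunk length must scale by the rate ratio")
        // Place the chunk at its position on the 48 kHz timeline.
        output.replaceSubrange(start * ratio ..< end * ratio, with: restored)
        start = end
    }
    return output
}

// Placeholder "model": repeats each sample 3 times (no real restoration).
let fakeRestore: ([Float]) -> [Float] = { chunk in
    chunk.flatMap { [$0, $0, $0] }
}

let input = [Float](repeating: 0.25, count: 25 * inputRate) // 25 s of audio
let restored = restoreChunked(input, restoreChunk: fakeRestore)
print(restored.count) // 1200000 (25 s at 48 kHz)
```

Hard cuts between windows can produce audible seams with a generative model, so a real implementation may well overlap windows and crossfade; the sketch omits that for brevity.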

Quality

On a reverberant reference clip, restoration raises both no-reference MOS estimates (DNSMOS OVRL and UTMOS) while keeping speaker identity intact:

Audio	DNSMOS OVRL	UTMOS	Speaker cosine
Input (reverberant)	2.90	2.99	—
Sidon restored	3.29	3.40	0.79

The largest gain is in the background score, which is not listed in the table and reflects the removed reverberation. Speaker similarity is preserved, which is what matters when cleaning a cloning reference.
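The speaker cosine in the table is a cosine similarity between speaker embeddings. A minimal sketch of that computation follows, assuming the two embeddings come from the input and the restored clip; the source names neither the embedding model nor the exact pairing, and the vectors below are toy values.

```swift
import Foundation

/// Cosine similarity between two equal-length embedding vectors.
/// Returns nil when the lengths differ, the vectors are empty,
/// or either vector has zero norm.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float? {
    guard a.count == b.count, !a.isEmpty else { return nil }
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for (x, y) in zip(a, b) {
        dot += x * y
        normA += x * x
        normB += y * y
    }
    guard normA > 0, normB > 0 else { return nil }
    return dot / (normA.squareRoot() * normB.squareRoot())
}

// Toy 4-dimensional "embeddings"; real speaker embeddings are much larger.
let inputEmbedding: [Float] = [1, 0, 1, 0]
let restoredEmbedding: [Float] = [1, 0, 0, 0]
if let score = cosineSimilarity(inputEmbedding, restoredEmbedding) {
    print(score) // about 0.707
}
```

A value of 1 means the embeddings point in the same direction; lower values mean the restored voice has drifted from the original.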

Model Variants

In the int8 variants, quantization compresses only the predictor; the DAC vocoder stays at FP16 to protect audio quality. On Apple platforms, int8 means k-means palettization; with ONNX, it means weight-only, per-channel quantization.
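To make "weight-only, per-channel" concrete, here is a minimal sketch of symmetric int8 quantization with one scale per output channel. The symmetric scheme and the function names are assumptions for illustration; the source does not describe the exact recipe used for the ONNX export.

```swift
import Foundation

/// Symmetric, weight-only int8 quantization with one scale per output channel.
/// `weights[c]` holds the weights of output channel `c`.
func quantizePerChannel(_ weights: [[Float]]) -> (q: [[Int8]], scales: [Float]) {
    var quantized: [[Int8]] = []
    var scales: [Float] = []
    for row in weights {
        let maxAbs = row.map { abs($0) }.max() ?? 0
        // Map the largest magnitude in the channel to 127; avoid dividing by zero.
        let scale: Float = maxAbs > 0 ? maxAbs / 127 : 1
        scales.append(scale)
        quantized.append(row.map { w in
            Int8(max(-127, min(127, (w / scale).rounded())))
        })
    }
    return (quantized, scales)
}

/// Reconstructs approximate float weights from int8 values and channel scales.
func dequantize(_ q: [[Int8]], scales: [Float]) -> [[Float]] {
    zip(q, scales).map { row, s in row.map { Float($0) * s } }
}

// Two channels with very different magnitudes; each gets its own scale.
let weights: [[Float]] = [[0.3, -1.0, 0.1], [0.01, 0.03, -0.04]]
let (q, scales) = quantizePerChannel(weights)
print(q) // [[38, -127, 13], [32, 95, -127]]
print(dequantize(q, scales: scales)) // close to the original weights
```

Per-channel scales keep small-magnitude channels, like the second row, from being crushed by a single global scale. Activations stay in floating point, which is what "weight-only" refers to.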

Format	Precision	Bundle size
CoreML	int8 (predictor) + FP16 (vocoder)	~407 MB
CoreML	FP16	~713 MB
ONNX	int8 (predictor) + FP16 (vocoder)	~286 MB
ONNX	FP16	~470 MB
ONNX	FP32	~939 MB
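As a rough sanity check, the sizes can be estimated from the parameter counts in the Architecture section. The sketch below counts weight bytes only and reports them in MiB; it ignores graph metadata and quantization scales, and the helper name is hypothetical.

```swift
import Foundation

// Parameter counts from the Architecture section.
let predictorParams = 193.6e6
let vocoderParams = 52.4e6

/// Approximate weight storage in MiB for the given bytes per parameter.
/// Ignores graph metadata and quantization scales.
func weightMiB(predictorBytes: Double, vocoderBytes: Double) -> Double {
    let bytes = predictorParams * predictorBytes + vocoderParams * vocoderBytes
    return bytes / 1_048_576
}

let fp32 = weightMiB(predictorBytes: 4, vocoderBytes: 4)
let fp16 = weightMiB(predictorBytes: 2, vocoderBytes: 2)
let int8Mixed = weightMiB(predictorBytes: 1, vocoderBytes: 2)

print(String(format: "FP32        %.0f MiB", fp32))      // FP32        938 MiB
print(String(format: "FP16        %.0f MiB", fp16))      // FP16        469 MiB
print(String(format: "int8 + FP16 %.0f MiB", int8Mixed)) // int8 + FP16 285 MiB
```

These estimates land within a couple of MiB of the ONNX rows if the table's MB is read as MiB. The CoreML bundles are larger than this estimate; the source does not say why.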

CLI Usage

# Restore audio (denoise + dereverb) to clean 48 kHz
.build/release/speech restore noisy.wav -o clean.wav

# Clean a voice-cloning reference before TTS
.build/release/speech speak "Hello world" --voice-sample ref.wav --clean-reference

Important

Sidon outputs 48 kHz audio regardless of the input sample rate (it upsamples and restores bandwidth). It is an offline restoration model — heavier than DeepFilterNet3 — and is best run on a file rather than a live stream.

Model Downloads

Model	Precisions	Hugging Face repo
Sidon (CoreML)	FP16 + int8	aufklarer/Sidon-CoreML
Sidon (ONNX)	int8 + FP16 + FP32	soniqo/Sidon-ONNX

Combining with Other Models

Sidon is most useful as a preprocessing step:

Before voice cloning — Clean a noisy/reverberant reference so the clone inherits the voice, not the room
Before transcription — Restore archival or far-field recordings to improve ASR accuracy
Before speaker embedding — Cleaner audio yields more reliable embeddings

Swift API

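import SpeechRestoration

// Load the pretrained Sidon restorer.
let restorer = try await SpeechRestorer.fromPretrained()
// `noisySamples` holds the input audio at the stated sample rate; the result is 48 kHz audio.
let cleanAudio = try restorer.restore(audio: noisySamples, sampleRate: 16000)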

Also available on Android, Linux & Windows via Speech Core (ONNX Runtime). Built on Sidon (MIT).