Speech Restoration — Sidon

Restore noisy, reverberant, or band-limited speech to clean 48 kHz audio with Sidon — a single model that combines denoising, dereverberation, and bandwidth extension. It runs entirely on-device (CoreML on Apple Silicon, ONNX Runtime elsewhere). Because it reconstructs speech from learned representations rather than just masking noise, it is especially good at preparing a voice-cloning reference: it cleans the recording while preserving the speaker's identity.

When to use Sidon vs. DeepFilterNet3

Speech Enhancement (DeepFilterNet3) is a tiny, real-time noise suppressor. Sidon is a heavier generative restoration model: it also removes reverberation and rebuilds high-frequency detail to 48 kHz. Use DeepFilterNet3 for live noise removal, Sidon for offline cleanup of references and archival recordings.

Architecture

Sidon is a two-stage pipeline: a self-supervised feature predictor cleanses the speech representation, and a neural vocoder resynthesizes a clean waveform from it.

| Stage | Details |
|---|---|
| Front-end | w2v-BERT 2.0 SeamlessM4T log-mel features (16 kHz → 160-dim) |
| Predictor | w2v-BERT 2.0 (8 layers) with a LoRA-fine-tuned cleanse head → cleansed features |
| Vocoder | DAC decoder resynthesizes 48 kHz audio from the cleansed features |

The pipeline is 16 kHz audio → features → predictor → DAC decoder → 48 kHz audio. Total ≈ 246M parameters (193.6M predictor + 52.4M vocoder).
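The 160-dim front-end width matches the public w2v-BERT 2.0 feature extractor, which computes 80 log-mel bins per frame and concatenates each pair of consecutive frames. The sketch below shows only that stacking step and assumes Sidon's front-end does the same. The name `stackFrames` is illustrative and not part of the library.

```swift
/// Concatenates every `stride` consecutive frames into one wider frame.
/// With 80 mel bins and stride 2 this yields 160-dim vectors.
/// In this sketch, trailing frames that do not fill a group are dropped.
func stackFrames(_ frames: [[Float]], stride: Int = 2) -> [[Float]] {
    precondition(stride > 0, "stride must be positive")
    let groupCount = frames.count / stride
    return (0..<groupCount).map { g -> [Float] in
        // Flatten frames g*stride ..< (g+1)*stride into a single vector.
        Array(frames[(g * stride)..<((g + 1) * stride)].joined())
    }
}

// Five dummy 80-bin frames -> two stacked 160-dim frames (the fifth is dropped).
let melFrames: [[Float]] = (0..<5).map { i in [Float](repeating: Float(i), count: 80) }
let stackedFrames = stackFrames(melFrames)
print(stackedFrames.count, stackedFrames[0].count) // 2 160
```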

Processing Pipeline

  1. Feature extraction — Compute the w2v-BERT 2.0 log-mel features from the 16 kHz input (Accelerate/vDSP on Apple, C++ on other platforms)
  2. Predictor — The LoRA-adapted w2v-BERT encoder maps noisy/reverberant features to clean ones
  3. Vocoder — The DAC decoder reconstructs a clean 48 kHz waveform from the cleansed features
  4. Chunking — Longer audio is processed in fixed windows (~10 s) and stitched on the 48 kHz timeline
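The chunking step can be sketched as follows. This is a simplified illustration, not the library's implementation: it assumes non-overlapping windows and plain concatenation, whereas the real stitching may overlap or crossfade. `restoreInChunks` and the `restoreChunk` closure are hypothetical names standing in for the predictor and vocoder.

```swift
/// Splits 16 kHz input into fixed windows, restores each one with the supplied
/// closure, and concatenates the 48 kHz results in order.
/// `restoreChunk` is a placeholder for the predictor + vocoder call.
func restoreInChunks(
    _ input: [Float],
    windowSeconds: Double = 10,
    inputRate: Int = 16_000,
    restoreChunk: ([Float]) throws -> [Float]
) rethrows -> [Float] {
    let window = Int(windowSeconds * Double(inputRate))
    precondition(window > 0, "window must contain at least one sample")
    var output: [Float] = []
    output.reserveCapacity(input.count * 3) // 48 kHz / 16 kHz = 3
    var start = 0
    while start < input.count {
        let end = min(start + window, input.count)
        // A chunk starting at `start` lands at offset start * 3 on the 48 kHz
        // timeline, provided every restored chunk is exactly 3x its input length.
        let restoredChunk = try restoreChunk(Array(input[start..<end]))
        output.append(contentsOf: restoredChunk)
        start = end
    }
    return output
}

// Stand-in "model": repeats each sample three times to mimic 16 kHz -> 48 kHz.
let fakeRestore: ([Float]) -> [Float] = { chunk in
    chunk.flatMap { [$0, $0, $0] }
}
let noisy16k = [Float](repeating: 0.5, count: 25 * 16_000) // 25 s of audio
let restored48k = restoreInChunks(noisy16k, restoreChunk: fakeRestore)
print(restored48k.count) // 1200000, i.e. 25 s at 48 kHz
```

With a 10 s window, the 25 s input above is processed as three chunks of 10 s, 10 s, and 5 s.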

Quality

On a reverberant reference clip, restoration lifts perceptual quality while keeping speaker identity intact (no-reference MOS):

| Audio | DNSMOS OVRL | UTMOS | Speaker cosine |
|---|---|---|---|
| Input (reverberant) | 2.90 | 2.99 | - |
| Sidon restored | 3.29 | 3.40 | 0.79 |

The largest gain is in the DNSMOS background score, consistent with the reverberation being removed. Speaker similarity is preserved (cosine 0.79 against the input), which is what matters when cleaning a cloning reference.
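For reference, the speaker cosine is the cosine similarity between speaker embeddings of the input and the restored audio. The source does not say which embedding model produced the figure above, so the sketch below covers only the similarity computation on two vectors that are assumed to be already extracted. `cosineSimilarity` is an illustrative helper, not a library function.

```swift
/// Cosine similarity between two equal-length embedding vectors.
/// Returns nil for empty or mismatched inputs and for zero-norm vectors.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float? {
    guard a.count == b.count, !a.isEmpty else { return nil }
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for (x, y) in zip(a, b) {
        dot += x * y
        normA += x * x
        normB += y * y
    }
    guard normA > 0, normB > 0 else { return nil }
    return dot / (normA.squareRoot() * normB.squareRoot())
}

// Toy 2-dim "embeddings"; real speaker embeddings have hundreds of dimensions.
let inputEmbedding: [Float] = [1, 0]
let restoredEmbedding: [Float] = [1, 1]
if let similarity = cosineSimilarity(inputEmbedding, restoredEmbedding) {
    print(similarity) // about 0.707 for these toy vectors
}
```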

Model Variants

Quantization compresses the predictor only; the DAC vocoder stays at higher precision to preserve audio quality. On Apple platforms, int8 uses k-means palettization; on ONNX, int8 is weight-only, per-channel quantization.
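To make "weight-only per-channel" concrete, here is a minimal sketch of symmetric per-channel int8 quantization of a weight matrix. The symmetric scheme and the `max|w| / 127` scale are assumptions made for illustration; the source does not state the exact recipe used for the ONNX export. The type and function names are hypothetical, and palettization is not shown.

```swift
/// One quantized output channel: int8 values plus the scale that restores them.
struct QuantizedRow {
    let values: [Int8]
    let scale: Float
}

/// Symmetric per-output-channel quantization: each row gets its own
/// scale = max|w| / 127, so a large row does not degrade a small one.
func quantizePerChannel(_ weights: [[Float]]) -> [QuantizedRow] {
    weights.map { row -> QuantizedRow in
        let maxAbs = row.map { abs($0) }.max() ?? 0
        // An all-zero (or empty) row keeps scale 1 to avoid dividing by zero.
        let scale: Float = maxAbs > 0 ? maxAbs / 127 : 1
        let quantized = row.map { Int8(($0 / scale).rounded()) }
        return QuantizedRow(values: quantized, scale: scale)
    }
}

/// Activations stay in floating point; weights are expanded back when used.
func dequantize(_ row: QuantizedRow) -> [Float] {
    row.values.map { Float($0) * row.scale }
}

// Two channels with very different ranges each keep their own resolution.
let demoWeights: [[Float]] = [[0.5, -1.0, 0.25], [0.01, 0.02, -0.04]]
let demoQuantized = quantizePerChannel(demoWeights)
for (original, q) in zip(demoWeights, demoQuantized) {
    print(original, "->", q.values, "scale", q.scale, "->", dequantize(q))
}
```

The round-trip error per weight is at most half a quantization step (`scale / 2`) for its channel.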

| Format | Precision | Bundle size |
|---|---|---|
| CoreML | int8 (predictor) + FP16 (vocoder) | ~407 MB |
| CoreML | FP16 | ~713 MB |
| ONNX | int8 (predictor) + FP16 (vocoder) | ~286 MB |
| ONNX | FP16 | ~470 MB |
| ONNX | FP32 | ~939 MB |
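As a rough sanity check, the ONNX bundle sizes are close to what the parameter counts alone predict (193.6M predictor + 52.4M vocoder), if "MB" is read as MiB and everything other than weights is ignored. Both of those are assumptions. The CoreML bundles are larger than this estimate, and the source does not say why.

```swift
let predictorParams = 193.6e6
let vocoderParams = 52.4e6
let bytesPerMiB = 1024.0 * 1024.0

/// Weight-only size in MiB, given bytes per parameter for each stage.
func estimatedMiB(predictorBytes: Double, vocoderBytes: Double) -> Double {
    (predictorParams * predictorBytes + vocoderParams * vocoderBytes) / bytesPerMiB
}

let fp32MiB = estimatedMiB(predictorBytes: 4, vocoderBytes: 4)
let fp16MiB = estimatedMiB(predictorBytes: 2, vocoderBytes: 2)
let int8MiB = estimatedMiB(predictorBytes: 1, vocoderBytes: 2) // int8 predictor, FP16 vocoder
print(fp32MiB.rounded(), fp16MiB.rounded(), int8MiB.rounded()) // 938.0 469.0 285.0
```

Each estimate lands about 1 MiB below the listed ONNX size (~939, ~470, ~286).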

CLI Usage

# Restore audio (denoise + dereverb) to clean 48 kHz
.build/release/speech restore noisy.wav -o clean.wav

# Clean a voice-cloning reference before TTS
.build/release/speech speak "Hello world" --voice-sample ref.wav --clean-reference

Important

Sidon outputs 48 kHz audio regardless of the input sample rate (it upsamples and restores bandwidth). It is an offline restoration model — heavier than DeepFilterNet3 — and is best run on a file rather than a live stream.

Model Downloads

| Model | Format | HuggingFace |
|---|---|---|
| Sidon (CoreML) | fp16 + int8 | aufklarer/Sidon-CoreML |
| Sidon (ONNX) | int8 + fp16 + fp32 | soniqo/Sidon-ONNX |

Combining with Other Models

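Sidon is most useful as a preprocessing step, for example cleaning a voice-cloning reference before TTS, as the --clean-reference flag in the CLI example does.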

Swift API

import SpeechRestoration

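// Load the pretrained model.
let restorer = try await SpeechRestorer.fromPretrained()
// noisySamples holds the input audio; sampleRate is its rate in Hz.
// The restored output is 48 kHz regardless of the input rate.
let cleanAudio = try restorer.restore(audio: noisySamples, sampleRate: 16000)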

Also available on Android, Linux & Windows via Speech Core (ONNX Runtime). Built on Sidon (MIT).