Speech Restoration — Sidon
Restore noisy, reverberant, or band-limited speech to clean 48 kHz audio with Sidon — a single model that combines denoising, dereverberation, and bandwidth extension. It runs entirely on-device (CoreML on Apple Silicon, ONNX Runtime elsewhere). Because it reconstructs speech from learned representations rather than just masking noise, it is especially good at preparing a voice-cloning reference: it cleans the recording while preserving the speaker's identity.
Speech Enhancement (DeepFilterNet3) is a tiny, real-time noise suppressor. Sidon is a heavier generative restoration model: it also removes reverberation and rebuilds high-frequency detail to 48 kHz. Use DeepFilterNet3 for live noise removal, Sidon for offline cleanup of references and archival recordings.
Architecture
Sidon is a two-stage pipeline: a self-supervised feature predictor cleanses the speech representation, and a neural vocoder resynthesizes a clean waveform from it.
| Stage | Details |
|---|---|
| Front-end | w2v-BERT 2.0 SeamlessM4T log-mel features (16 kHz → 160-dim) |
| Predictor | w2v-BERT 2.0 (8 layers) with a LoRA-fine-tuned cleanse head → cleansed features |
| Vocoder | DAC decoder resynthesizes 48 kHz audio from the cleansed features |
The pipeline is 16 kHz audio → features → predictor → DAC decoder → 48 kHz audio. Total ≈ 246M parameters (193.6M predictor + 52.4M vocoder).
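The 160-dim front-end figure is consistent with stacking pairs of 80-bin log-mel frames, which is how the SeamlessM4T feature extractor is usually configured. That 80 × 2 split is an assumption here, not something this page states. A minimal sketch of the stacking step, using a hypothetical `stackFrames` helper:

```swift
/// Stacks consecutive log-mel frames into wider feature vectors.
/// Assumption: 80 mel bins and a stride of 2, giving 160-dim features.
func stackFrames(_ frames: [[Float]], stride: Int = 2) -> [[Float]] {
    precondition(stride > 0, "stride must be positive")
    // Trailing frames that do not fill a complete group are dropped.
    let groupCount = frames.count / stride
    return (0..<groupCount).map { group -> [Float] in
        let start = group * stride
        // Concatenate `stride` neighbouring frames into one vector.
        return frames[start..<(start + stride)].flatMap { $0 }
    }
}

// Five 80-bin frames become two 160-dim vectors; the odd frame is dropped.
let melFrames = [[Float]](repeating: [Float](repeating: 0, count: 80), count: 5)
let stacked = stackFrames(melFrames)
print(stacked.count, stacked[0].count)
```

Stacking halves the frame rate seen by the predictor, so the encoder attends over a shorter sequence for the same audio duration.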
Processing Pipeline
- Feature extraction — Compute the w2v-BERT 2.0 log-mel features from the 16 kHz input (Accelerate/vDSP on Apple, C++ on other platforms)
- Predictor — The LoRA-adapted w2v-BERT encoder maps noisy/reverberant features to clean ones
- Vocoder — The DAC decoder reconstructs a clean 48 kHz waveform from the cleansed features
- Chunking — Longer audio is processed in fixed windows (~10 s) and stitched on the 48 kHz timeline
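The chunking step can be sketched as a plan that maps each input window to its position on the 48 kHz output timeline. This is only an illustration: the `Chunk` type and `planChunks` function are hypothetical, and the sketch assumes non-overlapping 10 s windows, whereas the actual stitching (overlap, cross-fade) is not described here.

```swift
/// One processing window, expressed on both timelines.
struct Chunk {
    let input: Range<Int>   // sample indices in the 16 kHz input
    let output: Range<Int>  // sample indices in the 48 kHz output
}

/// Splits `sampleCount` input samples into fixed windows.
/// Assumptions: non-overlapping windows and an integer rate ratio (3x here).
func planChunks(sampleCount: Int,
                inputRate: Int = 16_000,
                outputRate: Int = 48_000,
                windowSeconds: Int = 10) -> [Chunk] {
    precondition(outputRate % inputRate == 0, "sketch assumes an integer rate ratio")
    let ratio = outputRate / inputRate
    let window = inputRate * windowSeconds
    var chunks: [Chunk] = []
    var start = 0
    while start < sampleCount {
        // The final window may be shorter than `window`.
        let end = min(start + window, sampleCount)
        chunks.append(Chunk(input: start..<end,
                            output: (start * ratio)..<(end * ratio)))
        start = end
    }
    return chunks
}

// 25 s of 16 kHz audio yields windows of 10 s, 10 s and 5 s.
let plan = planChunks(sampleCount: 25 * 16_000)
for chunk in plan {
    print(chunk.input, "->", chunk.output)
}
```

Keeping the output range alongside the input range means each restored window can be written straight into a preallocated 48 kHz buffer, regardless of the order in which windows finish.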
Quality
On a reverberant reference clip, restoration raises predicted perceptual quality while keeping speaker identity intact. DNSMOS and UTMOS are no-reference MOS predictors, so higher is better:
| Audio | DNSMOS OVRL | UTMOS | Speaker cosine |
|---|---|---|---|
| Input (reverberant) | 2.90 | 2.99 | — |
| Sidon restored | 3.29 | 3.40 | 0.79 |
DNSMOS OVRL rises by 0.39 and UTMOS by 0.41. The largest gain is in the background score (not shown in the table), which reflects the removed reverberation. Speaker similarity is preserved, which is what matters when cleaning a cloning reference.
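The speaker cosine column is presumably the cosine similarity between speaker embeddings of the input and the restored audio; which embedding model produced it is not stated. As a sketch of the metric itself, with toy vectors standing in for real embeddings:

```swift
/// Cosine similarity between two equal-length embedding vectors.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count && !a.isEmpty, "vectors must match and be non-empty")
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for (x, y) in zip(a, b) {
        dot += x * y
        normA += x * x
        normB += y * y
    }
    let denominator = (normA * normB).squareRoot()
    // Return 0 for an all-zero vector instead of dividing by zero.
    return denominator > 0 ? dot / denominator : 0
}

// Hypothetical 3-dim embeddings; real speaker embeddings are much larger.
let original: [Float] = [1, 0, 0]
let restored: [Float] = [1, 1, 0]
print(cosineSimilarity(original, restored))  // about 0.707
```

A value of 1 means the two embeddings point in the same direction; values closer to 0 indicate that the restored audio no longer resembles the original speaker.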
Model Variants
Quantization compresses only the predictor; the DAC vocoder stays at FP16 to protect audio quality. On Apple, int8 uses k-means palettization; on ONNX, int8 is weight-only and per-channel.
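To make "weight-only per-channel" concrete, the sketch below quantizes each row of a weight matrix with its own scale. It illustrates the general technique only and is not the exporter used for these models; the symmetric scheme (largest magnitude mapped to 127) and both function names are assumptions.

```swift
/// Symmetric per-channel int8 quantization of a weight matrix.
/// Each row is treated as one output channel with its own scale.
func quantizePerChannel(_ weights: [[Float]]) -> (values: [[Int8]], scales: [Float]) {
    var values: [[Int8]] = []
    var scales: [Float] = []
    for row in weights {
        let maxAbs = row.map { abs($0) }.max() ?? 0
        // Map the largest magnitude in the row to 127; avoid a zero scale.
        let scale = maxAbs > 0 ? maxAbs / 127 : 1
        scales.append(scale)
        values.append(row.map { Int8(($0 / scale).rounded()) })
    }
    return (values, scales)
}

/// Recovers approximate float weights from the int8 values and scales.
func dequantize(_ values: [[Int8]], scales: [Float]) -> [[Float]] {
    zip(values, scales).map { row, scale in row.map { Float($0) * scale } }
}

let weights: [[Float]] = [[1.27, -0.5, 0.25], [-25.4, 10.0, 2.0]]
let (q, s) = quantizePerChannel(weights)
print(q)  // [[127, -50, 25], [-127, 50, 10]]
print(dequantize(q, scales: s))
```

Because each row has its own scale, a channel with small weights keeps its resolution even when another channel has weights twenty times larger. Activations stay in floating point, which is what "weight-only" refers to.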
| Format | Precision | Bundle size |
|---|---|---|
| CoreML | int8 (predictor) + FP16 (vocoder) | ~407 MB |
| CoreML | FP16 | ~713 MB |
| ONNX | int8 (predictor) + FP16 (vocoder) | ~286 MB |
| ONNX | FP16 | ~470 MB |
| ONNX | FP32 | ~939 MB |
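The ONNX sizes can be roughly cross-checked against the parameter counts given under Architecture. The sketch below assumes the listed sizes are in MiB and ignores quantization scales and graph metadata, so it is only a sanity check; `weightBytes` is a hypothetical helper.

```swift
/// Bytes needed to store `parameters` weights at `bytesPerWeight` each.
func weightBytes(parameters: Double, bytesPerWeight: Double) -> Double {
    parameters * bytesPerWeight
}

let predictor = 193.6e6  // parameters, from the Architecture section
let vocoder = 52.4e6
let mib = 1_048_576.0    // assumption: the listed sizes are in MiB

let fp32 = weightBytes(parameters: predictor + vocoder, bytesPerWeight: 4) / mib
let fp16 = weightBytes(parameters: predictor + vocoder, bytesPerWeight: 2) / mib
// int8 predictor (1 byte per weight) plus FP16 vocoder (2 bytes per weight).
let mixed = (weightBytes(parameters: predictor, bytesPerWeight: 1)
           + weightBytes(parameters: vocoder, bytesPerWeight: 2)) / mib

print(fp32.rounded(), fp16.rounded(), mixed.rounded())
```

Under these assumptions the estimates come out at about 938, 469 and 285, each within a couple of MB of the ONNX rows above. The estimate does not account for the larger CoreML bundles.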
CLI Usage
```bash
# Restore audio (denoise + dereverb) to clean 48 kHz
.build/release/speech restore noisy.wav -o clean.wav

# Clean a voice-cloning reference before TTS
.build/release/speech speak "Hello world" --voice-sample ref.wav --clean-reference
```
Sidon outputs 48 kHz audio regardless of the input sample rate (it upsamples and restores bandwidth). It is an offline restoration model — heavier than DeepFilterNet3 — and is best run on a file rather than a live stream.
Model Downloads
| Model | Format | HuggingFace |
|---|---|---|
| Sidon (CoreML) | int8 + FP16 | aufklarer/Sidon-CoreML |
| Sidon (ONNX) | int8 + FP16 + FP32 | soniqo/Sidon-ONNX |
Combining with Other Models
Sidon is most useful as a preprocessing step:
- Before voice cloning — Clean a noisy/reverberant reference so the clone inherits the voice, not the room
- Before transcription — Restore archival or far-field recordings to improve ASR accuracy
- Before speaker embedding — Cleaner audio yields more reliable embeddings
Swift API
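```swift
import SpeechRestoration

// Load the pretrained Sidon model.
let restorer = try await SpeechRestorer.fromPretrained()
// noisySamples holds the input audio; sampleRate is its rate in Hz. The result is 48 kHz audio.
let cleanAudio = try restorer.restore(audio: noisySamples, sampleRate: 16000)
```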
Also available on Android, Linux & Windows via Speech Core (ONNX Runtime). Built on Sidon (MIT).