Source Separation

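Open-Unmix HQ separates a stereo music track into four independent stems: vocals, drums, bass, and other. Four independent BiLSTM models (one per stem) generate magnitude masks on the mixture STFT; an optional Wiener post-filter reconciles them. Runs on Apple Silicon via MLX.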

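Two engines are available: Open-Unmix HQ (lightweight, the default) and HTDemucs (Demucs v4), a higher-quality Hybrid Transformer model that can be selected with --engine htdemucs. Both run on Apple Silicon via MLX and output the same four stems at a 44.1 kHz sample rate.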

Overview

Architecture

Four independent stems, each a copy of the same network:

Stage | Shape / operation
STFT | 4096-point FFT, hop 1024, periodic Hann window, reflect padding. 2049 frequency bins per frame.
Input normalize | Crop to 1487 bins (≈16 kHz), then apply the per-bin mean and scale learned during training.
Encoder | Linear 2974 → 512 + BatchNorm + tanh. Input is 2 channels × 1487 bins.
BiLSTM | 3 layers, 256 hidden units per direction (512 effective). Captures temporal context across frames.
Decoder | Skip-concat of encoder and LSTM outputs (1024) → Linear 1024 → 512 + BatchNorm + ReLU → Linear 512 → 4098.
Output denorm + mask | Element-wise multiply with the mixture magnitude; phase taken from the mixture; iSTFT with overlap-add.
Wiener (optional) | Power-ratio masks across all 4 stem estimates. Refines phase so the stems sum to the mixture.
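The Wiener row can be read as a soft mask: at each time-frequency bin, every stem receives the share of the mixture that matches its share of the total estimated power, which is why the refined stems add back up to the mixture. The sketch below illustrates this for a single real-valued bin. The helper name `powerRatioMasks` and the `epsilon` guard are illustrative assumptions, not part of the package API, and the shipped filter works on complex stereo STFTs and may be more elaborate.

```swift
/// Per-bin soft masks: each stem's share of the total estimated power.
/// `magnitudes` holds one magnitude estimate per stem for a single
/// time-frequency bin. Hypothetical helper, not part of the package API.
func powerRatioMasks(_ magnitudes: [Float], epsilon: Float = 1e-10) -> [Float] {
    let powers = magnitudes.map { $0 * $0 }
    let total = powers.reduce(0, +) + epsilon   // guard against silent bins
    return powers.map { $0 / total }
}

// Made-up magnitude estimates for vocals, drums, bass, other at one bin.
let estimates: [Float] = [0.8, 0.3, 0.1, 0.2]
let masks = powerRatioMasks(estimates)

// Apply the masks to the mixture value at that bin.
let mixtureBin: Float = 1.25
let refined = masks.map { $0 * mixtureBin }

// The masks sum to (almost exactly) 1, so the refined stems sum to the mixture.
let reconstructed = refined.reduce(0, +)
```

Because the masks are ratios of the stems' own estimates, a stem that dominates a bin keeps most of the mixture energy there, while weak estimates are suppressed further.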

Model

Component | Value
Parameters / stem | 8.9M
Parameters total (4 stems) | ~35.6M
Sample rate | 44.1 kHz stereo
Chunk latency | Offline (full-track STFT)
Weights | aufklarer/OpenUnmix-HQ-MLX (safetensors, ~136 MB)
Upstream | sigsep/open-unmix-pytorch (Stöter et al., JOSS 2019)
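The shapes and sizes quoted in the architecture and model tables follow from a few lines of arithmetic, sketched below as a cross-check. The only input not stated in the tables is the assumption that weights are stored as 32-bit floats; the variable names are illustrative.

```swift
let sampleRate = 44_100.0
let fftSize = 4096
let hop = 1024
let channels = 2

// One-sided spectrum of a 4096-point FFT.
let bins = fftSize / 2 + 1                        // 2049
let binWidthHz = sampleRate / Double(fftSize)     // about 10.77 Hz per bin
let framesPerSecond = sampleRate / Double(hop)    // about 43.07 frames per second

// The network only sees the lowest 1487 bins.
let croppedBins = 1487
let cropHz = Double(croppedBins) * binWidthHz     // about 16.0 kHz

// Encoder input and decoder output widths (stereo, channels flattened).
let encoderInput = channels * croppedBins         // 2974
let decoderOutput = channels * bins               // 4098

// Parameter count and approximate weight size.
let paramsPerStem = 8.9e6
let totalParams = 4 * paramsPerStem               // 35.6M
// Assumption: each parameter is a 32-bit float (4 bytes).
let weightMiB = totalParams * 4 / (1024 * 1024)   // about 135.8
```

The decoder emits all 2049 bins per channel even though the encoder only reads 1487, so the mask covers the full spectrum.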

HTDemucs (Demucs v4)

For higher separation quality, especially on bass and drums, the package also ships HTDemucs, Meta's Hybrid Transformer Demucs. It merges a spectrogram branch and a waveform branch through a cross-domain transformer; the shipped htdemucs_ft variant is a bag of four fine-tuned sub-models, one per stem. Weights download from HuggingFace on first use. On an indicative MUSDB-sample benchmark (museval / BSSEval v4) it averages +3.01 dB SDR over UMX-HQ, with the largest gain on bass (+5.75 dB).
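To give the SDR figures some intuition, the sketch below computes a plain signal-to-distortion ratio between a reference stem and an estimate. This is a simplified stand-in, not the museval / BSSEval v4 procedure, which evaluates windowed frames and allows for filtering distortion; `simpleSDR` is a hypothetical helper.

```swift
import Foundation

/// Plain SDR in dB: reference energy over error energy.
/// Simplified illustration, not the BSSEval v4 definition.
func simpleSDR(reference: [Float], estimate: [Float]) -> Float {
    precondition(reference.count == estimate.count, "signals must have equal length")
    var signalEnergy: Float = 0
    var errorEnergy: Float = 0
    for (r, e) in zip(reference, estimate) {
        signalEnergy += r * r
        errorEnergy += (r - e) * (r - e)
    }
    // Clamp the denominator so a perfect estimate does not divide by zero.
    return 10 * log10(signalEnergy / max(errorEnergy, 1e-12))
}

// An estimate that is 10% too quiet has 1/100 of the reference energy as error.
let reference: [Float] = [1, -1, 1, -1]
let estimate = reference.map { $0 * 0.9 }
let sdr = simpleSDR(reference: reference, estimate: estimate)   // about 20 dB

// On this scale, +3.01 dB means the error energy is roughly halved.
let energyRatio = pow(10.0, 3.01 / 10.0)                        // about 2.0
```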

Component | Value
Parameters | 168M (4 × 42M fine-tuned sub-models)
Sample rate | 44.1 kHz stereo
Windowing | 7.8 s segments, 25% overlap, triangular cross-fade
Weights | aufklarer/HTDemucs-FT-MLX (fp16, ~320 MB)
Upstream | facebookresearch/demucs (Rouard et al., ICASSP 2023)
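The windowing row describes a weighted overlap-add: the track is cut into overlapping segments, each segment is processed on its own, and the results are blended with triangular weights so that segment boundaries cross-fade instead of clicking. The following is a minimal sketch of that idea on a mono signal, with an identity closure standing in for the model. The function names are hypothetical and the package's actual implementation may differ in detail.

```swift
/// Triangular weights that ramp up to the centre of a segment and back down.
func triangularWeights(_ length: Int) -> [Float] {
    precondition(length >= 2, "segment must hold at least two samples")
    let half = length / 2
    let rising = (1...half).map { Float($0) }
    let falling = stride(from: length - half, through: 1, by: -1).map { Float($0) }
    let peak = Float(max(half, length - half))
    return (rising + falling).map { $0 / peak }
}

/// Runs `process` on overlapping segments and blends the results.
/// `process` must return as many samples as it receives.
func processInSegments(
    _ signal: [Float],
    segmentLength: Int,
    overlap: Double,
    process: ([Float]) -> [Float]
) -> [Float] {
    precondition(overlap >= 0 && overlap < 1, "overlap is a fraction in [0, 1)")
    let step = max(1, Int(Double(segmentLength) * (1 - overlap)))
    let weights = triangularWeights(segmentLength)
    var output = [Float](repeating: 0, count: signal.count)
    var weightSum = [Float](repeating: 0, count: signal.count)

    var start = 0
    while start < signal.count {
        let end = min(start + segmentLength, signal.count)
        let processed = process(Array(signal[start..<end]))
        precondition(processed.count == end - start, "process must preserve length")
        for i in 0..<(end - start) {
            output[start + i] += weights[i] * processed[i]
            weightSum[start + i] += weights[i]
        }
        start += step
    }
    // Normalise by the accumulated weights so overlaps do not change the gain.
    return zip(output, weightSum).map { $0 / $1 }
}

// Segment length for 7.8 s at 44.1 kHz.
let segmentLength = Int((7.8 * 44_100.0).rounded())   // 343_980 samples

// Toy run: short segments and an identity "model" reproduce the input.
let toy = (0..<1000).map { Float($0) * 0.001 }
let blended = processInSegments(toy, segmentLength: 100, overlap: 0.25) { $0 }
```

Dividing by the accumulated weights, rather than relying on the windows summing to one, keeps the gain flat at the start and end of the track where only one segment contributes.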

Quick start: Swift

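import Foundation
import SourceSeparation
import AudioCommon

// Load the pretrained separator with default settings.
let separator = try await SourceSeparator.fromPretrained()

// Read the input file as stereo at the model's 44.1 kHz sample rate.
let stereo = try AudioFileLoader.loadStereo(
    url: URL(fileURLWithPath: "song.wav"),
    targetSampleRate: 44100
)

let stems = separator.separate(audio: stereo, sampleRate: 44100)
// stems[.vocals], stems[.drums], stems[.bass], stems[.other]
// Each is [[Float]]: left channel, right channel.

// Unwrap the stem explicitly instead of force-unwrapping it.
guard let vocals = stems[.vocals] else {
    fatalError("vocals stem missing from separation result")
}

try WAVWriter.writeStereo(
    left: vocals[0],
    right: vocals[1],
    sampleRate: 44100,
    to: URL(fileURLWithPath: "vocals.wav")
)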

Wiener filtering is on by default (wiener: true) and gives the best quality. Pass targets: [.vocals] to extract only a subset of stems and skip the other models.

Command line

speech separate song.wav                              # all 4 stems into song_stems/ (Open-Unmix)
speech separate song.wav --engine htdemucs            # Demucs v4 — higher quality
speech separate song.wav --engine htdemucs --htdemucs-precision int8  # smaller int8 bundle
speech separate song.wav --stems vocals               # vocals only
speech separate song.wav --stems vocals,drums         # subset
speech separate song.wav --output-dir /tmp/stems/     # custom output dir
speech separate song.wav --verbose                    # show timing

When to use

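Open-Unmix is a good fit…

…when you need lightweight offline source separation in an Apple Silicon app or pipeline. At 8.9M parameters per stem, download size and memory overhead stay modest. For most pop/rock content, magnitude masking plus Wiener filtering produces good-quality stems. For state-of-the-art vocal separation on studio material, switch to the built-in HTDemucs (Demucs v4) engine with --engine htdemucs; Open-Unmix stays at the lightweight end of the trade-off, small enough to ship with your app.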