Source Separation

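Open-Unmix HQ separates a stereo music track into four independent stems: vocals, drums, bass, and other. A separate BiLSTM model for each stem produces a magnitude mask over the mixture STFT, and an optional Wiener post-filter reconciles the four estimates. It runs on Apple Silicon via MLX.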

Two engines are available: Open-Unmix HQ (lightweight, the default) and HTDemucs (Demucs v4), a higher-quality Hybrid Transformer model selected with --engine htdemucs. Both run on Apple Silicon via MLX and output the same four stems at 44.1 kHz.

Overview

4 stems per track — vocals, drums, bass, other. Each is a 2-channel 44.1 kHz stem file.
Magnitude-mask model — each stem model predicts a non-negative mask applied to the mixture spectrogram; phase is taken from the mixture.
Wiener post-filter (optional) — soft-mask refinement across all 4 stems so they sum coherently back to the mixture. Adds ~0.5 dB SDR.
Small footprint — 8.9M params per stem, ~136 MB total for all 4 stems.
Apache-2.0 — upstream weights under MIT, our CoreML/MLX conversion under Apache-2.0.
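
The magnitude-mask idea from the list above can be illustrated with a minimal sketch. ComplexBin and applyMask are hypothetical names made up for this example and are not part of the package API. The point is only that scaling a complex STFT bin by a real, non-negative gain changes its magnitude while keeping the mixture phase.

```swift
import Foundation

/// One complex STFT bin, stored as real and imaginary parts.
/// Hypothetical helper type for illustration only.
struct ComplexBin {
    var re: Float
    var im: Float
    var magnitude: Float { (re * re + im * im).squareRoot() }
    var phase: Float { atan2(im, re) }
}

/// Applies a non-negative magnitude mask to mixture bins.
/// Multiplying re and im by the same real gain rescales the magnitude
/// but leaves the phase untouched, so the stem inherits the mixture phase.
func applyMask(_ mask: [Float], to mixture: [ComplexBin]) -> [ComplexBin] {
    precondition(mask.count == mixture.count, "mask and mixture must align")
    var stem: [ComplexBin] = []
    stem.reserveCapacity(mixture.count)
    for (gain, bin) in zip(mask, mixture) {
        let g = max(gain, 0)  // clamp: the mask is meant to be non-negative
        stem.append(ComplexBin(re: bin.re * g, im: bin.im * g))
    }
    return stem
}

// Two toy mixture bins and a toy mask for one stem.
let mixtureBins = [ComplexBin(re: 3, im: 4), ComplexBin(re: 0, im: -2)]
let vocalBins = applyMask([0.5, 0.25], to: mixtureBins)
```

In the real model the mask comes from the network output for each frame and bin; here it is a hand-written array.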

Architecture

Four independent stem models, each an instance of the same architecture with its own weights:

Stage	Shape / operation
STFT	4096-point FFT, 1024-hop, periodic Hann window, reflect-pad. 2049 frequency bins per frame.
Input normalize	Crop to 1487 bins (≈16 kHz), apply learned per-bin mean + scale from training.
Encoder	Linear 2974 → 512 + BatchNorm + tanh. Input is 2 channels × 1487 bins.
BiLSTM	3 layers, 256 hidden per direction (512 effective). Captures temporal context across frames.
Decoder	Skip-concat of encoder and LSTM outputs (1024) → Linear 1024 → 512 + BN + ReLU → Linear 512 → 4098.
Output denorm + mask	Element-wise multiply with mixture magnitude; phase from mixture; iSTFT overlap-add.
Wiener (optional)	Power-ratio masks across all 4 stem estimates. Refines phase so stems sum to mixture.
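
The power-ratio masks in the last table row can be sketched for a single time-frequency bin. This is a deliberately simplified, single-channel version written for illustration; the function name powerRatioMasks and the epsilon default are assumptions, and the shipped post-filter may differ (for example by operating on both channels jointly).

```swift
import Foundation

/// Simplified Wiener-style soft masks for one time-frequency bin.
/// `magnitudes` holds each stem's estimated magnitude at that bin.
/// Returns one gain per stem; the gains sum to ~1 whenever any stem has energy.
func powerRatioMasks(_ magnitudes: [Float], epsilon: Float = 1e-10) -> [Float] {
    let powers = magnitudes.map { $0 * $0 }
    // epsilon guards against division by zero in silent bins.
    let total = powers.reduce(0, +) + epsilon
    return powers.map { $0 / total }
}

// Hypothetical magnitude estimates for vocals, drums, bass, other at one bin.
let estimates: [Float] = [2, 1, 1, 0]
let masks = powerRatioMasks(estimates)

// Each refined stem is mask * (complex mixture bin). Because the masks sum
// to ~1, the four refined stems add back up to the mixture bin.
let maskSum = masks.reduce(0, +)
```

Because every stem is expressed as a fraction of the same mixture bin, energy that one model over-estimates is taken away from it in proportion to what the other models claim.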

Model

Component	Value
Parameters / stem	8.9M
Parameters total (4 stems)	~35.6M
Sample rate	44.1 kHz stereo
Chunk latency	Offline (full-track STFT)
Weights	aufklarer/OpenUnmix-HQ-MLX (safetensors, ~136 MB)
Upstream	sigsep/open-unmix-pytorch (Stöter et al., JOSS 2019)
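
The numbers in the two tables above can be cross-checked with a few lines of arithmetic. The only assumption added here is that the weights are stored as 32-bit floats (4 bytes per parameter), which is consistent with the ~136 MB figure but is not stated in the tables.

```swift
let fftSize = 4096
let sampleRate = 44_100.0
let channels = 2
let stemCount = 4

let frequencyBins = fftSize / 2 + 1                // 2049 one-sided bins
let binWidthHz = sampleRate / Double(fftSize)      // about 10.77 Hz per bin
let croppedBins = 1487
let cropHz = Double(croppedBins) * binWidthHz      // about 16.0 kHz
let encoderInputs = channels * croppedBins         // 2974
let decoderOutputs = channels * frequencyBins      // 4098

let paramsPerStem = 8.9e6
let totalParams = paramsPerStem * Double(stemCount)   // 35.6M
// Assumption: fp32 storage, 4 bytes per parameter.
let sizeMiB = totalParams * 4 / (1024 * 1024)      // about 135.8
```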

HTDemucs (Demucs v4)

For higher separation quality — especially on bass and drums — the package also ships HTDemucs, Meta's Hybrid Transformer Demucs. It merges a spectrogram branch and a waveform branch through a cross-domain transformer; the shipped htdemucs_ft variant is a bag of four fine-tuned sub-models, one per stem. Weights download from HuggingFace on first use. On a directional MUSDB-sample benchmark (museval / BSSEval v4) it averages +3.01 dB SDR over UMX-HQ, with the biggest gains on bass (+5.75 dB).

Component	Value
Parameters	168M (4 × 42M fine-tuned sub-models)
Sample rate	44.1 kHz stereo
Windowing	7.8 s segments, 25% overlap, triangular cross-fade
Weights	aufklarer/HTDemucs-FT-MLX (fp16, ~320 MB)
Upstream	facebookresearch/demucs (Rouard et al., ICASSP 2023)
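
The windowing row above (7.8 s segments, 25% overlap, triangular cross-fade) describes a standard overlap-add scheme. The sketch below shows one way such a scheme can work on a mono buffer; triangularWeights and overlapAdd are hypothetical helpers, the process closure stands in for the model, and the package's actual implementation may differ in detail.

```swift
import Foundation

/// Triangular window that rises to the middle of the segment and falls again.
func triangularWeights(length: Int) -> [Float] {
    precondition(length > 0, "segment must not be empty")
    let half = length / 2
    var weights: [Float] = []
    for i in 0..<half { weights.append(Float(i + 1)) }               // 1 ... half
    for i in 0..<(length - half) { weights.append(Float(length - half - i)) }
    return weights
}

/// Runs `process` on overlapping segments and cross-fades the results.
func overlapAdd(
    _ signal: [Float],
    segmentLength: Int,
    overlap: Double,
    process: ([Float]) -> [Float]
) -> [Float] {
    precondition(segmentLength > 0 && overlap >= 0 && overlap < 1)
    let stride = max(1, Int(Double(segmentLength) * (1 - overlap)))
    let weights = triangularWeights(length: segmentLength)
    var output = [Float](repeating: 0, count: signal.count)
    var weightSum = [Float](repeating: 0, count: signal.count)
    var start = 0
    while start < signal.count {
        let end = min(start + segmentLength, signal.count)
        let processed = process(Array(signal[start..<end]))
        precondition(processed.count == end - start, "process must keep length")
        for i in 0..<(end - start) {
            output[start + i] += weights[i] * processed[i]
            weightSum[start + i] += weights[i]
        }
        start += stride
    }
    // Dividing by the accumulated weight turns the sums into a weighted average.
    for i in output.indices { output[i] /= weightSum[i] }
    return output
}

// Segment length in samples for 7.8 s at 44.1 kHz.
let segmentSeconds = 7.8
let segmentSamples = Int((segmentSeconds * 44_100).rounded())

// Toy check with an identity "model": the cross-fade reproduces the input.
let ramp: [Float] = (0..<10).map { Float($0) }
let restored = overlapAdd(ramp, segmentLength: 4, overlap: 0.25) { $0 }
```

The triangular weights give samples near a segment's centre more influence than samples near its edges, which is where a model has the least context.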

Quick Start: Swift

import Foundation  // URL
import SourceSeparation
import AudioCommon

let separator = try await SourceSeparator.fromPretrained()

let stereo = try AudioFileLoader.loadStereo(
    url: URL(fileURLWithPath: "song.wav"),
    targetSampleRate: 44100
)

let stems = separator.separate(audio: stereo, sampleRate: 44100)
// stems[.vocals], stems[.drums], stems[.bass], stems[.other]
// Each is [[Float]] — left channel, right channel.

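// Write the vocals stem. The lookup returns an optional, so unwrap it before indexing.
if let vocals = stems[.vocals] {
    try WAVWriter.writeStereo(
        left: vocals[0],
        right: vocals[1],
        sampleRate: 44100,
        to: URL(fileURLWithPath: "vocals.wav")
    )
}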

The wiener: true setting is the default and gives the best quality. Pass targets: [.vocals] to extract only a subset of stems and skip the models for the others.
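
One quick sanity check on a full four-stem separation is to add the stems back together and compare against the mixture. The helper below is a hypothetical sketch that works on plain [channel][sample] arrays, matching the [[Float]] layout described above. How you collect the stems from the result of separate(...) depends on its actual return type, which is not assumed here. With the Wiener filter disabled, or with only a subset of targets, a large residual is expected.

```swift
/// Largest absolute difference between the mixture and the sum of the stems.
/// `mixture` and every entry of `stems` are laid out as [channel][sample].
func maxResidual(mixture: [[Float]], stems: [[[Float]]]) -> Float {
    var worst: Float = 0
    for (channel, samples) in mixture.enumerated() {
        for (index, value) in samples.enumerated() {
            var sum: Float = 0
            for stem in stems { sum += stem[channel][index] }
            worst = max(worst, abs(value - sum))
        }
    }
    return worst
}

// Toy stereo mixture with two samples per channel, split into two stems.
let toyMixture: [[Float]] = [[1.0, 0.5], [0.0, -0.5]]
let toyStems: [[[Float]]] = [
    [[0.75, 0.25], [0.0, -0.25]],
    [[0.25, 0.25], [0.0, -0.25]],
]
let residual = maxResidual(mixture: toyMixture, stems: toyStems)
```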

Command Line

speech separate song.wav                              # all 4 stems into song_stems/ (Open-Unmix)
speech separate song.wav --engine htdemucs            # Demucs v4 — higher quality
speech separate song.wav --engine htdemucs --htdemucs-precision int8  # smaller int8 bundle
speech separate song.wav --stems vocals               # vocals only
speech separate song.wav --stems vocals,drums         # subset
speech separate song.wav --output-dir /tmp/stems/     # custom output dir
speech separate song.wav --verbose                    # show timing

When to Use It

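Open-Unmix is a good fit when…

…you want lightweight offline source separation inside an Apple Silicon app or pipeline. At 8.9M parameters per stem, the download and memory cost are small. Magnitude masking plus Wiener filtering yields good-quality stems for most pop/rock content. For state-of-the-art vocal separation on studio material, switch to the built-in HTDemucs (Demucs v4) engine with --engine htdemucs. Open-Unmix remains the lightweight trade-off that can be shipped inside an app as-is.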