Qwen3.5 Chat (On-Device LLM)
Qwen3.5-0.8B is a hybrid DeltaNet (linear attention) + GatedAttention model with 24 layers (18 DeltaNet + 6 GatedAttention), quantized to INT4 for MLX (Metal GPU) and INT8 for CoreML (Neural Engine). Runs on Mac via MLX or on iPhone and Mac via CoreML with streaming token generation. Designed for voice pipelines where an on-device LLM provides the "brain" between ASR and TTS.
Qwen3.5 Chat integrates with the SpeechCore VoicePipeline as the LLM component in ASR → LLM → TTS chains. The hybrid DeltaNet architecture provides efficient linear-time attention for long contexts.
Quick Start
import Qwen3Chat
let chat = try await Qwen35MLXChat.fromPretrained()
// Single response
let response = try chat.generate(messages: [
ChatMessage(role: .system, content: "Answer briefly."),
ChatMessage(role: .user, content: "What is Swift?")
])
print(response)
// Streaming tokens
let stream = chat.generateStream(messages: [
ChatMessage(role: .system, content: "Be funny."),
ChatMessage(role: .user, content: "Tell me a joke")
])
for try await token in stream {
print(token, terminator: "")
}
Architecture
Qwen3.5-0.8B is a hybrid model with 24 layers: 18 DeltaNet layers (linear attention with gated delta rule recurrence and RMSNormGated) and 6 GatedAttention layers (standard scaled dot-product attention). The MLX backend runs inference on the Metal GPU with safetensors weights. The CoreML backend uses a dual-model architecture (prefill + decode) optimized for the Neural Engine. Both support KV cache with prompt caching and configurable sampling (temperature, top-k, top-p, repetition penalty).
Model I/O
| Direction | Name | Shape | Description |
|---|---|---|---|
| Input | input_ids | [1, seq_len] | Token IDs (Int32) |
| Input | attention_mask | [1, seq_len] | Attention mask (Int32) |
| Input | kv_cache | per-layer | Key-value cache state |
| Output | logits | [1, 1, 151936] | Next-token logits (Float16) |
| Output | kv_cache_out | per-layer | Updated KV cache |
Model Variants
| Variant | Quantization | Size | Compute | HuggingFace |
|---|---|---|---|---|
| Qwen3.5-0.8B Chat | INT4 | 418 MB | Metal GPU (MLX) | aufklarer/Qwen3.5-0.8B-Chat-MLX |
| Qwen3.5-0.8B Chat | INT8 | 981 MB | Neural Engine (CoreML) | aufklarer/Qwen3.5-0.8B-Chat-CoreML |
Sampling Configuration
let config = ChatSamplingConfig(
temperature: 0.7,
topK: 40,
topP: 0.9,
maxTokens: 128,
repetitionPenalty: 1.1
)
let response = try chat.generate(
messages: [ChatMessage(role: .user, content: "Explain gravity")],
sampling: config
)
| Parameter | Default | Description |
|---|---|---|
temperature | 0.6 | Randomness (0 = greedy, 1 = creative) |
topK | 50 | Keep top K candidates |
topP | 0.95 | Nucleus sampling threshold |
maxTokens | 512 | Max response tokens |
repetitionPenalty | 1.1 | Penalize repeated tokens |
disableThinking | false | Skip thinking mode |
maxThinkingTokens | 100 | Cap thinking tokens |
Multi-turn Conversation
let chat = try await Qwen35MLXChat.fromPretrained()
let history = [
ChatMessage(role: .system, content: "Remember the user's name."),
ChatMessage(role: .user, content: "My name is Alex"),
ChatMessage(role: .assistant, content: "Nice to meet you, Alex!"),
ChatMessage(role: .user, content: "What's my name?")
]
let response = try chat.generate(messages: history)
print(response) // "Your name is Alex!"
chat.resetState() // Clear inference state for a new conversation
Memory Management
// Check memory state
print(chat.isLoaded) // true
print(chat.memoryFootprint) // 438304768 (~418 MB)
// Free memory under pressure
chat.unload()
print(chat.isLoaded) // false
// Reload when needed
let chat = try await Qwen35MLXChat.fromPretrained()
On iPhone, unloading the LLM before TTS inference frees ~418 MB (INT4 MLX) or ~981 MB (INT8 CoreML), preventing jetsam termination when running full ASR → LLM → TTS pipelines.
Performance
| Device | Prefill | Decode | Tokens/sec |
|---|---|---|---|
| M2 Max | ~50ms | ~65ms/tok | ~15 tok/s |
| iPhone 16 Pro | ~1.5s | ~450ms/tok | ~2.2 tok/s |
Conversion
MLX weights are converted from the original Qwen3.5-0.8B checkpoint using the MLX conversion script. CoreML models use a separate conversion script for Neural Engine deployment. Pre-converted weights are available on HuggingFace at aufklarer/Qwen3.5-0.8B-Chat-MLX (INT4: 418 MB) and aufklarer/Qwen3.5-0.8B-Chat-CoreML (INT8: 981 MB).