Qwen3.5 Chat (On-Device LLM)

Qwen3.5-0.8B is a hybrid DeltaNet (linear attention) + GatedAttention model with 24 layers (18 DeltaNet + 6 GatedAttention), quantized to INT4 for MLX (Metal GPU) and to INT8 for CoreML (Neural Engine). It runs on Mac via MLX, or on iPhone and Mac via CoreML, with streaming token generation. It is designed for voice pipelines, where an on-device LLM serves as the "brain" between ASR and TTS.

Voice Pipeline Ready

Qwen3.5 Chat integrates with the SpeechCore VoicePipeline as the LLM component of the ASR → LLM → TTS chain. The hybrid DeltaNet architecture provides efficient linear-time attention for long contexts.

Quick Start

import Qwen3Chat

let chat = try await Qwen35MLXChat.fromPretrained()

// Single response
let response = try chat.generate(messages: [
    ChatMessage(role: .system, content: "Answer briefly."),
    ChatMessage(role: .user, content: "What is Swift?")
])
print(response)

// Streaming tokens
let stream = chat.generateStream(messages: [
    ChatMessage(role: .system, content: "Be funny."),
    ChatMessage(role: .user, content: "Tell me a joke")
])
for try await token in stream {
    print(token, terminator: "")
}

Architecture

Qwen3.5-0.8B is a hybrid model with 24 layers: 18 DeltaNet layers (linear attention with gated delta rule recurrence and RMSNormGated) and 6 GatedAttention layers (standard scaled dot-product attention). The MLX backend runs inference on the Metal GPU using safetensors weights. The CoreML backend uses a dual-model architecture (prefill + decode) optimized for the Neural Engine. Both backends support a KV cache with prompt caching, plus configurable sampling (temperature, top-k, top-p, repetition penalty).
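
To make the 18 + 6 split concrete, the following is a minimal sketch that builds a 24-entry layer schedule. The interleaving used here (every fourth layer is GatedAttention) is an assumption chosen only because it yields the stated counts; the text above gives the counts, not the order. `LayerKind` and `makeLayerSchedule` are hypothetical names and are not part of the library.

```swift
// Hypothetical sketch: only the counts (18 + 6 of 24) come from the text.
// The layer ORDER below is an assumption for illustration.
enum LayerKind: Equatable {
    case deltaNet        // linear attention with gated delta rule recurrence
    case gatedAttention  // standard scaled dot-product attention
}

/// Builds a schedule in which every `period`-th layer is GatedAttention
/// and all remaining layers are DeltaNet.
func makeLayerSchedule(layerCount: Int = 24, period: Int = 4) -> [LayerKind] {
    (0..<layerCount).map { index in
        (index + 1) % period == 0 ? .gatedAttention : .deltaNet
    }
}

let schedule = makeLayerSchedule()
let deltaNetCount = schedule.filter { $0 == .deltaNet }.count
let gatedCount = schedule.filter { $0 == .gatedAttention }.count
print("DeltaNet: \(deltaNetCount), GatedAttention: \(gatedCount)")
// prints "DeltaNet: 18, GatedAttention: 6"
```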

Model I/O

Direction	Name	Shape	Description
Input	`input_ids`	[1, seq_len]	Token IDs (Int32)
Input	`attention_mask`	[1, seq_len]	Attention mask (Int32)
Input	`kv_cache`	per-layer	Key-value cache state
Output	`logits`	[1, 1, 151936]	Next-token logits (Float16)
Output	`kv_cache_out`	per-layer	Updated KV cache

Model Variants

Variant	Quantization	Size	Compute	HuggingFace
Qwen3.5-0.8B Chat	INT4	418 MB	Metal GPU (MLX)	aufklarer/Qwen3.5-0.8B-Chat-MLX
Qwen3.5-0.8B Chat	INT8	981 MB	Neural Engine (CoreML)	aufklarer/Qwen3.5-0.8B-Chat-CoreML

Sampling Configuration

let config = ChatSamplingConfig(
    temperature: 0.7,
    topK: 40,
    topP: 0.9,
    maxTokens: 128,
    repetitionPenalty: 1.1
)
let response = try chat.generate(
    messages: [ChatMessage(role: .user, content: "Explain gravity")],
    sampling: config
)

Parameter	Default	Description
`temperature`	0.6	Randomness (0 = greedy, 1 = creative)
`topK`	50	Keep the top K candidates
`topP`	0.95	Nucleus sampling threshold
`maxTokens`	512	Maximum number of response tokens
`repetitionPenalty`	1.1	Penalize repeated tokens
`disableThinking`	false	Skip thinking mode
`maxThinkingTokens`	100	Limit on thinking tokens
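
The sketch below shows one common way these parameters combine when turning logits into a sampling distribution. It is not the library's implementation: the order of operations (repetition penalty, then temperature, then top-k, then top-p) is an assumption based on widespread convention, and `filteredDistribution` and `sampleToken` are hypothetical helper names. The defaults mirror the table above.

```swift
import Foundation

/// Returns the (token, probability) pairs that survive filtering,
/// sorted by descending probability and renormalized to sum to 1.
func filteredDistribution(
    logits: [Double],
    previousTokens: [Int] = [],
    temperature: Double = 0.6,
    topK: Int = 50,
    topP: Double = 0.95,
    repetitionPenalty: Double = 1.1
) -> [(token: Int, probability: Double)] {
    var adjusted = logits

    // Repetition penalty: push already-generated tokens toward lower scores.
    for token in Set(previousTokens) where adjusted.indices.contains(token) {
        let value = adjusted[token]
        adjusted[token] = value > 0 ? value / repetitionPenalty : value * repetitionPenalty
    }

    // Temperature 0 is treated as greedy decoding: keep only the argmax.
    guard temperature > 0 else {
        guard let best = adjusted.indices.max(by: { adjusted[$0] < adjusted[$1] }) else { return [] }
        return [(token: best, probability: 1.0)]
    }

    // Temperature scaling followed by a numerically stable softmax.
    let scaled = adjusted.map { $0 / temperature }
    let maxLogit = scaled.max() ?? 0
    let exps = scaled.map { exp($0 - maxLogit) }
    let total = exps.reduce(0.0, +)
    guard total > 0 else { return [] }
    var candidates = exps.enumerated()
        .map { (token: $0.offset, probability: $0.element / total) }
        .sorted { $0.probability > $1.probability }

    // Top-k: keep at most K of the most likely tokens.
    candidates = Array(candidates.prefix(max(1, topK)))

    // Top-p: keep the shortest prefix whose cumulative probability reaches topP.
    var cumulative = 0.0
    var kept: [(token: Int, probability: Double)] = []
    for candidate in candidates {
        kept.append(candidate)
        cumulative += candidate.probability
        if cumulative >= topP { break }
    }

    // Renormalize the survivors.
    let keptTotal = kept.reduce(0.0) { $0 + $1.probability }
    return kept.map { (token: $0.token, probability: $0.probability / keptTotal) }
}

/// Draws one token ID from a normalized distribution.
func sampleToken(from distribution: [(token: Int, probability: Double)]) -> Int? {
    var threshold = Double.random(in: 0..<1)
    for entry in distribution {
        threshold -= entry.probability
        if threshold < 0 { return entry.token }
    }
    return distribution.last?.token
}

// Toy 4-token vocabulary; token 0 was already generated once.
let toyLogits: [Double] = [2.0, 1.0, 0.5, -1.0]
let survivors = filteredDistribution(logits: toyLogits, previousTokens: [0], topK: 2)
let greedy = filteredDistribution(logits: toyLogits, temperature: 0)
print(survivors, sampleToken(from: survivors) as Any)
```

With `topK: 2`, at most two candidates survive regardless of `topP`, and a temperature of 0 collapses the distribution to the single highest-scoring token.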

Multi-turn Conversation

let chat = try await Qwen35MLXChat.fromPretrained()

let history = [
    ChatMessage(role: .system, content: "Remember the user's name."),
    ChatMessage(role: .user, content: "My name is Alex"),
    ChatMessage(role: .assistant, content: "Nice to meet you, Alex!"),
    ChatMessage(role: .user, content: "What's my name?")
]
let response = try chat.generate(messages: history)
print(response)  // e.g. "Your name is Alex!"

chat.resetState()  // Clear inference state for a new conversation

Memory Management

// Check memory state
print(chat.isLoaded)        // true
print(chat.memoryFootprint) // 438304768 (~418 MB)

// Free memory under pressure
chat.unload()
print(chat.isLoaded)        // false

// Reload when needed (new binding, because `chat` is already declared in this scope)
let reloadedChat = try await Qwen35MLXChat.fromPretrained()

iOS Memory Tip

On iPhone, unloading the LLM before TTS inference frees ~418 MB (INT4 MLX) or ~981 MB (INT8 CoreML), which prevents jetsam termination when running the full ASR → LLM → TTS pipeline.
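
The ordering described in this tip can be sketched as follows. `UnloadableModel`, `MockChatModel`, and `runTTS(afterUnloading:synthesize:)` are hypothetical names introduced for illustration; only `isLoaded`, `memoryFootprint`, and `unload()` mirror members shown in the memory management example, and their exact types are assumptions. The mock stands in for the real model so the sketch runs on its own.

```swift
// Hypothetical protocol mirroring the members used in the memory management example.
protocol UnloadableModel: AnyObject {
    var isLoaded: Bool { get }
    var memoryFootprint: Int { get }  // bytes; the exact type is an assumption
    func unload()
}

/// Stand-in for the real chat model (assumed ~418 MB INT4 MLX footprint).
final class MockChatModel: UnloadableModel {
    private(set) var isLoaded = true
    var memoryFootprint: Int { isLoaded ? 418 * 1024 * 1024 : 0 }
    func unload() { isLoaded = false }
}

/// Unloads the LLM first, then runs the TTS step.
/// Returns the number of bytes released by the unload.
@discardableResult
func runTTS(afterUnloading llm: UnloadableModel, synthesize: () throws -> Void) rethrows -> Int {
    let freed = llm.isLoaded ? llm.memoryFootprint : 0
    llm.unload()
    try synthesize()
    return freed
}

let llm = MockChatModel()
let freedBytes = runTTS(afterUnloading: llm) {
    print("Synthesizing speech with the LLM unloaded...")
}
print("Freed \(freedBytes / (1024 * 1024)) MB before TTS")
```

In a real pipeline the model would need to be reloaded before the next LLM turn, so this trades reload latency for peak memory headroom.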

Performance

Device	Prefill	Decode	Tokens/sec
M2 Max	~50ms	~65ms/tok	~15 tok/s
iPhone 16 Pro	~1.5s	~450ms/tok	~2.2 tok/s
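
As a rough worked example, the figures in this table can be combined into an end-to-end latency estimate: prefill time plus decode time per token multiplied by the number of generated tokens. This is a back-of-the-envelope sketch that treats prefill as a fixed cost, which is an assumption; in practice prefill depends on prompt length, which the table does not state. `DeviceTiming` is a hypothetical helper type.

```swift
import Foundation

/// Hypothetical helper that holds one row of the performance table.
struct DeviceTiming {
    let name: String
    let prefillSeconds: Double
    let decodeSecondsPerToken: Double

    /// Estimated total time: prefill + tokens * per-token decode time.
    func estimatedSeconds(forTokens tokenCount: Int) -> Double {
        prefillSeconds + Double(tokenCount) * decodeSecondsPerToken
    }
}

let m2Max = DeviceTiming(name: "M2 Max", prefillSeconds: 0.05, decodeSecondsPerToken: 0.065)
let iPhone16Pro = DeviceTiming(name: "iPhone 16 Pro", prefillSeconds: 1.5, decodeSecondsPerToken: 0.45)

for device in [m2Max, iPhone16Pro] {
    let seconds = device.estimatedSeconds(forTokens: 128)
    let formatted = String(format: "%.1f", seconds)
    print("\(device.name): ~\(formatted) s for 128 tokens")
}
```

For a 128-token reply this gives roughly 8.4 s on M2 Max and 59.1 s on iPhone 16 Pro, which suggests keeping `maxTokens` small for voice interactions on iPhone.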

Conversion

The MLX weights are converted from the original Qwen3.5-0.8B checkpoint using the MLX conversion script. The CoreML models use a separate conversion script for Neural Engine deployment. Pre-converted weights are available on HuggingFace at aufklarer/Qwen3.5-0.8B-Chat-MLX (INT4: 418 MB) and aufklarer/Qwen3.5-0.8B-Chat-CoreML (INT8: 981 MB).