Speech Server

speech-server is a local HTTP + WebSocket server that exposes every Soniqo model through a small REST API and serves an OpenAI Realtime API-compatible WebSocket at /v1/realtime. It ships in the same Homebrew bottle as the speech CLI, so running brew install speech puts both on your PATH.

Install and run

brew install speech

speech-server --port 8080
# Starting server on http://127.0.0.1:8080
# Endpoints:
#   POST /transcribe  - Speech-to-text (WAV body or JSON with audio_base64)
#   POST /speak       - Text-to-speech (JSON: {text, engine?, language?})
#   POST /respond     - Speech-to-speech (WAV body, voice/max_steps via query)
#   POST /enhance     - Speech enhancement (WAV body)
#   GET  /health      - Health check
#   WS   /v1/realtime - OpenAI Realtime API (JSON events, base64 PCM16 audio)

Command-line options

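Option	Default	Description
`--host`	`127.0.0.1`	Bind address. Set to `0.0.0.0` to expose the server on the local network.
`--port`	`8080`	TCP port.
`--preload`	off	Load all models immediately at startup. Startup is slower (roughly 30–60 seconds), but the first request pays no model-loading delay.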

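Models are downloaded on first use and cached in ~/Library/Caches/qwen3-speech/. The first request to a given model pays the download + load cost (roughly 30 seconds to 2 minutes, depending on model size); subsequent requests are warm.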

REST endpoints

`POST /transcribe`: speech-to-text

Accepts either a raw WAV request body or a JSON wrapper carrying base64-encoded audio.

# WAV body (preferred — lower overhead)
curl -X POST http://127.0.0.1:8080/transcribe \
  -H "Content-Type: audio/wav" \
  --data-binary @recording.wav

# JSON with base64
curl -X POST http://127.0.0.1:8080/transcribe \
  -H "Content-Type: application/json" \
  -d '{"audio_base64":"'"$(base64 -i recording.wav)"'","language":"en"}'

Response: {"text": "…", "language": "en", "confidence": 0.93}.
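The raw-WAV variant above can also be called from Python. This is a minimal sketch using only the standard library; the helper names (`build_transcribe_request`, `transcribe`) and the 180-second timeout are illustrative assumptions, not part of speech-server. The timeout is deliberately generous because the first request may include the model download and load.

```python
import json
import urllib.request

# Assumption: the server runs with the default --host and --port.
BASE_URL = "http://127.0.0.1:8080"


def build_transcribe_request(wav_bytes: bytes, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST /transcribe request that carries a raw WAV body."""
    return urllib.request.Request(
        f"{base_url}/transcribe",
        data=wav_bytes,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )


def transcribe(wav_path: str, timeout: float = 180.0) -> dict:
    """Send a WAV file to the server and return the parsed JSON response."""
    with open(wav_path, "rb") as f:
        req = build_transcribe_request(f.read())
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `transcribe("recording.wav")["text"]` would return the recognized text, going by the response shape shown above.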

`POST /speak`: text-to-speech

curl -X POST http://127.0.0.1:8080/speak \
  -H "Content-Type: application/json" \
  -d '{"text":"Hello, world!","engine":"kokoro","language":"en"}' \
  --output hello.wav

The response body is WAV data. Supported engine values: qwen3 (default), cosyvoice, kokoro.
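A Python equivalent of the curl call might look like the sketch below. The JSON fields (`text`, `engine`, `language`) come from the endpoint description; the helper names and the timeout value are hypothetical. Optional fields are omitted from the payload when not set, so the server falls back to its own defaults.

```python
import json
import urllib.request
from typing import Optional

# Assumption: the server runs with the default --host and --port.
BASE_URL = "http://127.0.0.1:8080"


def build_speak_request(text: str, engine: Optional[str] = None,
                        language: Optional[str] = None,
                        base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST /speak request; engine and language are sent only if given."""
    payload = {"text": text}
    if engine is not None:
        payload["engine"] = engine
    if language is not None:
        payload["language"] = language
    return urllib.request.Request(
        f"{base_url}/speak",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def speak_to_file(text: str, out_path: str, timeout: float = 180.0, **options) -> None:
    """Synthesize text and write the returned WAV bytes to out_path."""
    req = build_speak_request(text, **options)
    with urllib.request.urlopen(req, timeout=timeout) as resp, open(out_path, "wb") as out:
        out.write(resp.read())
```

For example, `speak_to_file("Hello, world!", "hello.wav", engine="kokoro", language="en")` mirrors the curl invocation above.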

`POST /respond`: speech-to-speech

curl -X POST "http://127.0.0.1:8080/respond?voice=en_female_calm&max_steps=256" \
  -H "Content-Type: audio/wav" \
  --data-binary @question.wav \
  --output answer.wav

Runs PersonaPlex 7B: speech in, speech out. The transcript is returned in the X-Response-Text response header. See the PersonaPlex guide for voice preset names.
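Since the transcript arrives in a header rather than in the body, a client has to read both. The sketch below shows one way to do that in Python; the function names and the 300-second timeout are assumptions for illustration, and the header is read as-is because the page does not say how non-ASCII text is encoded in it.

```python
import urllib.parse
import urllib.request
from typing import Optional, Tuple

# Assumption: the server runs with the default --host and --port.
BASE_URL = "http://127.0.0.1:8080"


def build_respond_url(voice: str, max_steps: int, base_url: str = BASE_URL) -> str:
    """Encode voice and max_steps as query parameters for POST /respond."""
    query = urllib.parse.urlencode({"voice": voice, "max_steps": max_steps})
    return f"{base_url}/respond?{query}"


def respond(wav_path: str, voice: str, max_steps: int,
            timeout: float = 300.0) -> Tuple[bytes, Optional[str]]:
    """Return (answer audio bytes, transcript from X-Response-Text or None)."""
    with open(wav_path, "rb") as f:
        req = urllib.request.Request(
            build_respond_url(voice, max_steps),
            data=f.read(),
            headers={"Content-Type": "audio/wav"},
            method="POST",
        )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        transcript = resp.headers.get("X-Response-Text")
        return resp.read(), transcript
```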

`POST /enhance`: speech enhancement

curl -X POST http://127.0.0.1:8080/enhance \
  -H "Content-Type: audio/wav" \
  --data-binary @noisy.wav \
  --output clean.wav

DeepFilterNet3, 48 kHz. Input is resampled when necessary.

`GET /health`: liveness probe

curl http://127.0.0.1:8080/health
# {"status":"ok"}
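A script that launches speech-server and then sends requests may want to wait until the probe answers. The following is a minimal polling sketch; the function names, timeout, and interval are assumptions, and it only checks for the `{"status":"ok"}` body shown above. Whether /health responds before `--preload` has finished is not stated here, so treat a successful probe as "the HTTP server is up", not as "all models are loaded".

```python
import http.client
import json
import time
import urllib.request
from typing import Callable

# Assumption: the server runs with the default --host and --port.
HEALTH_URL = "http://127.0.0.1:8080/health"


def probe_health(url: str = HEALTH_URL, timeout: float = 2.0) -> bool:
    """Return True if GET /health answers with a JSON body whose status is "ok"."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, or a body that is not valid JSON.
        return False
    return isinstance(body, dict) and body.get("status") == "ok"


def wait_until_healthy(probe: Callable[[], bool] = probe_health,
                       timeout: float = 90.0, interval: float = 1.0) -> bool:
    """Poll probe() until it succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```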

WebSocket: `/v1/realtime`

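Drop-in compatible with the OpenAI Realtime API: it uses the same JSON event schema (session.update, input_audio_buffer.append, response.create, response.audio.delta, and so on), and audio is 24 kHz base64-encoded PCM16. Clients written against the OpenAI Realtime SDK can talk to speech-server without code changes beyond swapping the WebSocket URL.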

JavaScript example

const ws = new WebSocket("ws://127.0.0.1:8080/v1/realtime");

ws.onopen = async () => {  // async is required because the handler awaits below
  ws.send(JSON.stringify({
    type: "session.update",
    session: { modalities: ["audio", "text"] }
  }));

  // Stream PCM16 mono 24kHz audio from the mic:
  const audioBase64 = await capturePCM16Chunk();
  ws.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: audioBase64
  }));

  ws.send(JSON.stringify({ type: "response.create" }));
};

ws.onmessage = ev => {
  const msg = JSON.parse(ev.data);
  if (msg.type === "response.audio.delta") {
    playPCM16Base64(msg.delta);
  }
};

Python example

import asyncio, base64, json, wave, websockets

async def main():
    async with websockets.connect("ws://127.0.0.1:8080/v1/realtime") as ws:
        await ws.send(json.dumps({"type": "session.update",
                                  "session": {"modalities": ["audio", "text"]}}))

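        # readframes() returns the raw sample bytes as stored; this assumes
        # question.wav is already 24 kHz mono PCM16, since nothing is resampled here.
        with wave.open("question.wav", "rb") as wav:
            pcm16 = wav.readframes(wav.getnframes())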
        await ws.send(json.dumps({"type": "input_audio_buffer.append",
                                  "audio": base64.b64encode(pcm16).decode()}))
        await ws.send(json.dumps({"type": "response.create"}))

        # Open the output once instead of reopening it for every delta.
        with open("answer.pcm", "ab") as out:
            async for raw in ws:
                msg = json.loads(raw)
                if msg["type"] == "response.audio.delta":
                    out.write(base64.b64decode(msg["delta"]))
                elif msg["type"] == "response.done":
                    break

asyncio.run(main())
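The example above writes headerless PCM16 to answer.pcm, which most players will not open directly. A small sketch for wrapping those bytes in a WAV container with the standard `wave` module follows. The 24 kHz rate comes from the protocol description; mono output is an assumption, and `pcm16_to_wav_bytes` is a hypothetical helper name.

```python
import io
import wave


def pcm16_to_wav_bytes(pcm: bytes, sample_rate: int = 24000, channels: int = 1) -> bytes:
    """Wrap raw little-endian PCM16 samples in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)   # assumption: the server sends mono audio
        wav.setsampwidth(2)          # PCM16 means 2 bytes per sample
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()
```

To convert the file produced above, read answer.pcm in binary mode, pass the bytes to `pcm16_to_wav_bytes`, and write the result to answer.wav.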

Deployment notes

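No authentication by default. speech-server binds to 127.0.0.1 and trusts every caller. To expose it beyond localhost, put it behind a reverse proxy (Caddy, nginx, tailscale).
Models load on demand. The first request to /transcribe triggers the ASR model download + load (about 700 MB, 30–60 seconds). Use --preload to eliminate cold starts, at the cost of slower server startup.
No streaming transcription yet. POST /transcribe requires the complete WAV in a single request. For streaming use cases, use /v1/realtime.
Systemd / launchd. No service unit is shipped at the moment. Run nohup speech-server & directly, or use your preferred process manager.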
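For the streaming case mentioned in the notes above, clients of the OpenAI Realtime API typically send audio as a series of small input_audio_buffer.append events rather than one large one. The sketch below splits a PCM16 buffer into such events. The 100 ms chunk size and the name `append_events` are illustrative assumptions, and it is assumed (not confirmed on this page) that speech-server accepts multiple append events before response.create, as the upstream API does.

```python
import base64
import json
from typing import Iterator


def append_events(pcm16: bytes, sample_rate: int = 24000, chunk_ms: int = 100) -> Iterator[str]:
    """Yield JSON input_audio_buffer.append events covering pcm16 in order.

    Chunks are cut on whole-sample boundaries (2 bytes per mono PCM16 sample).
    """
    samples_per_chunk = max(1, sample_rate * chunk_ms // 1000)
    bytes_per_chunk = samples_per_chunk * 2
    for start in range(0, len(pcm16), bytes_per_chunk):
        chunk = pcm16[start:start + bytes_per_chunk]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
```

Each yielded string can be passed directly to the WebSocket send call, followed by a single response.create event once the input is complete.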

Source

Sources/AudioServer: Hummingbird-based HTTP routes, the lazy-loading model registry, and the /v1/realtime WebSocket handler.
Sources/AudioServerCLI: the @main entry point and argument parser.
Upstream: OpenAI Realtime API reference.
