Update Chinese and English documentation for custom board setup and MCP protocol

- Added a comprehensive guide for creating custom boards in the XiaoZhi AI project, detailing directory structure, configuration files, and initialization code.
- Introduced a new document explaining the MCP protocol for IoT control, including message formats and interaction flows.
- Updated existing documentation to reflect changes in tool registration and usage examples for the MCP protocol.
- Enhanced README files for better clarity and consistency across languages.
This commit is contained in:
Terrence
2026-04-17 03:36:37 +08:00
parent 69b1a978e9
commit 87f6faee79
15 changed files with 2910 additions and 919 deletions

View File

@ -1,32 +1,33 @@
以下是一份基于代码实现整理的 WebSocket 通信协议文档,概述设备端与服务器之间如何通过 WebSocket 进行交互。
# WebSocket Communication Protocol
该文档仅基于所提供的代码推断,实际部署时可能需要结合服务器端实现进行进一步确认或补充。
This document describes the WebSocket communication protocol between the device and the server, based on the current code. When implementing a server, please cross-check with the actual implementation.
---
## 1. 总体流程概览
## 1. Overall Flow
1. **设备端初始化**
- 设备上电、初始化 `Application`
- 初始化音频编解码器、显示屏、LED 等
- 连接网络
- 创建并初始化实现 `Protocol` 接口的 WebSocket 协议实例(`WebsocketProtocol`
- 进入主循环等待事件(音频输入、音频输出、调度任务等)。
1. **Device initialization**
- The device boots and initializes `Application`:
- Initializes the audio codec, display, LEDs, etc.
- Connects to the network.
- Creates a WebSocket protocol instance (`WebsocketProtocol`) that implements the `Protocol` interface.
- Enters the main loop and waits for events (audio input, audio output, scheduled tasks, etc.).
2. **建立 WebSocket 连接**
- 当设备需要开始语音会话时(例如用户唤醒、手动按键触发等),调用 `OpenAudioChannel()`
- 根据配置获取 WebSocket URL
- 设置若干请求头(`Authorization`, `Protocol-Version`, `Device-Id`, `Client-Id`
- 调用 `Connect()` 与服务器建立 WebSocket 连接
2. **Opening the WebSocket connection**
- When the device needs to start a voice session (wake-up, button press, etc.), it calls `OpenAudioChannel()`:
- Reads the WebSocket URL from settings.
- Sets the request headers (`Authorization`, `Protocol-Version`, `Device-Id`, `Client-Id`).
- Calls `Connect()` to establish the WebSocket connection.
3. **设备端发送 "hello" 消息**
- 连接成功后,设备会发送一条 JSON 消息,示例结构如下:
3. **Device sends a "hello" message**
- Once connected, the device sends a JSON message. Example:
```json
{
"type": "hello",
"version": 1,
"features": {
"mcp": true
"mcp": true,
"aec": true
},
"transport": "websocket",
"audio_params": {
@ -37,13 +38,13 @@
}
}
```
- 其中 `features` 字段为可选,内容根据设备编译配置自动生成。例如:`"mcp": true` 表示支持 MCP 协议。
- `frame_duration` 的值对应 `OPUS_FRAME_DURATION_MS`(例如 60ms)。
- `features` is optional and generated from compile-time configuration. For example, `"mcp": true` means the device supports MCP, and `"aec": true` is emitted when `CONFIG_USE_SERVER_AEC` is enabled.
- `frame_duration` matches `OPUS_FRAME_DURATION_MS` (typically 60 ms).
4. **服务器回复 "hello"**
- 设备等待服务器返回一条包含 `"type": "hello"` 的 JSON 消息,并检查 `"transport": "websocket"` 是否匹配。
- 服务器可选下发 `session_id` 字段,设备端收到后会自动记录。
- 示例:
4. **Server replies with "hello"**
- The device waits for a JSON message whose `"type"` is `"hello"` and whose `"transport"` is `"websocket"`.
- The server may include a `session_id`; the device will store it.
- Example:
```json
{
"type": "hello",
@ -57,89 +58,90 @@
}
}
```
- 如果匹配,则认为服务器已就绪,标记音频通道打开成功。
- 如果在超时时间(默认 10 秒)内未收到正确回复,认为连接失败并触发网络错误回调。
- If `transport` matches, the device marks the audio channel as opened.
- If no valid hello arrives within the timeout (default 10 seconds), the connection is considered failed and the network error callback is fired.
5. **后续消息交互**
- 设备端和服务器端之间可发送两种主要类型的数据:
1. **二进制音频数据**Opus 编码)
2. **文本 JSON 消息**用于传输聊天状态、TTS/STT 事件、MCP 协议消息等)
5. **Subsequent exchanges**
- Two kinds of data are sent in either direction:
1. **Binary audio data** (Opus encoded)
2. **Text JSON messages** (chat state, TTS/STT events, MCP messages, etc.)
- 在代码里,接收回调主要分为:
- `OnData(...)`:
- `binary` `true` 时,认为是音频帧;设备会将其当作 Opus 数据进行解码。
- `binary` `false` 时,认为是 JSON 文本,需要在设备端用 cJSON 进行解析并做相应业务逻辑处理如聊天、TTS、MCP 协议消息等)。
- In the code, the receive callback splits traffic as follows:
- `OnData(...)`:
- If `binary` is `true`, the payload is treated as an Opus frame and decoded.
- If `binary` is `false`, the payload is parsed as JSON and dispatched by `type`.
- 当服务器或网络出现断连,回调 `OnDisconnected()` 被触发:
- 设备会调用 `on_audio_channel_closed_()`,并最终回到空闲状态。
- When the server or network drops, `OnDisconnected()` fires:
- The device invokes `on_audio_channel_closed_()` and eventually returns to the idle state.
6. **关闭 WebSocket 连接**
- 设备在需要结束语音会话时,会调用 `CloseAudioChannel()` 主动断开连接,并回到空闲状态。
- 或者如果服务器端主动断开,也会引发同样的回调流程。
6. **Closing the WebSocket connection**
- When the device wants to end the session, it calls `CloseAudioChannel()` to tear down the socket and returns to idle.
- The same callback chain runs if the server closes the socket first.
---
## 2. 通用请求头
## 2. Common Request Headers
在建立 WebSocket 连接时,代码示例中设置了以下请求头:
When establishing the WebSocket connection, the device sets the following headers:
- `Authorization`: 用于存放访问令牌,形如 `"Bearer <token>"`
- `Protocol-Version`: 协议版本号,与 hello 消息体内的 `version` 字段保持一致
- `Device-Id`: 设备物理网卡 MAC 地址
- `Client-Id`: 软件生成的 UUID擦除 NVS 或重新烧录完整固件会重置)
- `Authorization`: access token, usually formatted as `"Bearer <token>"`.
- `Protocol-Version`: the protocol version number, matching the `version` field in the hello message.
- `Device-Id`: the physical MAC address of the device.
- `Client-Id`: a software-generated UUID (reset when NVS is erased or the full firmware is re-flashed).
这些头会随着 WebSocket 握手一起发送到服务器,服务器可根据需求进行校验、认证等。
These headers are sent with the WebSocket handshake; the server can use them for authentication or bookkeeping.
---
## 3. 二进制协议版本
## 3. Binary Protocol Versions
设备支持多种二进制协议版本,通过配置中的 `version` 字段指定:
The device supports several binary protocol versions, selected by the `version` field in settings:
### 3.1 版本1默认
直接发送 Opus 音频数据,无额外元数据。Websocket 协议会区分 text binary
### 3.1 Version 1 (default)
Raw Opus frames with no extra metadata. The WebSocket layer already distinguishes text and binary frames.
### 3.2 版本2
使用 `BinaryProtocol2` 结构:
### 3.2 Version 2
Uses the `BinaryProtocol2` structure:
```c
struct BinaryProtocol2 {
uint16_t version; // 协议版本
uint16_t type; // 消息类型 (0: OPUS, 1: JSON)
uint32_t reserved; // 保留字段
uint32_t timestamp; // 时间戳(毫秒,用于服务器端AEC
uint32_t payload_size; // 负载大小(字节)
uint8_t payload[]; // 负载数据
uint16_t version; // protocol version
uint16_t type; // message type (0: OPUS, 1: JSON)
uint32_t reserved; // reserved
uint32_t timestamp; // timestamp in milliseconds (useful for server-side AEC)
uint32_t payload_size; // payload size in bytes
uint8_t payload[]; // payload
} __attribute__((packed));
```
### 3.3 版本3
使用 `BinaryProtocol3` 结构:
### 3.3 Version 3
Uses the `BinaryProtocol3` structure:
```c
struct BinaryProtocol3 {
uint8_t type; // 消息类型
uint8_t reserved; // 保留字段
uint16_t payload_size; // 负载大小
uint8_t payload[]; // 负载数据
uint8_t type; // message type
uint8_t reserved; // reserved
uint16_t payload_size; // payload size
uint8_t payload[]; // payload
} __attribute__((packed));
```
---
## 4. JSON 消息结构
## 4. JSON Message Structure
WebSocket 文本帧以 JSON 方式传输,以下为常见的 `"type"` 字段及其对应业务逻辑。若消息里包含未列出的字段,可能为可选或特定实现细节。
WebSocket text frames carry JSON. The most common `"type"` values and their semantics are listed below. Fields that are not listed may be implementation-specific or optional.
### 4.1 设备端→服务器
### 4.1 Device -> Server
1. **Hello**
- 连接成功后,由设备端发送,告知服务器基本参数。
- 例:
1. **Hello**
- Sent once the connection is established; announces the device parameters.
- Example:
```json
{
"type": "hello",
"version": 1,
"features": {
"mcp": true
"mcp": true,
"aec": true
},
"transport": "websocket",
"audio_params": {
@ -151,14 +153,14 @@ WebSocket 文本帧以 JSON 方式传输,以下为常见的 `"type"` 字段及
}
```
2. **Listen**
- 表示设备端开始或停止录音监听。
- 常见字段:
- `"session_id"`:会话标识
- `"type": "listen"`
- `"state"``"start"`, `"stop"`, `"detect"`(唤醒检测已触发)
- `"mode"``"auto"`, `"manual"` `"realtime"`,表示识别模式。
- 例:开始监听
2. **Listen**
- Tells the server that the device is starting or stopping microphone capture.
- Common fields:
- `"session_id"`: session identifier.
- `"type": "listen"`
- `"state"`: `"start"`, `"stop"`, or `"detect"` (wake word detected).
- `"mode"`: `"auto"`, `"manual"`, or `"realtime"`.
- Example (start listening):
```json
{
"session_id": "xxx",
@ -168,9 +170,9 @@ WebSocket 文本帧以 JSON 方式传输,以下为常见的 `"type"` 字段及
}
```
3. **Abort**
- 终止当前说话TTS 播放)或语音通道。
- 例:
3. **Abort**
- Aborts the current TTS playback or the voice channel.
- Example:
```json
{
"session_id": "xxx",
@ -178,25 +180,24 @@ WebSocket 文本帧以 JSON 方式传输,以下为常见的 `"type"` 字段及
"reason": "wake_word_detected"
}
```
- `reason` 值可为 `"wake_word_detected"` 或其他。
- `reason` may be `"wake_word_detected"` or other implementation-defined values.
4. **Wake Word Detected**
- 用于设备端向服务器告知检测到唤醒词。
- 在发送该消息之前,可提前发送唤醒词的 Opus 音频数据,用于服务器进行声纹检测。
- 例:
4. **Wake Word Detected**
- Sent by the device when the local wake word detector fires.
- Opus audio containing the wake word may be streamed before this message to let the server run voice-print verification.
- Example:
```json
{
"session_id": "xxx",
"type": "listen",
"state": "detect",
"text": "你好小明"
"text": "Hi XiaoZhi"
}
```
5. **MCP**
- 推荐用于物联网控制的新一代协议。所有设备能力发现、工具调用等均通过 type: "mcp" 的消息进行,payload 内部为标准 JSON-RPC 2.0(详见 [MCP 协议文档](./mcp-protocol.md))。
- **设备端到服务器发送 result 的例子:**
- The recommended channel for IoT control. Device capability discovery and tool invocation all flow through `type: "mcp"` messages whose `payload` is JSON-RPC 2.0 (see [MCP protocol document](./mcp-protocol.md)).
- Device-to-server response example:
```json
{
"session_id": "xxx",
@ -216,34 +217,31 @@ WebSocket 文本帧以 JSON 方式传输,以下为常见的 `"type"` 字段及
---
### 4.2 服务器→设备端
### 4.2 Server -> Device
1. **Hello**
- 服务器端返回的握手确认消息。
- 必须包含 `"type": "hello"` `"transport": "websocket"`
- 可能会带有 `audio_params`,表示服务器期望的音频参数,或与设备端对齐的配置。
- 服务器可选下发 `session_id` 字段,设备端收到后会自动记录。
- 成功接收后设备端会设置事件标志,表示 WebSocket 通道就绪。
1. **Hello**
- The handshake acknowledgement.
- Must include `"type": "hello"` and `"transport": "websocket"`.
- May include `audio_params`, meaning the audio parameters the server expects / the canonical set agreed with the device.
- May include a `session_id` which the device records.
- Once received, the device sets the "audio channel open" event.
2. **STT**
2. **STT**
- `{"session_id": "xxx", "type": "stt", "text": "..."}`
- 表示服务器端识别到了用户语音。(例如语音转文本结果)
- 设备可能将此文本显示到屏幕上,后续再进入回答等流程。
- The speech-to-text result for the user utterance. Typically shown on the display before moving to the response.
3. **LLM**
3. **LLM**
- `{"session_id": "xxx", "type": "llm", "emotion": "happy", "text": "😀"}`
- 服务器指示设备调整表情动画 / UI 表达。
- Tells the device to update the emotion / facial expression on the UI.
4. **TTS**
- `{"session_id": "xxx", "type": "tts", "state": "start"}`:服务器准备下发 TTS 音频,设备端进入 "speaking" 播放状态。
- `{"session_id": "xxx", "type": "tts", "state": "stop"}`:表示本次 TTS 结束。
- `{"session_id": "xxx", "type": "tts", "state": "sentence_start", "text": "..."}`
- 让设备在界面上显示当前要播放或朗读的文本片段(例如用于显示给用户)。
4. **TTS**
- `{"session_id": "xxx", "type": "tts", "state": "start"}`: the server is about to stream TTS audio. The device transitions to the speaking state.
- `{"session_id": "xxx", "type": "tts", "state": "stop"}`: the TTS segment is finished.
- `{"session_id": "xxx", "type": "tts", "state": "sentence_start", "text": "..."}`: show the current sentence on the UI (for example, subtitle display).
5. **MCP**
- 服务器通过 type: "mcp" 的消息下发物联网相关的控制指令或返回调用结果payload 结构同上。
- **服务器到设备端发送 tools/call 的例子:**
- The server sends IoT-related commands or receives tool-call results. The `payload` structure follows JSON-RPC 2.0.
- Server-to-device `tools/call` example:
```json
{
"session_id": "xxx",
@ -261,8 +259,8 @@ WebSocket 文本帧以 JSON 方式传输,以下为常见的 `"type"` 字段及
```
6. **System**
- 系统控制命令,常用于远程升级更新。
- 例:
- System-level control, often used for remote upgrades / management.
- Example:
```json
{
"session_id": "xxx",
@ -270,153 +268,190 @@ WebSocket 文本帧以 JSON 方式传输,以下为常见的 `"type"` 字段及
"command": "reboot"
}
```
- 支持的命令:
- `"reboot"`:重启设备
- Supported commands:
- `"reboot"`: reboot the device.
7. **Custom**(可选)
- 自定义消息,当 `CONFIG_RECEIVE_CUSTOM_MESSAGE` 启用时支持。
- 例:
7. **Alert**
- Instructs the device to show an alert and play a vibration sound. Handled in `Application::OnIncomingJson`.
- Example:
```json
{
"session_id": "xxx",
"type": "alert",
"status": "Warning",
"message": "Battery low",
"emotion": "sad"
}
```
- Fields:
- `status`: short title displayed on screen.
- `message`: detailed message.
- `emotion`: emotion shown while alerting (e.g. `"sad"`, `"neutral"`).
8. **Custom** (optional)
- Available when `CONFIG_RECEIVE_CUSTOM_MESSAGE` is enabled.
- Example:
```json
{
"session_id": "xxx",
"type": "custom",
"payload": {
"message": "自定义内容"
"message": "anything you want"
}
}
```
8. **音频数据:二进制帧**
- 当服务器发送音频二进制帧Opus 编码)时,设备端解码并播放。
- 若设备端正在处于 "listening" (录音)状态,收到的音频帧会被忽略或清空以防冲突。
9. **Binary audio frames**
- When the server pushes Opus-encoded audio as binary frames, the device decodes and plays them.
- Frames received while the device is in the `listening` state are dropped to avoid conflicts with the microphone stream.
---
## 5. 音频编解码
## 5. Audio Codec
1. **设备端发送录音数据**
- 音频输入经过可能的回声消除、降噪或音量增益后,通过 Opus 编码打包为二进制帧发送给服务器。
- 根据协议版本,可能直接发送 Opus 数据版本1或使用带元数据的二进制协议版本2/3
1. **Device uploads microphone audio**
- After optional AEC / NR / AGC processing, the audio is Opus-encoded and sent as binary frames.
- Depending on the protocol version, the frames may be raw Opus (v1) or wrapped in the metadata structures (v2/v3).
2. **设备端播放收到的音频**
- 收到服务器的二进制帧时,同样认定是 Opus 数据。
- 设备端会进行解码,然后交由音频输出接口播放。
- 如果服务器的音频采样率与设备不一致,会在解码后再进行重采样。
2. **Device plays server audio**
- Incoming binary frames are also treated as Opus.
- The device decodes and sends them to the audio output.
- If the sample rate differs from the device's output, it is resampled after decoding.
---
## 6. 常见状态流转
## 6. Device States
以下为常见设备端关键状态流转,与 WebSocket 消息对应:
### 6.1 Main states
1. **Idle** → **Connecting**
- 用户触发或唤醒后,设备调用 `OpenAudioChannel()` → 建立 WebSocket 连接 → 发送 `"type":"hello"`。
The device state machine is defined in [`main/device_state.h`](../main/device_state.h) and includes:
2. **Connecting** → **Listening**
- 成功建立连接后,若继续执行 `SendStartListening(...)`,则进入录音状态。此时设备会持续编码麦克风数据并发送到服务器。
- `kDeviceStateUnknown`
- `kDeviceStateStarting`
- `kDeviceStateWifiConfiguring`
- `kDeviceStateIdle`
- `kDeviceStateConnecting`
- `kDeviceStateListening`
- `kDeviceStateSpeaking`
- `kDeviceStateUpgrading`
- `kDeviceStateActivating`
- `kDeviceStateAudioTesting` (factory / bring-up audio testing)
- `kDeviceStateFatalError` (non-recoverable error requiring user action)
3. **Listening** → **Speaking**
- 收到服务器 TTS Start 消息 (`{"type":"tts","state":"start"}`) → 停止录音并播放接收到的音频。
### 6.2 Typical transitions
4. **Speaking** → **Idle**
- 服务器 TTS Stop (`{"type":"tts","state":"stop"}`) → 音频播放结束。若未继续进入自动监听,则返回 Idle如果配置了自动循环则再度进入 Listening。
1. **Idle -> Connecting**
- Triggered by wake word or button press. The device calls `OpenAudioChannel()`, sets up the WebSocket, and sends `"type":"hello"`.
5. **Listening** / **Speaking** → **Idle**(遇到异常或主动中断)
- 调用 `SendAbortSpeaking(...)` 或 `CloseAudioChannel()` → 中断会话 → 关闭 WebSocket → 状态回到 Idle。
2. **Connecting -> Listening**
- Once connected, `SendStartListening(...)` is called and microphone streaming begins.
### 自动模式状态流转图
3. **Listening -> Speaking**
- Server sends `{"type":"tts","state":"start"}`; the device stops sending mic audio and plays incoming TTS.
4. **Speaking -> Idle**
- Server sends `{"type":"tts","state":"stop"}`. When auto-continue is enabled the device transitions back to Listening; otherwise it returns to Idle.
5. **Listening / Speaking -> Idle** (abort)
- `SendAbortSpeaking(...)` or `CloseAudioChannel()` interrupts the session and closes the WebSocket.
### 6.3 Auto-mode state diagram
```mermaid
stateDiagram
direction TB
[*] --> kDeviceStateUnknown
kDeviceStateUnknown --> kDeviceStateStarting:初始化
kDeviceStateStarting --> kDeviceStateWifiConfiguring:配置WiFi
kDeviceStateStarting --> kDeviceStateActivating:激活设备
kDeviceStateActivating --> kDeviceStateUpgrading:检测到新版本
kDeviceStateActivating --> kDeviceStateIdle:激活完成
kDeviceStateIdle --> kDeviceStateConnecting:开始连接
kDeviceStateConnecting --> kDeviceStateIdle:连接失败
kDeviceStateConnecting --> kDeviceStateListening:连接成功
kDeviceStateListening --> kDeviceStateSpeaking:开始说话
kDeviceStateSpeaking --> kDeviceStateListening:结束说话
kDeviceStateListening --> kDeviceStateIdle:手动终止
kDeviceStateSpeaking --> kDeviceStateIdle:自动终止
kDeviceStateUnknown --> kDeviceStateStarting: Initialize
kDeviceStateStarting --> kDeviceStateWifiConfiguring: Configure WiFi
kDeviceStateStarting --> kDeviceStateActivating: Activate device
kDeviceStateActivating --> kDeviceStateUpgrading: New firmware detected
kDeviceStateActivating --> kDeviceStateIdle: Activation complete
kDeviceStateIdle --> kDeviceStateConnecting: Start connecting
kDeviceStateConnecting --> kDeviceStateIdle: Connection failed
kDeviceStateConnecting --> kDeviceStateListening: Connection succeeded
kDeviceStateListening --> kDeviceStateSpeaking: TTS start
kDeviceStateSpeaking --> kDeviceStateListening: TTS stop
kDeviceStateListening --> kDeviceStateIdle: Manual abort
kDeviceStateSpeaking --> kDeviceStateIdle: Auto stop
kDeviceStateStarting --> kDeviceStateAudioTesting: Factory audio test
kDeviceStateStarting --> kDeviceStateFatalError: Fatal error
```
### 手动模式状态流转图
### 6.4 Manual-mode state diagram
```mermaid
stateDiagram
direction TB
[*] --> kDeviceStateUnknown
kDeviceStateUnknown --> kDeviceStateStarting:初始化
kDeviceStateStarting --> kDeviceStateWifiConfiguring:配置WiFi
kDeviceStateStarting --> kDeviceStateActivating:激活设备
kDeviceStateActivating --> kDeviceStateUpgrading:检测到新版本
kDeviceStateActivating --> kDeviceStateIdle:激活完成
kDeviceStateIdle --> kDeviceStateConnecting:开始连接
kDeviceStateConnecting --> kDeviceStateIdle:连接失败
kDeviceStateConnecting --> kDeviceStateListening:连接成功
kDeviceStateIdle --> kDeviceStateListening:开始监听
kDeviceStateListening --> kDeviceStateIdle:停止监听
kDeviceStateIdle --> kDeviceStateSpeaking:开始说话
kDeviceStateSpeaking --> kDeviceStateIdle:结束说话
kDeviceStateUnknown --> kDeviceStateStarting: Initialize
kDeviceStateStarting --> kDeviceStateWifiConfiguring: Configure WiFi
kDeviceStateStarting --> kDeviceStateActivating: Activate device
kDeviceStateActivating --> kDeviceStateUpgrading: New firmware detected
kDeviceStateActivating --> kDeviceStateIdle: Activation complete
kDeviceStateIdle --> kDeviceStateConnecting: Start connecting
kDeviceStateConnecting --> kDeviceStateIdle: Connection failed
kDeviceStateConnecting --> kDeviceStateListening: Connection succeeded
kDeviceStateIdle --> kDeviceStateListening: Start listening
kDeviceStateListening --> kDeviceStateIdle: Stop listening
kDeviceStateIdle --> kDeviceStateSpeaking: Start speaking
kDeviceStateSpeaking --> kDeviceStateIdle: Stop speaking
```
---
## 7. 错误处理
## 7. Error Handling
1. **连接失败**
- 如果 `Connect(url)` 返回失败或在等待服务器 "hello" 消息时超时,触发 `on_network_error_()` 回调。设备会提示"无法连接到服务"或类似错误信息。
1. **Connection failure**
- If `Connect(url)` fails or the server hello is not received before the timeout, `on_network_error_()` is invoked and the device shows a "cannot connect" alert.
2. **服务器断开**
- 如果 WebSocket 异常断开,回调 `OnDisconnected()`
- 设备回调 `on_audio_channel_closed_()`
- 切换到 Idle 或其他重试逻辑。
2. **Server disconnect**
- If the WebSocket drops unexpectedly, `OnDisconnected()` is called:
- `on_audio_channel_closed_()` runs.
- The device returns to Idle (or retries, depending on policy).
---
## 8. 其它注意事项
## 8. Other Notes
1. **鉴权**
- 设备通过设置 `Authorization: Bearer <token>` 提供鉴权,服务器端需验证是否有效。
- 如果令牌过期或无效,服务器可拒绝握手或在后续断开。
1. **Authentication**
- The device supplies `Authorization: Bearer <token>`; the server must validate it.
- If the token is missing or invalid the server may reject the handshake or terminate the session later.
2. **会话控制**
- 代码中部分消息包含 `session_id`,用于区分独立的对话或操作。服务端可根据需要对不同会话做分离处理。
2. **Session scope**
- Many messages carry a `session_id`, useful when the server serves multiple concurrent interactions.
3. **音频负载**
- 代码里默认使用 Opus 格式,并设置 `sample_rate = 16000`,单声道。帧时长由 `OPUS_FRAME_DURATION_MS` 控制,一般为 60ms。可根据带宽或性能做适当调整。为了获得更好的音乐播放效果,服务器下行音频可能使用 24000 采样率。
3. **Audio payload**
- Default audio format is Opus at 16 kHz, mono. The frame duration is controlled by `OPUS_FRAME_DURATION_MS` (typically 60 ms). The server may use 24 kHz on the downlink for better music playback.
4. **协议版本配置**
- 通过设置中的 `version` 字段配置二进制协议版本1、2 或 3
- 版本1直接发送 Opus 数据
- 版本2使用带时间戳的二进制协议适用于服务器端 AEC
- 版本3使用简化的二进制协议
4. **Binary protocol version selection**
- Configured through the `version` setting:
- v1: raw Opus
- v2: metadata + timestamp (useful for server-side AEC)
- v3: lightweight header
- The value is echoed back in the `Protocol-Version` header and the hello message.
5. **物联网控制推荐 MCP 协议**
- 设备与服务器之间的物联网能力发现、状态同步、控制指令等,建议全部通过 MCP 协议(type: "mcp")实现。原有的 type: "iot" 方案已废弃。
- MCP 协议可在 WebSocket、MQTT 等多种底层协议上传输,具备更好的扩展性和标准化能力。
- 详细用法请参考 [MCP 协议文档](./mcp-protocol.md) [MCP 物联网控制用法](./mcp-usage.md)
5. **IoT control via MCP**
- All IoT capability discovery and control flows through MCP (`type: "mcp"`). The legacy `type: "iot"` protocol is deprecated.
- MCP works over both WebSocket and MQTT, giving better standardization and extensibility.
- See [MCP protocol document](./mcp-protocol.md) and [MCP IoT control usage](./mcp-usage.md) for details.
6. **错误或异常 JSON**
- 当 JSON 中缺少必要字段,例如 `{"type": ...}`,设备端会记录错误日志(`ESP_LOGE(TAG, "Missing message type, data: %s", data);`),不会执行任何业务。
6. **Malformed JSON**
- When a required field such as `type` is missing, the device logs `ESP_LOGE(TAG, "Missing message type, data: %s", data);` and ignores the message.
---
## 9. 消息示例
## 9. Example Message Flow
下面给出一个典型的双向消息示例(流程简化示意):
A simplified two-way exchange:
1. **设备端 → 服务器**(握手)
1. **Device -> Server** (handshake)
```json
{
"type": "hello",
"version": 1,
"features": {
"mcp": true
"mcp": true,
"aec": true
},
"transport": "websocket",
"audio_params": {
@ -428,7 +463,7 @@ stateDiagram
}
```
2. **服务器 → 设备端**(握手应答)
2. **Server -> Device** (handshake ack)
```json
{
"type": "hello",
@ -441,7 +476,7 @@ stateDiagram
}
```
3. **设备端 → 服务器**(开始监听)
3. **Device -> Server** (start listening)
```json
{
"session_id": "xxx",
@ -450,18 +485,18 @@ stateDiagram
"mode": "auto"
}
```
同时设备端开始发送二进制帧Opus 数据)。
The device begins streaming binary Opus frames.
4. **服务器 → 设备端**ASR 结果)
4. **Server -> Device** (ASR result)
```json
{
"session_id": "xxx",
"type": "stt",
"text": "用户说的话"
"text": "what the user said"
}
```
5. **服务器 → 设备端**TTS开始
5. **Server -> Device** (TTS start)
```json
{
"session_id": "xxx",
@ -469,9 +504,9 @@ stateDiagram
"state": "start"
}
```
接着服务器发送二进制音频帧给设备端播放。
The server follows up with binary Opus frames for the device to play.
6. **服务器 → 设备端**TTS结束
6. **Server -> Device** (TTS stop)
```json
{
"session_id": "xxx",
@ -479,17 +514,17 @@ stateDiagram
"state": "stop"
}
```
设备端停止播放音频,若无更多指令,则回到空闲状态。
The device stops playback and, if no further instructions arrive, returns to idle.
---
## 10. 总结
## 10. Summary
本协议通过在 WebSocket 上层传输 JSON 文本与二进制音频帧完成功能包括音频流上传、TTS 音频播放、语音识别与状态管理、MCP 指令下发等。其核心特征:
This protocol carries JSON text and binary Opus frames over a WebSocket connection to implement audio streaming, TTS playback, speech recognition, device state management, MCP dispatch, and more. Key traits:
- **握手阶段**:发送 `"type":"hello"`,等待服务器返回。
- **音频通道**:采用 Opus 编码的二进制帧双向传输语音流,支持多种协议版本。
- **JSON 消息**:使用 `"type"` 为核心字段标识不同业务逻辑,包括 TTSSTTMCPWakeWordSystem、Custom 等。
- **扩展性**:可根据实际需求在 JSON 消息中添加字段,或在 headers 里进行额外鉴权。
- **Handshake**: send `"type":"hello"` and wait for the server reply.
- **Audio channel**: bidirectional Opus streaming, with three binary framing variants.
- **JSON messages**: dispatched by `"type"` (TTS, STT, MCP, WakeWord, System, Alert, Custom, ...).
- **Extensibility**: extra fields in JSON, additional headers for authentication.
服务器与设备端需提前约定各类消息的字段含义、时序逻辑以及错误处理规则,方能保证通信顺畅。上述信息可作为基础文档,便于后续对接、开发或扩展。
Server and device must agree on the meaning, timing, and error handling of each message type so the session runs smoothly. The text above provides the baseline for integration, debugging, and extension.