Files
xiaozhi-esp32/docs/mqtt-udp.md
Terrence 87f6faee79 Update Chinese and English documentation for custom board setup and MCP protocol
- Added a comprehensive guide for creating custom boards in the XiaoZhi AI project, detailing directory structure, configuration files, and initialization code.
- Introduced a new document explaining the MCP protocol for IoT control, including message formats and interaction flows.
- Updated existing documentation to reflect changes in tool registration and usage examples for the MCP protocol.
- Enhanced README files for better clarity and consistency across languages.
2026-04-17 03:36:37 +08:00

410 lines
9.5 KiB
Markdown

# MQTT + UDP Hybrid Communication Protocol
This document describes the MQTT + UDP hybrid protocol used between the device and the server, based on the current implementation: MQTT carries control messages, UDP carries real-time audio.
---
## 1. Overview
The protocol uses two channels:
- **MQTT** - control messages, state synchronization, JSON payloads.
- **UDP** - real-time audio, encrypted.
### 1.1 Key characteristics
- **Dual channel design** - control is separated from data so audio has low latency.
- **Encrypted transport** - UDP audio is encrypted with AES-CTR.
- **Sequence numbers** - guard against replay and reordering.
- **Automatic reconnect** - MQTT reconnects on disconnect.
---
## 2. End-to-end Flow
```mermaid
sequenceDiagram
participant Device as ESP32 device
participant MQTT as MQTT broker
participant UDP as UDP server
Note over Device, UDP: 1. Establish MQTT connection
Device->>MQTT: MQTT Connect
MQTT->>Device: Connected
Note over Device, UDP: 2. Request audio channel
Device->>MQTT: Hello message (type: "hello", transport: "udp")
MQTT->>Device: Hello response (UDP endpoint + encryption keys)
Note over Device, UDP: 3. Establish UDP connection
Device->>UDP: UDP Connect
UDP->>Device: Connected
Note over Device, UDP: 4. Audio streaming
loop Audio stream
Device->>UDP: Encrypted audio (Opus)
UDP->>Device: Encrypted audio (Opus)
end
Note over Device, UDP: 5. Control messages
par Control
Device->>MQTT: Listen / TTS / MCP messages
MQTT->>Device: STT / TTS / MCP / Alert responses
end
Note over Device, UDP: 6. Teardown
Device->>MQTT: Goodbye
Device->>UDP: Disconnect
```
---
## 3. MQTT Control Channel
### 3.1 Connection
The device connects to the broker using:
- **Endpoint** - broker host and port.
- **Client ID** - device identifier.
- **Username / Password** - credentials.
- **Keep Alive** - heartbeat interval (default 240 s).
### 3.2 Hello exchange
#### 3.2.1 Device -> Server
```json
{
"type": "hello",
"version": 3,
"transport": "udp",
"features": {
"mcp": true,
"aec": true
},
"audio_params": {
"format": "opus",
"sample_rate": 16000,
"channels": 1,
"frame_duration": 60
}
}
```
`features.mcp` is always set; `features.aec` is set when `CONFIG_USE_SERVER_AEC` is enabled.
#### 3.2.2 Server -> Device
```json
{
"type": "hello",
"transport": "udp",
"session_id": "xxx",
"audio_params": {
"format": "opus",
"sample_rate": 24000,
"channels": 1,
"frame_duration": 60
},
"udp": {
"server": "192.168.1.100",
"port": 8888,
"key": "0123456789ABCDEF0123456789ABCDEF",
"nonce": "0123456789ABCDEF0123456789ABCDEF"
}
}
```
Field reference:
- `udp.server` - UDP server address.
- `udp.port` - UDP server port.
- `udp.key` - AES key, hex-encoded.
- `udp.nonce` - AES nonce, hex-encoded.
### 3.3 JSON message types
#### 3.3.1 Device -> Server
1. **Listen**
```json
{
"session_id": "xxx",
"type": "listen",
"state": "start",
"mode": "manual"
}
```
2. **Abort**
```json
{
"session_id": "xxx",
"type": "abort",
"reason": "wake_word_detected"
}
```
3. **MCP**
```json
{
"session_id": "xxx",
"type": "mcp",
"payload": {
"jsonrpc": "2.0",
"id": 1,
"result": {}
}
}
```
4. **Goodbye**
```json
{
"session_id": "xxx",
"type": "goodbye"
}
```
#### 3.3.2 Server -> Device
Semantics match the WebSocket protocol. Supported types:
- **STT** - speech recognition result.
- **TTS** - TTS lifecycle (`start`, `stop`, `sentence_start`).
- **LLM** - emotion update for the UI.
- **MCP** - IoT control.
- **System** - system control, e.g. `"command": "reboot"`.
- **Alert** - show an alert on the UI; fields: `status`, `message`, `emotion`.
- **Goodbye** - server-initiated shutdown of the audio session. The device responds by closing the UDP channel without sending its own goodbye.
- **Custom** (optional, enabled via `CONFIG_RECEIVE_CUSTOM_MESSAGE`).
Example alert:
```json
{
"session_id": "xxx",
"type": "alert",
"status": "Warning",
"message": "Battery low",
"emotion": "sad"
}
```
---
## 4. UDP Audio Channel
### 4.1 Establishing the channel
After the device receives the MQTT hello response, it:
1. Parses the UDP host and port.
2. Parses the AES key and nonce.
3. Initializes the AES-CTR context.
4. Opens the UDP socket.
### 4.2 Audio packet format
#### 4.2.1 Encrypted audio packet
```
|type 1B|flags 1B|payload_len 2B|ssrc 4B|timestamp 4B|sequence 4B|
|payload payload_len bytes|
```
Field reference:
- `type`: packet type, always `0x01`.
- `flags`: flags, currently unused.
- `payload_len`: payload length (network byte order).
- `ssrc`: synchronization source identifier.
- `timestamp`: timestamp (network byte order).
- `sequence`: sequence number (network byte order).
- `payload`: encrypted Opus audio data.
#### 4.2.2 Encryption
Uses **AES-CTR** with:
- **Key**: 128-bit, provided by the server.
- **Nonce**: 128-bit, provided by the server.
- **Counter**: built from the timestamp and sequence number.
### 4.3 Sequence number management
- **Sender**: `local_sequence_` is incremented monotonically.
- **Receiver**: `remote_sequence_` validates continuity.
- **Anti-replay**: packets with sequence numbers below the expected value are dropped.
- **Tolerance**: small gaps are logged as warnings but still accepted.
### 4.4 Error handling
1. **Decryption failure** - log an error and drop the packet.
2. **Sequence gap** - log a warning, continue processing the packet.
3. **Malformed packet** - log an error and drop.
---
## 5. State Management
### 5.1 Connection state
```mermaid
stateDiagram
direction TB
[*] --> Disconnected
Disconnected --> MqttConnecting: StartMqttClient()
MqttConnecting --> MqttConnected: MQTT Connected
MqttConnecting --> Disconnected: Connect failed
MqttConnected --> RequestingChannel: OpenAudioChannel()
RequestingChannel --> ChannelOpened: Hello exchange success
RequestingChannel --> MqttConnected: Hello timeout / failed
ChannelOpened --> UdpConnected: UDP connect success
UdpConnected --> AudioStreaming: Start audio
AudioStreaming --> UdpConnected: Stop audio
UdpConnected --> ChannelOpened: UDP disconnect
ChannelOpened --> MqttConnected: CloseAudioChannel()
MqttConnected --> Disconnected: MQTT disconnect
```
### 5.2 State check
The device determines whether the audio channel is available with:
```cpp
bool IsAudioChannelOpened() const {
return udp_ != nullptr && !error_occurred_ && !IsTimeout();
}
```
---
## 6. Configuration Parameters
### 6.1 MQTT settings
Read from storage:
- `endpoint` - broker address.
- `client_id` - client identifier.
- `username` - user name.
- `password` - password.
- `keepalive` - keep-alive interval (default 240 s).
- `publish_topic` - publish topic.
### 6.2 Audio parameters
- **Format**: Opus
- **Sample rate**: 16 kHz device / 24 kHz server
- **Channels**: 1 (mono)
- **Frame duration**: 60 ms
---
## 7. Error Handling and Reconnection
### 7.1 MQTT reconnect
- Automatic retry on connect failure.
- Optional error reporting.
- Clean-up runs on disconnect.
### 7.2 UDP connection
- No automatic retry; depends on re-negotiation via MQTT.
- Status can be queried at any time.
### 7.3 Timeouts
The base `Protocol` class provides timeout detection:
- Default timeout: 120 s.
- Based on the time since the last incoming packet.
- After a timeout the channel is marked unavailable.
---
## 8. Security
### 8.1 Transport encryption
- **MQTT**: supports TLS/SSL (port 8883).
- **UDP**: AES-CTR on audio payloads.
### 8.2 Authentication
- **MQTT**: user name / password.
- **UDP**: keys are distributed via the MQTT channel.
### 8.3 Anti-replay
- Monotonically increasing sequence numbers.
- Stale packets are dropped.
- Timestamps are validated.
---
## 9. Performance Notes
### 9.1 Concurrency
A mutex protects the UDP connection:
```cpp
std::lock_guard<std::mutex> lock(channel_mutex_);
```
### 9.2 Memory management
- Network objects are created and destroyed dynamically.
- Audio packets are managed with smart pointers.
- Encryption contexts are released promptly.
### 9.3 Network optimizations
- UDP connection reuse.
- Reasonable packet sizes.
- Sequence continuity checks.
---
## 10. Comparison with WebSocket
| Feature | MQTT + UDP | WebSocket |
|---------|------------|-----------|
| Control channel | MQTT | WebSocket |
| Audio channel | UDP (encrypted) | WebSocket (binary) |
| Latency | Low (UDP) | Medium |
| Reliability | Medium | High |
| Complexity | High | Low |
| Encryption | AES-CTR | TLS |
| Firewall friendliness | Low | High |
---
## 11. Deployment Notes
### 11.1 Network
- Ensure UDP ports are reachable.
- Configure firewall rules accordingly.
- Plan for NAT traversal if needed.
### 11.2 Server infrastructure
- MQTT broker configuration.
- UDP server deployment.
- Key management.
### 11.3 Monitoring
- Connection success rate.
- Audio transmission latency.
- Packet loss.
- Decryption failures.
---
## 12. Summary
The MQTT + UDP hybrid protocol achieves efficient audio communication through:
- **Split architecture** - separate control and data channels with clear responsibilities.
- **Encryption** - AES-CTR protects audio payloads.
- **Sequence management** - prevents replay and reordering.
- **Automatic recovery** - MQTT reconnects on failure.
- **Performance** - UDP keeps audio latency low.
The protocol is a good fit for low-latency voice interaction, at the cost of higher network complexity than pure WebSocket.