beaver_project/app-instance/backend/flow.md

# Beaver Backend Flow

这份文档只记录两件事：

1. 我们**为什么这么实现**
2. 当前代码里**真实已经实现了什么**

它不是蓝图，也不是未来设计草稿。以后只要主链、装配逻辑、运行时边界发生变化，就必须同步更新它。

---

## 1. 参考项目各自借什么

当前 Beaver 的实现思路，主要借了三个参考项目，但借的点是分开的。

### 1.1 `OpenHarness`

借的是**模块边界和 Harness 形态**：

1. `Harness / Runtime` 应该和 Web、Gateway、产品接入分开
2. `skills / memory / tools / session / orchestration` 都属于平台层
3. 运行时最好是可装配的，而不是所有逻辑都塞进一个大 agent 类

所以 Beaver 现在一直在做的事情，是把：

- `EngineLoader`
- `AgentLoop`
- `ContextBuilder`
- `Session`
- `Tools`
- `Skills`

收成一个清晰的运行内核。

### 1.2 `hermes-agent`

借的是**memory、skills、session 的运行时风格**：

1. memory 用 curated CRUD + frozen snapshot
2. `session_search` 查历史细节，不把所有历史都塞进 memory
3. skills 用：
   - 显式 skill loading path
   - 激活后的 skill 正文显式注入

所以 Beaver 现在这些点都明显受 Hermes 影响：

1. `MemoryService` + frozen snapshot
2. `session_search`
3. `skill_view`
4. activated skill messages

### 1.3 `swarms`

借的是**后面多智能体 orchestration 的方向**：

1. team orchestration
2. swarm strategy
3. multi-agent execution backend

但要注意：它现在**还不是当前主链的核心**。
当前我们主要先把单 agent runtime 打稳，多智能体还没正式接回主链。

---

## 2. 当前我们到底做到哪了

当前已经不是“搭骨架”阶段了，而是：

**最小单 agent runtime 已经跑通。**

现在已经完成的核心段落是：

1. `4.1 session`
2. `4.2 provider`
3. `4.3 context`
4. `4.4 tools`
5. `4.5 最小主链`
6. `5.1 memory 最小接入`
7. `5.2 skills 最小接入`
8. `6.1 session-first / event-source 第一阶段`

更准确地说，当前 Beaver 已经有：

1. 一个可运行的 `AgentService -> AgentLoop` 主链
2. 一个外部化的 Session 子系统
3. 一个可工作的 tool loop
4. Hermes 风格的 memory / skills 接入
5. LLM-driven 的 `SkillAssembler`

但还没有：

1. 更完整的 shutdown hooks
2. Web / Gateway 的 bus / channels / realtime 全量接入
3. delegation / swarm / team runtime
4. 权限系统
5. MCP 全量工具接回 runtime

---

## 3. 当前真实主链

当前主入口已经不是 CLI 逻辑，而是：

```python
service = AgentService()
await service.process_direct("你好")
```

同时，第 6 阶段的最小运行循环已经有了：

```python
service = AgentService()
await service.start()
result = await service.submit_direct("你好")
await service.stop()
service.close()
```

宿主层现在也已经开始接到这条 lifecycle 上：

```python
app = create_app()        # FastAPI lifespan 内部托管 AgentService.start()/shutdown()
await run_gateway()       # Gateway 常驻进程托管 AgentService.start()/shutdown()
```

这套 lifecycle 当前明确是：

1. `start()` 进入一个 `AgentLoop` 实例的运行模式
2. 运行模式下，外部任务只能走 `submit_direct()`
3. 运行模式下，不允许再直接调用 `process_direct()`
4. `stop()` 是 **instance-scoped**
   - 只针对当前这个 `AgentLoop` 实例
   - 不是 session-scoped
   - 也不是 platform-scoped
5. `stop()` 调用后会拒绝新任务，已入队任务正常收尾
6. `stop()` / `shutdown()` 支持 graceful timeout；必要时可 force cancel
7. `close()` 只能在该实例已停止后调用

### 3.1 Web / Gateway 当前怎么接

这一层现在已经不是纯占位了，而是最小宿主层：

1. `beaver/interfaces/web/app.py`
   - FastAPI lifespan 启动时：
     - 创建或接收 `AgentService`
     - 如果 app 自己创建 service，则 `await service.start()`
   - Web 接口现在有最小正式 schema：
     - `WebChatRequest`
     - `WebChatResponse`
     - `WebStatusResponse`
   - `/api/chat` 请求：
     - 用结构化 request schema 校验输入
     - `await service.submit_direct(...)`
     - 把常见 runtime / config 错误收成 HTTP 错误
     - 外部注入但尚未进入 running mode 的 service，会返回 `503`
   - `/api/ping`：
     - 返回 `status/running/mode`
     - 不会为了 health check 额外 boot runtime
   - app 关闭时：
     - 如果 app 自己创建 service，则 `await service.shutdown(timeout_seconds=5.0, force=True)`
   - app 自己接管 lifecycle 时：
     - 若 `start()` 失败，会立即 `close()` 做 startup cleanup

2. `beaver/interfaces/gateway/main.py`
   - `run_gateway()` 启动时：
     - 如果 gateway 自己创建 service，则 `await service.start()`
   - 持有最小 `MessageBus`
   - 常驻消费 `bus.inbound`
   - 调 `await service.submit_direct(...)`
   - 把结果写回 `bus.outbound`
   - 同时等待 `stop_event`
   - 退出时：
     - 先尝试 `await service.shutdown(timeout_seconds=5.0, force=True)`
     - 再等待 bridge 协程收尾；必要时取消 bridge
    - 如果 gateway 自己接管 lifecycle 且 `start()` 失败：
      - 会立即 `close()` 做 startup cleanup
   - 未处理完的 inbound：
     - 不再静默丢下
     - 会被冲刷成结构化 outbound error

3. `beaver/foundation/events/message_bus.py`
   - 已有最小：
     - `MessageBus`
     - `InboundMessage`
     - `OutboundMessage`
   - 当前只做双队列桥接：
     - `inbound`
     - `outbound`
   - 还没有 broker / topic routing / retry / persistence

所以现在已经明确：

1. Web / Gateway 属于宿主层
2. 它们不直接 new `AgentLoop` 或绕过运行模式
3. 它们复用：
   - `start()`
   - `submit_direct()`
   - `stop()`
   - `shutdown()`
4. ownership 语义：
   - 自己创建的 `AgentService`：自己负责 lifecycle
   - 外部注入的 `AgentService`：默认不自动 start/shutdown，除非显式要求接管
5. gateway 已经从“只会常驻等待”推进到“最小消息桥接层”
   - external inbound message
   - `MessageBus.inbound`
   - `service.submit_direct(...)`
   - `MessageBus.outbound`

### 3.2 总体链路

当前代码里的主链可以概括成：

```text
AgentService
  -> AgentLoop
    -> Session
    -> Memory
    -> SkillAssembler
    -> ContextBuilder
    -> Provider
    -> ToolExecutor
    -> Session writeback
```

### 3.3 详细顺序

```text
用户输入 task
│
├─ AgentService.create_loop()
│  ├─ 创建 AgentLoop(profile, loader)
│  └─ loop.boot()
│
├─ AgentLoop.boot()
│  └─ EngineLoader.load()
│     ├─ SessionManager
│     ├─ MemoryStore
│     ├─ MemoryService
│     ├─ ToolRegistry
│     ├─ ToolExecutor
│     ├─ SkillsLoader
│     ├─ SkillAssembler
│     └─ ContextBuilder
│
├─ AgentLoop.process_direct(task)
│  │
│  ├─ 生成 `session_id` / `run_id`
│  │
│  ├─ memory_service.reload_for_new_run()
│  │  └─ 建立本轮 frozen memory snapshot
│  │
│  ├─ sessions.ensure_session(session_id)
│  ├─ sessions.append_message(event_type="run_started", hidden)
│  │
│  ├─ make_provider_bundle()
│  │  ├─ main provider
│  │  ├─ fallback provider
│  │  ├─ auxiliary provider 可用于 skill 选择
│  │  └─ embedding runtime 提供 embeddings 的 model/api_key/api_base
│  │     说明：它是独立配置线，只支持 OpenAI-compatible embeddings endpoint
│  │
│  ├─ skill_assembler.assemble(task_description=task, provider=selector_provider, embedding_runtime=..., ...)
│  │  ├─ 读取全量可用 skill 候选摘要
│  │  ├─ 用 `text-embedding-v4` 对全量候选做相似度召回
│  │  ├─ 把召回结果交给 LLM 做最终选择
│  │  └─ 返回 activated_skills
│  │
│  ├─ ContextBuilder.build_skill_activation_messages(...)
│  ├─ 如果 activated_skills 非空：
│  │  └─ sessions.append_message(event_type="skill_activation_snapshotted", hidden)
│  │
│  ├─ ContextBuilder.build_messages()
│  │  ├─ system prompt 包含：
│  │  │  ├─ base system prompt
│  │  │  ├─ session metadata
│  │  │  ├─ execution context
│  │  │  └─ frozen memory snapshot
│  │  ├─ messages 里显式插入 activated skill messages
│  │  ├─ 再拼 visible history
│  │  └─ 最后追加当前 user input
│  │
│  ├─ sessions.update_system_prompt()
│  ├─ sessions.append_message(event_type="system_prompt_snapshotted", hidden)
│  ├─ sessions.append_message(event_type="user_message_added")
│  │
│  ├─ 进入最小 tool loop
│  │  ├─ provider.chat(messages, tools=schemas)
│  │  ├─ sessions.update_usage()
│  │  ├─ sessions.append_message(event_type="assistant_message_added")
│  │  ├─ ContextBuilder.add_assistant_message(...)
│  │  ├─ 如果没有 tool calls：
│  │  │  └─ 结束
│  │  └─ 如果有 tool calls：
│  │     ├─ ToolExecutor.execute_tool_call(...)
│  │     ├─ sessions.append_message(event_type="tool_result_recorded")
│  │     ├─ ContextBuilder.add_tool_result(...)
│  │     └─ 再回 provider.chat(...)
│  │
│  ├─ 成功结束：
│  │  └─ sessions.append_message(event_type="run_completed", hidden)
│  │
│  ├─ 异常结束：
│  │  ├─ 补 assistant error message
│  │  └─ sessions.append_message(event_type="run_failed", hidden)
│  │
│  └─ return AgentRunResult
│     ├─ session_id
│     ├─ run_id
│     ├─ output_text
│     ├─ finish_reason
│     ├─ tool_iterations
│     ├─ provider_name
│     ├─ model
│     └─ usage
```

---

## 4. 当前模块边界

### 4.1 `EngineLoader`

职责：装配运行时依赖。

当前已经装配：

1. `SessionManager`
2. `MemoryStore`
3. `MemoryService`
4. `ToolRegistry`
5. `ToolExecutor`
6. `SkillsLoader`
7. `SkillAssembler`
8. `ContextBuilder`

### 4.2 `AgentLoop`

职责：执行单次 run。

当前已经负责：

1. direct run 主链
2. provider 调用
3. 最小 tool loop
4. session 事件写回
5. usage 汇总

当前还没负责：

1. 更复杂的 message bus mode
2. 多 worker / 并发调度
3. 更完整的 runtime lifecycle
4. multi-agent orchestration

### 4.3 `Session`

职责：外部化的运行事实存储。

当前实现重点：

1. `sessions` 表
   - projection / summary row
2. `messages` 表
   - 当前主事件流
3. `run_id`
   - 把同一个 session 里的多次 run 切开

当前主要读取接口：

1. `get_event_records(session_id)`
   - 整个 session 的完整事件流
2. `get_run_event_records(session_id, run_id)`
   - 某一次 run 的事件片段
3. `list_run_ids(session_id)`
   - 发现当前 session 中有哪些 run
4. `get_visible_history(session_id)`
   - 给 ContextBuilder 用的可见历史切片
5. `session_search`
   - 只检索可见 transcript
   - 不把 hidden prompt / skill snapshot 当成搜索候选

当前关键 hidden 事件：

1. `run_started`
2. `skill_activation_snapshotted`
3. `system_prompt_snapshotted`
4. `run_completed`
5. `run_failed`

### 4.4 `Memory`

职责：durable facts，不是 transcript。

当前实现重点：

1. curated CRUD
2. frozen snapshot
3. 每次新 run 开始时刷新 snapshot
4. 当前 run 中途写 memory 不反向污染本轮 prompt

### 4.5 `Skills`

职责：外置 skill 装配与按需查看。

当前实现重点：

1. `SkillsLoader`
   - 扫描 `workspace/skills/*/SKILL.md`
   - 扫描 builtin skills
2. `SkillAssembler`
   - 输入 task description + 候选 skill 摘要
   - 先用 embedding 做语义召回
   - 再调一次 LLM 直接选择 skills
   - 没有匹配时返回空 skills
3. `skill_view`
   - 显式加载 skill 正文或支持文件
4. activated skills
   - 按 Hermes 风格作为显式消息注入

当前 skill 语义已经定成：

1. **run-scoped**
   - skill 激活只对当前 run 生效
2. **不是 session-scoped**
   - 不默认跨 run 持久化为 session 状态
3. **explicit loading path**
    - `skill_view`
4. **no-match means no skill injection**
   - 如果 assembler 没选出 skill
   - 当前 run 不拼接 skill messages
   - 也不会写 `skill_activation_snapshotted`

### 4.6 `Tools`

当前内建工具：

1. `echo`
2. `memory`
3. `skill_view`
4. `session_search`

当前工具基础设施：

1. `ToolSpec`
2. `ObjectBackedTool`
3. `ToolRegistry`
4. `ToolExecutor`

### 4.7 `Providers`

当前已经实现：

1. provider registry
2. runtime resolution
3. main provider
4. fallback provider

当前状态：

1. fallback 已经是“每次调用都先 main，再 fallback”
2. auxiliary provider 已经可用于 skill 选择
3. auxiliary provider 还没有进入主对话 tool loop

---

## 5. 当前最重要的设计决定

这几条是现在已经定下来的，不应该再反复漂：

### 5.1 `Session-first`

当前 Beaver 明确在往这个方向走：

1. 运行事实优先写回 Session
2. Session 是 replay / audit / resume 的基础
3. prompt 不是状态源，Session 才是

### 5.2 `Harness != Product Interface`

当前主入口已经是：

- `AgentService`
- `AgentLoop`

而不是 CLI 本身。
CLI、Web、Gateway 后面都应该只是接口层。

### 5.3 `Skill selection` 外置

已经不再让 `AgentLoop` 自己“决定该选哪个 skill”，而是：

```text
task description
  -> SkillAssembler
    -> AgentLoop
```

### 5.4 `Skills` 采用 Hermes 风格

不是：

- skill 正文长期塞进 system prompt
- summary 让模型自己猜怎么展开

而是：

1. activated skill messages
2. `skill_view`

---

## 6. 当前还没完成什么

这部分是接下来继续施工的重点。

### 6.1 运行时生命周期

已做第一步：

1. `EngineLoadResult.close()`
2. `AgentLoop.close()`
3. `AgentService.close()`
4. `AgentService.shutdown()`

已做第二步的最小版本：

1. `AgentLoop.run()`
2. `AgentLoop.stop()`
3. `AgentLoop.submit_direct()`

还没做：

1. 统一 shutdown hooks
2. 更完整的 provider/client 资源释放协议
3. 多 worker / bus / 调度策略

### 6.2 Web / Gateway 接主链

现在主链已经能跑，但还没正式变成：

1. Web 真正调用 `AgentService.process_direct()`
2. Gateway 真正调用 `AgentService.process_direct()`

### 6.3 Session 更完整的 event-source 能力

还没做：

1. checkpoint
2. rewind
3. fork session
4. crash-resume protocol

### 6.4 Multi-agent / swarms

还没正式接回主链：

1. delegation
2. team runtime
3. swarms orchestration backend

但 lifecycle 关系已经先定下来了：

1. team 不会共享一个大 `AgentLoop` 跑所有成员
2. 每个 team member 都应有自己独立的 `AgentService / AgentLoop`
3. team coordinator 在上层调度多个 member 实例
4. 因此当前这套 `start()/submit_direct()/stop()/close()` 首先是 member-level lifecycle
2. team runtime
3. swarms backend
4. group discussion / workflow orchestration

### 6.5 权限与治理

还没做：

1. permission gates
2. tool policy
3. MCP 工具治理

---

## 7. 下一步从哪开始最合理

如果现在继续施工，最合理的顺序是：

1. 先把 `flow.md` 作为当前基线固定下来
2. 再继续第 6 阶段：
   - runtime lifecycle
   - `boot / close / run / stop`
3. 然后再接：
   - Web / Gateway
4. 最后才是：
   - multi-agent / swarms

一句话总结：

**当前 Beaver 已经有一个可运行的单 agent runtime；接下来不是继续堆局部能力，而是把它升级成有完整生命周期的标准 harness。**

---

## 8. 文档维护要求

以后只要发生以下任一变动，必须同步更新本文件：

1. `EngineLoader` 装配项变化
2. `AgentLoop` 主链变化
3. `Session` 事件流结构变化
4. `Memory` 接入方式变化
5. `Skills` 装配方式变化
6. `Tools` 默认集合变化
7. Web / Gateway / multi-agent 真正接入主链