Beaver Launch And Maintenance Runbook

Date: 2026-06-09

Scope: whole Beaver product.

1. Launch Principle

Launch Beaver through controlled pilots before broad rollout.

The product has a wide operational surface: auth, deployment control, routing, per-instance app containers, model providers, Agent runtime, tools, files, skills, memory, scheduled work, and connectors. A successful launch depends as much on reliability and trust as on feature completeness.

2. Launch Roles

| Role | Responsibility |
| --- | --- |
| Launch owner | Owns readiness, go/no-go, rollout phases |
| Deployment owner | Owns Docker images, network, router, instance lifecycle |
| Backend owner | Owns Agent runtime, tasks, tools, skills, cron, APIs |
| Frontend owner | Owns user-facing flows and UI verification |
| Security owner | Owns control-plane exposure, data boundaries, tool/connector policy |
| Pilot owner | Owns user onboarding, workflow selection, feedback, metrics |
| Support owner | Owns incident triage, runbook updates, user support |

3. Launch Phases

Phase 0: Local Internal Readiness

Audience: builders and internal testers.

Goals:

  • Full local deployment works.
  • Core demo flows are stable.
  • Known risks are documented.

Required flows:

  • Register/login.
  • Provider onboarding.
  • First chat response.
  • Chat-to-task.
  • Task acceptance/revision.
  • File upload/preview/download/delete.
  • Skill list/candidate/draft/review.
  • Settings/status/restart.

Exit criteria:

  • Fresh deployment run completed from docs.
  • No P0 or launch-blocking P1 issues.
  • Demo script works end to end.

Phase 1: Controlled Pilot

Audience: one internal team or one trusted customer team.

Goals:

  • Validate real workflow value.
  • Validate deployment and support process.
  • Validate trust, evidence, and governance story.

Constraints:

  • Narrow workflow scope.
  • Narrow connector scope.
  • Conservative tool policy.
  • Human review for skill publishing.
  • No opaque memory use for sensitive data.

Exit criteria:

  • At least 30 accepted tasks in 30 days.
  • At least 2 recurring workflows.
  • 0 critical incidents.
  • Deployment/support issues documented and reduced.
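
The quantitative Phase 1 exit criteria above can be checked mechanically at the end of the pilot window. The following is a minimal sketch, not part of Beaver: `PilotStats` and its field names are hypothetical, and the task and workflow thresholds are read as minimums.

```python
from dataclasses import dataclass


@dataclass
class PilotStats:
    """Hypothetical 30-day pilot summary; field names are illustrative only."""
    accepted_tasks_30d: int
    recurring_workflows: int
    critical_incidents: int


def phase1_exit_gaps(stats: PilotStats) -> list[str]:
    """Return the unmet quantitative exit criteria (empty list means all met)."""
    gaps = []
    if stats.accepted_tasks_30d < 30:
        gaps.append(f"accepted tasks: {stats.accepted_tasks_30d}/30")
    if stats.recurring_workflows < 2:
        gaps.append(f"recurring workflows: {stats.recurring_workflows}/2")
    if stats.critical_incidents > 0:
        gaps.append(f"critical incidents: {stats.critical_incidents} (must be 0)")
    return gaps
```

The last criterion (deployment/support issues documented and reduced) is qualitative and still needs a human judgment from the pilot and support owners.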

Phase 2: Expanded Pilot

Audience: more users in the same team, or a second pilot team.

Goals:

  • Test repeatability across workflows.
  • Introduce Memory Control Center or stricter memory policy if ready.
  • Strengthen skill reuse and scheduled work.

Exit criteria:

  • Skill reuse becomes visible.
  • Admin can operate without developer pairing for common tasks.
  • Evidence and report quality are accepted by workflow owner.

Phase 3: Production Candidate

Audience: broader customer or department rollout.

Goals:

  • Stabilized deployment.
  • Health monitoring.
  • Incident response.
  • Backup/restore process.
  • Policy profiles.

Exit criteria:

  • Launch owner, security owner, and deployment owner approve.
  • Support process has clear ownership.
  • Rollback and restore are rehearsed.

4. Pre-Launch Checklist

Deployment

  • Images build successfully.
  • Docker network exists.
  • Router proxy starts.
  • AuthZ service starts.
  • Deploy control starts.
  • Auth portal starts.
  • App instance can be created.
  • App instance route works through router proxy.
  • Provider config can be written and instance restarted.
  • Runtime directories are persistent.
  • Public exposure is limited to intended services.

Product Flows

  • Register/login works.
  • Provider onboarding works.
  • Chat workbench loads.
  • Task creation works.
  • Task detail timeline works.
  • Acceptance/revision/abandon works.
  • Files page works.
  • Tools page works for pilot tools.
  • Skills page works.
  • Marketplace install works if included.
  • Cron/scheduled flow works if included.
  • Connector flow works if included.
  • Settings/status/logs work.

Governance

  • Tool policy for pilot is documented.
  • Connector side effects are understood.
  • Skill publish gates are documented.
  • Memory behavior is documented.
  • Data retention expectations are documented.
  • User-facing limitations are documented.

Support

  • Pilot support channel exists.
  • Incident owner is assigned.
  • Logs and health checks are accessible.
  • Backup/restore expectations are clear.
  • Known issues list exists.

5. Monitoring

Product Metrics

| Metric | Owner | Cadence |
| --- | --- | --- |
| Accepted tasks | Pilot owner | Weekly |
| Acceptance rate | Product owner | Weekly |
| Revision rate | Product owner | Weekly |
| Active workflows | Pilot owner | Weekly |
| Skill candidates and reuse | Product owner | Weekly |
| Scheduled run success | Backend owner | Weekly |
| Time to first accepted task | Product/design | Per onboarding |
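
The runbook does not define how acceptance rate and revision rate are computed. One possible reading, sketched below, divides by the number of tasks that reached a review outcome in the week. The outcome labels (`accepted`, `revision`, `abandoned`) and the function name are assumptions for illustration, not Beaver identifiers.

```python
from collections import Counter
from typing import Iterable, Optional


def weekly_rates(outcomes: Iterable[str]) -> dict[str, Optional[float]]:
    """Compute acceptance and revision rates from one week of review outcomes.

    Each outcome is assumed to be 'accepted', 'revision', or 'abandoned'.
    Rates are None when no task was reviewed, to avoid dividing by zero.
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        return {"acceptance_rate": None, "revision_rate": None}
    return {
        "acceptance_rate": counts["accepted"] / total,
        "revision_rate": counts["revision"] / total,
    }
```

Whatever definition the product owner settles on, it should be written down once so that week-over-week numbers stay comparable.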

Operational Metrics

| Metric | Owner | Alert |
| --- | --- | --- |
| Instance creation failures | Deployment owner | >10% during pilot |
| Router route failures | Deployment owner | Any repeated failure |
| Provider setup failures | Support owner | >20% of onboarded users |
| Task run failures | Backend owner | >20% for 2 days |
| WebSocket/runtime disconnects | Backend/frontend | Repeated user-visible failures |
| File operation failures | Backend owner | Any permission/path issue |
| Tool execution failures | Backend owner | Repeated by tool category |
| Cron failures | Backend owner | Any critical scheduled workflow missed |
| Connector failures | Integration owner | Failed auth or unintended write |
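
The percentage-based alerts in this table can be evaluated with two small helpers. This is a sketch under stated assumptions: the function names are hypothetical, and "for 2 days" is interpreted as the two most recent consecutive daily failure rates both exceeding the threshold.

```python
def rate_exceeds(failures: int, attempts: int, threshold: float) -> bool:
    """True if failures/attempts is strictly above the threshold.

    Used for alerts such as instance creation failures (>10%) or
    provider setup failures (>20%). No attempts means no alert.
    """
    if attempts == 0:
        return False
    return failures / attempts > threshold


def sustained_failure_alert(daily_rates: list[float],
                            threshold: float = 0.20,
                            days: int = 2) -> bool:
    """True if each of the last `days` daily failure rates exceeds the threshold.

    Assumption: daily_rates is ordered oldest to newest, one entry per day.
    """
    if len(daily_rates) < days:
        return False
    return all(rate > threshold for rate in daily_rates[-days:])
```

For example, `rate_exceeds(2, 10, 0.10)` flags a 20% instance creation failure rate, while `sustained_failure_alert([0.25, 0.10])` does not fire because the most recent day recovered.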

Security Metrics

| Metric | Alert |
| --- | --- |
| Control-plane public exposure | Immediate P0 |
| Cross-instance data access | Immediate P0 |
| Unintended external write | Immediate P0 |
| Credential leak in logs/report | Immediate P0 |
| Unsafe skill publish | P1, or P0 if external action risk |
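
The severity rules in this table are simple enough to encode, which keeps triage consistent across support rotations. A minimal sketch follows; the event keys are hypothetical labels chosen for illustration, not identifiers from Beaver.

```python
# Hypothetical event keys mirroring the rows of the security metrics table.
IMMEDIATE_P0_EVENTS = {
    "control_plane_public_exposure",
    "cross_instance_data_access",
    "unintended_external_write",
    "credential_leak",
}


def security_severity(event: str, external_action_risk: bool = False) -> str:
    """Map a security event to its severity as listed in the table."""
    if event in IMMEDIATE_P0_EVENTS:
        return "P0"
    if event == "unsafe_skill_publish":
        # P1 by default, escalated when the skill could act on external systems.
        return "P0" if external_action_risk else "P1"
    raise ValueError(f"unknown security event: {event}")
```

Raising on unknown events is a deliberate choice in this sketch: an unclassified security event should reach a human rather than silently default to a low severity.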

6. Health Checks

Control Plane

  • Auth portal reachable.
  • AuthZ service reachable internally.
  • Deploy control reachable internally with token.
  • Router proxy has generated routes.
  • Instance registry is readable and consistent.

App Instance

  • Frontend loads.
  • Backend /api/status responds.
  • WebSocket works.
  • Provider config present.
  • Workspace path mounted.
  • Initial skills present.
  • Logs accessible.
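
The backend status check can be scripted with the Python standard library. The sketch below assumes only that `/api/status` answers with HTTP 200 when the backend is up; the response body is not inspected because its shape is not specified here. The function name and the `opener` parameter are illustrative, and the base URL in any call is whatever route the router proxy exposes for the instance.

```python
import urllib.error
import urllib.request


def backend_status_ok(base_url: str, timeout: float = 5.0,
                      opener=urllib.request.urlopen) -> bool:
    """Return True if GET {base_url}/api/status answers with HTTP 200.

    `opener` is injectable so the check can be exercised without a network.
    """
    url = base_url.rstrip("/") + "/api/status"
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, DNS failures, and non-2xx
        # responses, which urlopen raises as HTTPError.
        return False
```

A passing status check does not cover the WebSocket, provider config, or workspace mount items in the list above; those need their own probes.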

Product Runtime

  • Chat request succeeds.
  • Task run succeeds.
  • File API succeeds.
  • Tool registry loads.
  • Skills list loads.
  • Cron scheduler active if enabled.
  • Connector status loads if enabled.

7. Incident Response

P0: Control Plane Exposed

Examples:

  • deploy-control accessible from public internet.
  • authz-service accessible from public internet.
  • Internal token leaked.

Actions:

  1. Remove public route/firewall exposure.
  2. Rotate affected tokens.
  3. Review access logs.
  4. Confirm no unauthorized instance operations.
  5. Update deployment checklist.
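
After step 1, it helps to confirm from a machine outside the private network that the control-plane ports no longer accept connections. The following is a rough sketch using a plain TCP connect; the function name is hypothetical, and the hosts and ports to probe depend on the actual deployment and are not specified in this runbook.

```python
import socket


def is_tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unroutable: treated as not reachable.
        return False
```

A `False` result from one vantage point is only weak evidence. Firewall rules can differ by source address, so the probe should be repeated from a genuinely public network before closing the incident.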

P0: Cross-Instance Data Leak

Examples:

  • Instance A reads Instance B workspace.
  • Router sends user to wrong instance.
  • Shared connector callback writes to wrong instance.

Actions:

  1. Disable affected route or instance.
  2. Preserve logs and registry.
  3. Identify path/host/callback mapping failure.
  4. Patch and add regression test.
  5. Notify affected stakeholders.

P0: Unintended External Action

Examples:

  • Email or IM message sent unexpectedly.
  • Calendar invite created unexpectedly.
  • External system updated without user intent.

Actions:

  1. Disable connector or tool.
  2. Preserve task/tool evidence.
  3. Identify initiating task, tool, arguments, user, connector account.
  4. Patch policy or confirmation gate.
  5. Add test case and update pilot policy.

P1: New User Cannot Reach Instance

Actions:

  1. Check auth portal logs.
  2. Check authz register flow.
  3. Check deploy-control register/configure flow.
  4. Check instance registry.
  5. Check router route generation.
  6. Check container state and app logs.
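
The steps above follow the request path in order, so the first failing step is usually the place to start. That ordering can be sketched as a small runner; each probe here is a hypothetical callable that the operator supplies (for example a log grep or a registry lookup), not an existing Beaver API.

```python
from typing import Callable, Optional

# A check is a (name, probe) pair; the probe returns True when the step is healthy.
Check = tuple[str, Callable[[], bool]]


def first_failing_check(checks: list[Check]) -> Optional[str]:
    """Run checks in order and return the name of the first one that fails.

    Returns None when every check passes. Later checks are skipped after a
    failure because they depend on the earlier steps having succeeded.
    """
    for name, probe in checks:
        if not probe():
            return name
    return None


# Illustrative wiring with stubbed probes; real probes would inspect logs,
# the instance registry, generated routes, and container state.
example_checks: list[Check] = [
    ("auth portal logs", lambda: True),
    ("authz register flow", lambda: True),
    ("deploy-control register/configure flow", lambda: True),
    ("instance registry", lambda: False),
    ("router route generation", lambda: True),
    ("container state and app logs", lambda: True),
]
```

With the stubbed probes, `first_failing_check(example_checks)` returns `"instance registry"`, pointing triage at the registry before anyone looks at the router or container.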

P1: Provider Config Broken

Actions:

  1. Check settings/status.
  2. Confirm config path and provider fields.
  3. Test provider credentials.
  4. Restart instance if config was changed.
  5. Improve onboarding copy if user error.

P1: Task Runtime Failing

Actions:

  1. Check backend logs.
  2. Check provider availability.
  3. Check tool registry.
  4. Check task event timeline.
  5. Reproduce with minimal chat request.
  6. Mark affected pilot workflow as paused if repeated.

P2: UI Flow Confusing

Actions:

  1. Record screen and user quote.
  2. Add to UX issue list.
  3. Determine whether it blocks pilot success.
  4. Fix copy/layout if low effort.

8. Maintenance Cadence

Daily During Pilot

  • Check critical incidents.
  • Check instance health.
  • Check failed task runs.
  • Check support channel.
  • Review provider/connector errors.

Weekly

  • Review accepted tasks and acceptance rate.
  • Review workflow success/failure.
  • Review skill candidates and reuse.
  • Review deployment issues.
  • Review security/tool/connector events.
  • Update known issues and runbook.

Monthly

  • Rehearse fresh deployment.
  • Review backup/restore approach.
  • Review memory and skill governance.
  • Review connector roadmap.
  • Review pilot ROI and expansion decision.

Quarterly

  • Revisit product positioning.
  • Revisit architecture scaling assumptions.
  • Decide team workspace / RBAC roadmap.
  • Review security model and policy profiles.

9. Backup And Restore

Minimum data to preserve:

  • authz-service/runtime/data
  • app-instance/runtime/instances
  • app-instance/runtime/registry
  • router-proxy/runtime/conf.d

Per instance:

  • beaver-home/config.json
  • beaver-home/web_auth_users.json
  • beaver-home/workspace/
  • skill and runtime state under instance data.

Pilot requirements:

  • Document manual backup command.
  • Document manual restore procedure.
  • Test restore for at least one non-production instance before expanded pilot.
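
As a starting point for the manual backup command, the minimum paths listed above can be archived with the Python standard library. This is a sketch under assumptions: the deployment root directory and the function name are hypothetical, and the script does not stop services, so anything written during the copy may be captured in an inconsistent state.

```python
import tarfile
from pathlib import Path

# Minimum paths to preserve, relative to the deployment root, as listed above.
BACKUP_PATHS = (
    "authz-service/runtime/data",
    "app-instance/runtime/instances",
    "app-instance/runtime/registry",
    "router-proxy/runtime/conf.d",
)


def backup_runtime(root: Path, archive: Path,
                   rel_paths: tuple[str, ...] = BACKUP_PATHS) -> list[str]:
    """Write a gzip-compressed tar of each existing path under root.

    Returns the relative paths that were missing so the operator can decide
    whether the backup is complete. The archive should live outside the
    directories being backed up.
    """
    missing = []
    with tarfile.open(str(archive), "w:gz") as tar:
        for rel in rel_paths:
            src = root / rel
            if src.exists():
                # arcname keeps paths relative so restore can target any root.
                tar.add(str(src), arcname=rel)
            else:
                missing.append(rel)
    return missing
```

The per-instance files (`beaver-home/config.json`, `beaver-home/web_auth_users.json`, `beaver-home/workspace/`) need to be covered as well; whether they already fall under `app-instance/runtime/instances` depends on the deployment layout and should be verified during the restore test.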

10. Change Management

Before changing any of these, require launch owner review:

  • Routing/proxy config.
  • AuthZ issuer/internal URL.
  • Deploy token names or values.
  • Instance registry format.
  • Workspace mount paths.
  • Provider config schema.
  • Tool execution policy.
  • Connector callback routing.
  • Skill publish gates.
  • Memory default behavior.

11. Rollback

Rollback options:

  • Roll back frontend/backend image for app instances.
  • Disable specific connector.
  • Disable scheduled job execution.
  • Disable skill learning worker.
  • Disable skill publish.
  • Fall back to chat-only mode for affected workflow.
  • Remove public route to affected instance.
  • Restore instance data from backup.

Rollback triggers:

  • P0 incident.
  • Repeated instance creation failure.
  • Repeated task runtime failure blocking pilot work.
  • Provider config issue affecting most users.
  • Connector side-effect risk.
  • UI issue blocking first accepted task.

12. Launch Communication

Internal

Beaver is launching as a controlled Agent execution pilot. The launch goal is not maximum feature breadth. The goal is to prove repeatable AI-assisted work with task acceptance, evidence, and reuse.

Pilot Users

Use Beaver for selected workflows where you need a concrete output. Review each result. Accept it if usable, request revision if it is close, or abandon it if it is not worth continuing. Your feedback is the signal that helps Beaver improve and reuse work.

Admins

Treat Beaver as an app platform with a control plane and per-instance runtime. Keep deploy-control and authz private. Monitor instance health, provider config, tool behavior, and connector side effects.

13. Known Limitations To Disclose

  • Memory is not yet fully productized with user controls.
  • Connector maturity varies by provider.
  • The first pilot should use a narrow set of workflows.
  • Some operations may still require engineering support.
  • Skill learning needs human review before publish.
  • Multi-user organization features are not the first pilot focus.

14. Go / No-Go Criteria

Go if:

  • Fresh deployment works.
  • First accepted task flow works.
  • Evidence timeline is readable enough for pilot.
  • Tool and connector policy is documented.
  • Support owner is assigned.
  • No critical security issue is open.

No-go if:

  • Control-plane exposure risk is unresolved.
  • Cross-instance isolation is unverified.
  • Provider onboarding fails for most users.
  • Task runtime is unreliable.
  • Pilot workflow is not defined.
  • No one owns incidents or support.
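
The two lists above combine into a single decision: every "go" criterion must hold and no "no-go" condition may be present. A minimal sketch of that rule follows; the criterion keys are hypothetical labels derived from the bullets, and the function is not part of Beaver.

```python
GO_CRITERIA = (
    "fresh_deployment_works",
    "first_accepted_task_flow_works",
    "evidence_timeline_readable",
    "tool_connector_policy_documented",
    "support_owner_assigned",
    "no_open_critical_security_issue",
)

NO_GO_CONDITIONS = (
    "control_plane_exposure_unresolved",
    "cross_instance_isolation_unverified",
    "provider_onboarding_fails_for_most_users",
    "task_runtime_unreliable",
    "pilot_workflow_undefined",
    "no_incident_or_support_owner",
)


def go_no_go(go_met: set[str], no_go_present: set[str]) -> tuple[str, list[str]]:
    """Return ('go', []) or ('no-go', reasons) from the review's findings.

    A go criterion that was not explicitly confirmed counts as unmet.
    """
    reasons = [f"unmet: {c}" for c in GO_CRITERIA if c not in go_met]
    reasons += [f"blocker: {c}" for c in NO_GO_CONDITIONS if c in no_go_present]
    return ("no-go" if reasons else "go", reasons)
```

Treating unconfirmed criteria as unmet keeps the default conservative: the launch owner has to affirm each item rather than rely on silence.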