# Beaver Validation Plan

Date: 2026-06-09

Purpose: validate Beaver as a whole product before broader rollout.

## 1. Validation Strategy

Beaver should be validated through real workflows, not through opinions about AI.

The validation sequence:

```text
customer problem
-> workflow fit
-> first-run onboarding
-> task execution
-> evidence comprehension
-> acceptance/revision
-> skill reuse
-> deployment and operations
-> security/governance
```

## 2. Validation Questions

### Product Value

- Does Beaver solve a painful enough workflow problem?
- Does task acceptance make AI work feel more reliable?
- Do users complete more usable work than with chat-only AI?
- Does skill reuse save time after repeated workflows?

### Usability

- Can users understand when chat becomes a task?
- Can users find task evidence and artifacts?
- Can users accept, revise, or abandon without confusion?
- Can admins configure providers and connectors without engineering help?

### Technical Feasibility

- Can fresh deployments be created repeatably?
- Can app instances stay isolated?
- Can Agent tasks run reliably with files, tools, skills, and scheduled jobs?
- Can failures be diagnosed from status/logs/events?

### Governance And Security

- Are control-plane services private?
- Are file and workspace boundaries enforced?
- Are tool calls recorded and reviewable?
- Are external connector writes controlled?
- Is memory inspectable and controllable before broad use?

### Business Viability

- Does a pilot team have enough recurring workflows?
- Can the product produce measurable weekly value?
- Can an admin operate it with acceptable support load?
- Can the buyer justify expansion?

## 3. Pilot Workflow Candidates

| Workflow | Why It Fits | Required Capabilities | Success Signal |
| --- | --- | --- | --- |
| Weekly project report | Recurring, evidence-sensitive, review-heavy | scheduled work, files, task acceptance, artifacts | Report accepted weekly |
| Project brief / proposal | Multi-step, document-heavy, revision-heavy | chat, files, tools, task timeline, revisions | Brief accepted after fewer rounds |
| Document review | Clear deliverable and evidence need | files, task timeline, artifacts, acceptance | Review output accepted |
| Support triage | Tool/context-heavy and repeatable | tasks, tools, memory, optionally a connector | Triage summary accepted |
| Research synthesis | Agent team fit, artifact-heavy | multi-agent, web/search, files, evidence | Synthesis accepted and reused |

Recommended first pilot:

1. Project brief or document review for the manual task loop.
2. Weekly project report for the scheduled workflow.
3. Skill reuse from the accepted outputs.

## 4. Customer Discovery Validation

### Participants

- 5 end users.
- 3 workflow owners.
- 3 admins/platform owners.
- 2 security reviewers.
- 2 operators/engineers.

### Method

- 45-minute interviews using past-behavior questions.
- 60-minute workflow walkthrough with Beaver.
- Follow-up after one week of usage.

### Evidence To Collect

- Current workflow steps.
- Time spent today.
- Existing tools/files/systems involved.
- Review/approval requirements.
- Trust blockers.
- Repeat frequency.
- What would count as a successful pilot.

### Pass Criteria

- At least 3 workflows are repeated weekly or more.
- At least 2 workflows involve files or external tools.
- At least 2 stakeholders require evidence/auditability.
- At least 1 team lead agrees to a real pilot workflow.

## 5. Product Workflow Validation

### Test 1: First Accepted Task

Goal: user reaches first accepted task.

Steps:

1. Register or log in.
2. Configure provider.
3. Start from a suggested workflow or freeform chat.
4. Upload or reference a file if needed.
5. Let Beaver create/continue a task.
6. Inspect output and evidence.
7. Accept or request revision.

Pass criteria:

- User completes without developer assistance.
- First accepted task occurs in one session.
- User can explain what Beaver did.

### Test 2: Revision Loop

Goal: prove Beaver handles "not good enough yet."

Steps:

1. Run a task.
2. Ask for a specific revision.
3. Confirm the same task context continues.
4. Accept revised output.

Pass criteria:

- Revision feedback is preserved.
- Task timeline shows revision.
- User does not need to restate full context.

### Test 3: Evidence Review

Goal: verify trust and auditability.

Steps:

1. Give a reviewer a completed task detail page.
2. Ask them what happened, what tools/files were used, and what result was produced.
3. Ask whether they would approve the output.

Pass criteria:

- At least 80% of reviewers identify the key actions and artifacts.
- Reviewers can state at least one risk or confidence reason.

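
The two pass criteria for Test 3 can be scored mechanically once each review session is recorded. The sketch below is one possible way to do that in Python. The `ReviewerResult` record and its field names are hypothetical, and reading the second criterion as "every reviewer states a reason" is an assumption, since the plan does not say how many reviewers must do so.

```python
from dataclasses import dataclass


@dataclass
class ReviewerResult:
    """One reviewer's outcome for a completed task detail page (hypothetical record)."""
    reviewer: str
    identified_key_actions: bool
    identified_artifacts: bool
    stated_risk_or_confidence: bool


def evidence_review_passes(results: list[ReviewerResult], threshold: float = 0.80) -> bool:
    """Return True when both Test 3 pass criteria hold for the recorded sessions."""
    if not results:
        return False  # no reviewers means nothing was validated
    # Criterion 1: share of reviewers who identified both key actions and artifacts.
    identified = sum(r.identified_key_actions and r.identified_artifacts for r in results)
    # Criterion 2 (assumed reading): every reviewer stated a risk or confidence reason.
    all_stated = all(r.stated_risk_or_confidence for r in results)
    return identified / len(results) >= threshold and all_stated
```

With five reviewers, four must identify both the key actions and the artifacts for the session to pass.
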
### Test 4: Skill Reuse

Goal: prove accepted work can compound.

Steps:

1. Accept a task.
2. Generate a skill candidate/draft.
3. Review and publish the skill.
4. Run a similar task.
5. Check whether the skill activates and improves the work.

Pass criteria:

- At least 3 pilot skills are reused twice.
- Users report lower effort on the repeated task.

### Test 5: Scheduled Workflow

Goal: validate proactive work.

Steps:

1. Create scheduled job.
2. Trigger or wait for scheduled run.
3. Review notification/output.
4. Accept or revise.

Pass criteria:

- Scheduled run is visible.
- Output enters review flow.
- Failed run has clear recovery path.

## 6. Technical Validation

### Deployment Validation

Run on a fresh Linux/WSL2 host:

1. Build images.
2. Create Docker network.
3. Start router proxy.
4. Start authz service.
5. Start deploy control.
6. Start auth portal.
7. Register user.
8. Configure provider.
9. Open app instance.
10. Complete first task.

Pass criteria:

- Deployment completes in under 2 hours using docs only.
- No undocumented environment variables.
- Public exposure limited to auth portal and router proxy.

### Instance Isolation Validation

Checks:

- Instance A cannot access Instance B workspace.
- User file roots stay scoped.
- Router routes each host to the correct container.
- Provider config is instance-specific.
- Deleting one instance does not affect another.

Pass criteria:

- No cross-instance reads/writes.
- Registry state remains consistent.

### Runtime Validation

Checks:

- Chat API.
- WebSocket/runtime status.
- Task creation and deletion.
- Task detail events.
- File upload/preview/download/delete.
- Tool test.
- Skill candidate/draft/review/publish.
- Cron create/toggle/run/delete.
- Settings provider save.
- Runtime restart.

Pass criteria:

- Critical user flows pass on desktop and mobile viewport.
- Failure states have visible recovery.

## 7. Security And Governance Validation

### Control Plane

- Confirm `deploy-control` and `authz-service` are not publicly reachable.
- Confirm tokens are required for control-plane calls.
- Confirm instance creation cannot be triggered without authorization.

### Files

- Confirm only allowed user roots are visible.
- Confirm absolute-style or cross-prefix paths are rejected.
- Confirm delete operations require explicit user action.

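
To make the path checks above concrete, here is a minimal Python sketch of the kind of validation a tester might expect the file API to enforce. It is not Beaver's implementation. The root names in `ALLOWED_ROOTS` and the function `is_allowed_user_path` are hypothetical, and the rule set is deliberately conservative (for example, it rejects any path containing a colon).

```python
from pathlib import PurePosixPath

# Hypothetical allowed roots; the plan does not name the real ones.
ALLOWED_ROOTS = ("uploads", "workspace")


def is_allowed_user_path(raw: str, allowed_roots: tuple[str, ...] = ALLOWED_ROOTS) -> bool:
    """Accept only relative paths that stay inside one of the allowed roots."""
    # Reject absolute-style input: POSIX roots, home shortcuts, Windows drives, backslashes.
    if not raw or raw.startswith(("/", "~")) or ":" in raw or "\\" in raw:
        return False
    parts = PurePosixPath(raw).parts
    # Reject traversal segments, which could escape the root.
    if ".." in parts:
        return False
    # Compare whole segments so "uploads_evil/x" does not match the root "uploads".
    return len(parts) > 0 and parts[0] in allowed_roots
```

Comparing whole path segments rather than string prefixes is what catches the cross-prefix case: `uploads_evil/x.txt` starts with the string `uploads` but is not under the `uploads` root.
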
### Tools

- Classify pilot tools as read, workspace write, external write, destructive, or credential/permission.
- Record tool calls in task evidence.
- Block or require review for dangerous actions.

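
The three checks above can be pictured as a small policy gate. The Python sketch below is an illustration under stated assumptions: the five class names come from the list above, but the `POLICY` mapping (which classes are allowed, reviewed, or blocked), the `gate_tool_call` function, and the evidence record shape are all hypothetical.

```python
from enum import Enum


class ToolClass(Enum):
    READ = "read"
    WORKSPACE_WRITE = "workspace write"
    EXTERNAL_WRITE = "external write"
    DESTRUCTIVE = "destructive"
    CREDENTIAL_PERMISSION = "credential/permission"


class Decision(Enum):
    ALLOW = "allow"
    REQUIRE_REVIEW = "require review"
    BLOCK = "block"


# Hypothetical pilot policy; which classes count as dangerous is an assumption.
POLICY = {
    ToolClass.READ: Decision.ALLOW,
    ToolClass.WORKSPACE_WRITE: Decision.ALLOW,
    ToolClass.EXTERNAL_WRITE: Decision.REQUIRE_REVIEW,
    ToolClass.DESTRUCTIVE: Decision.BLOCK,
    ToolClass.CREDENTIAL_PERMISSION: Decision.BLOCK,
}


def gate_tool_call(tool_name: str, tool_class: ToolClass, evidence_log: list[dict]) -> Decision:
    """Record the call in task evidence, then return the policy decision."""
    decision = POLICY[tool_class]
    # Every call is logged, including blocked ones, so reviewers can see attempts.
    evidence_log.append(
        {"tool": tool_name, "class": tool_class.value, "decision": decision.value}
    )
    return decision
```

Logging before returning means blocked and review-gated calls still appear in the evidence trail, which is what the "recorded and reviewable" check would look for.
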
### Connectors

- Use sandbox/test accounts for pilot when possible.
- Confirm callback base URL is per-instance.
- Confirm disconnect/reconnect path.

### Memory

Until Memory Control Center exists:

- Keep memory use conservative.
- Document what is stored.
- Avoid enabling opaque long-term memory for sensitive pilots.

## 8. Usability Validation

Viewports:

- 320px.
- 375px.
- 390px.
- 768px.
- 1024px.
- 1365px.
- One mobile landscape viewport.

Screens:

- Auth portal login/register/provider onboarding.
- Chat workbench.
- Task list/detail.
- Files.
- Skills.
- Marketplace.
- Tools.
- Notifications/cron.
- Outlook/connectors if in pilot.
- Settings/status/logs.

Pass criteria:

- No horizontal overflow.
- No inaccessible critical controls.
- Touch targets are usable.
- Loading, empty, error, success, and disabled states are visible.

## 9. Metrics Validation

Instrument or manually collect:

- Time to first accepted task.
- Accepted tasks per user/team/week.
- Acceptance rate.
- Revision rate.
- Task run failure rate.
- Evidence coverage.
- Skill candidates.
- Skill drafts.
- Published skills.
- Skill reuse.
- Scheduled run success.
- Provider setup failure.
- Instance creation failure.
- Connector setup failure.

Minimum pilot dashboard:

```text
Accepted tasks
Acceptance rate
Revision rate
Task failures
Skill reuse
Scheduled runs
Deployment/provider errors
Critical incidents
```

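
If the metrics are collected manually, a few of the dashboard rows can be derived from a flat export of task records. The Python sketch below shows one way to do that. The record fields (`status`, `revisions`), the status values, and the choice to divide each rate by the total number of tasks are assumptions for illustration; the plan does not define the denominators.

```python
from collections import Counter


def pilot_metrics(tasks: list[dict]) -> dict[str, float]:
    """Summarize task records into the task-level dashboard rows.

    Assumes each record has a "status" string and a "revisions" count;
    both field names are hypothetical.
    """
    total = len(tasks)
    if total == 0:
        return {"accepted_tasks": 0, "acceptance_rate": 0.0,
                "revision_rate": 0.0, "failure_rate": 0.0}
    statuses = Counter(task["status"] for task in tasks)
    # A task counts as revised if it went through at least one revision round.
    revised = sum(1 for task in tasks if task["revisions"] > 0)
    return {
        "accepted_tasks": statuses["accepted"],
        "acceptance_rate": statuses["accepted"] / total,
        "revision_rate": revised / total,
        "failure_rate": statuses["failed"] / total,
    }


# Illustrative records only, not pilot data.
sample = [
    {"status": "accepted", "revisions": 0},
    {"status": "accepted", "revisions": 2},
    {"status": "failed", "revisions": 0},
    {"status": "abandoned", "revisions": 0},
]
summary = pilot_metrics(sample)
```

Whatever definitions the pilot settles on should be written down before data collection starts, so that weekly numbers stay comparable.
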
## 10. Pilot Exit Criteria

Proceed to broader rollout only if:

- A pilot team completes >=30 accepted tasks in 30 days.
- At least 2 recurring workflows are active.
- At least 5 skills are created and at least 3 are reused twice.
- Task acceptance rate is >=60%.
- No critical security or deployment incidents occur.
- Fresh deployment can be repeated from docs.
- Admin can diagnose common failures from status/logs/runbook.
- Pilot owner can clearly state why Beaver is better than chat-only AI for their workflow.

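
The numeric exit criteria can be checked in a single pass at the end of the pilot. The Python sketch below covers only the quantitative items; the last three criteria (repeatable deployment, admin diagnosis, pilot owner statement) still need human sign-off. `PilotSummary` and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PilotSummary:
    """End-of-pilot counts (hypothetical shape, filled in from the pilot dashboard)."""
    accepted_tasks_30d: int
    active_recurring_workflows: int
    skills_created: int
    skills_reused_twice: int
    acceptance_rate: float  # fraction between 0 and 1
    critical_incidents: int


def unmet_exit_criteria(s: PilotSummary) -> list[str]:
    """Return the quantitative exit criteria that the pilot has not met."""
    checks = {
        ">=30 accepted tasks in 30 days": s.accepted_tasks_30d >= 30,
        "at least 2 recurring workflows active": s.active_recurring_workflows >= 2,
        "at least 5 skills created": s.skills_created >= 5,
        "at least 3 skills reused twice": s.skills_reused_twice >= 3,
        "task acceptance rate >=60%": s.acceptance_rate >= 0.60,
        "no critical security or deployment incidents": s.critical_incidents == 0,
    }
    return [name for name, ok in checks.items() if not ok]
```

Returning the list of unmet criteria, rather than a single pass/fail flag, gives the rollout review a direct list of what blocked expansion.
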
## 11. Decision Matrix

| Result | Decision |
| --- | --- |
| High task acceptance, low skill reuse | Improve skill learning and workflow templates |
| High interest, deployment friction | Invest in deploy runbook and health console |
| Good demos, low recurring use | Revisit target segment and workflow selection |
| High usage, trust concerns | Prioritize evidence narrative, policy, memory controls |
| Connector demand dominates | Narrow connector roadmap to one high-value system |
| Memory concerns dominate | Build Memory Control Center before expansion |