Beaver Validation Plan
Date: 2026-06-09
Purpose: validate Beaver as a whole product before broader rollout.
1. Validation Strategy
Beaver should be validated through real workflows, not through opinions about AI.
The validation sequence:
customer problem
-> workflow fit
-> first-run onboarding
-> task execution
-> evidence comprehension
-> acceptance/revision
-> skill reuse
-> deployment and operations
-> security/governance
2. Validation Questions
Product Value
- Does Beaver solve a painful enough workflow problem?
- Does task acceptance make AI work feel more reliable?
- Do users complete more usable work than with chat-only AI?
- Does skill reuse save time after repeated workflows?
Usability
- Can users understand when chat becomes a task?
- Can users find task evidence and artifacts?
- Can users accept, revise, or abandon without confusion?
- Can admins configure providers and connectors without engineering help?
Technical Feasibility
- Can fresh deployments be created repeatably?
- Can app instances stay isolated?
- Can Agent tasks run reliably with files, tools, skills, and scheduled jobs?
- Can failures be diagnosed from status/logs/events?
Governance And Security
- Are control-plane services private?
- Are file and workspace boundaries enforced?
- Are tool calls recorded and reviewable?
- Are external connector writes controlled?
- Is memory inspectable and controllable before broad use?
Business Viability
- Does a pilot team have enough recurring workflows?
- Can the product produce measurable weekly value?
- Can an admin operate it with acceptable support load?
- Can the buyer justify expansion?
3. Pilot Workflow Candidates
| Workflow | Why It Fits | Required Capabilities | Success Signal |
|---|---|---|---|
| Weekly project report | Recurring, evidence-sensitive, review-heavy | scheduled work, files, task acceptance, artifacts | Report accepted weekly |
| Project brief / proposal | Multi-step, document-heavy, revision-heavy | chat, files, tools, task timeline, revisions | Brief accepted after fewer rounds |
| Document review | Clear deliverable and evidence need | files, task timeline, artifacts, acceptance | Review output accepted |
| Support triage | Tool/context-heavy and repeatable | tasks, tools, memory, optionally a connector | Triage summary accepted |
| Research synthesis | Agent team fit, artifact-heavy | multi-agent, web/search, files, evidence | Synthesis accepted and reused |
Recommended first pilot:
- Project brief or document review for manual task loop.
- Weekly project report for scheduled workflow.
- Skill reuse from the accepted outputs.
4. Customer Discovery Validation
Participants
- 5 end users.
- 3 workflow owners.
- 3 admins/platform owners.
- 2 security reviewers.
- 2 operators/engineers.
Method
- 45-minute interviews using past-behavior questions.
- 60-minute workflow walkthrough with Beaver.
- Follow-up after one week of usage.
Evidence To Collect
- Current workflow steps.
- Time spent today.
- Existing tools/files/systems involved.
- Review/approval requirements.
- Trust blockers.
- Repeat frequency.
- What would count as a successful pilot.
Pass Criteria
- At least 3 workflows are repeated weekly or more.
- At least 2 workflows involve files or external tools.
- At least 2 stakeholders require evidence/auditability.
- At least 1 team lead agrees to a real pilot workflow.
5. Product Workflow Validation
Test 1: First Accepted Task
Goal: user reaches first accepted task.
Steps:
- Register or log in.
- Configure provider.
- Start from a suggested workflow or freeform chat.
- Upload or reference a file if needed.
- Let Beaver create/continue a task.
- Inspect output and evidence.
- Accept or request revision.
Pass criteria:
- User completes without developer assistance.
- First accepted task occurs in one session.
- User can explain what Beaver did.
Test 2: Revision Loop
Goal: prove Beaver handles "not good enough yet."
Steps:
- Run a task.
- Ask for a specific revision.
- Confirm the same task context continues.
- Accept revised output.
Pass criteria:
- Revision feedback is preserved.
- Task timeline shows revision.
- User does not need to restate full context.
Test 3: Evidence Review
Goal: verify trust and auditability.
Steps:
- Give reviewer a completed task detail page.
- Ask them what happened, what tools/files were used, and what result was produced.
- Ask whether they would approve the output.
Pass criteria:
- >=80% of reviewers identify the key actions and artifacts.
- Reviewers can state at least one risk or confidence reason.
Test 4: Skill Reuse
Goal: prove accepted work can compound.
Steps:
- Accept a task.
- Generate skill candidate/draft.
- Review and publish skill.
- Run a similar task.
- Check whether skill activates and improves work.
Pass criteria:
- At least 3 pilot skills are reused twice.
- Users report lower effort on repeated task.
Test 5: Scheduled Workflow
Goal: validate proactive work.
Steps:
- Create scheduled job.
- Trigger or wait for scheduled run.
- Review notification/output.
- Accept or revise.
Pass criteria:
- Scheduled run is visible.
- Output enters review flow.
- Failed run has clear recovery path.
6. Technical Validation
Deployment Validation
Run on a fresh Linux/WSL2 host:
- Build images.
- Create Docker network.
- Start router proxy.
- Start authz service.
- Start deploy control.
- Start auth portal.
- Register user.
- Configure provider.
- Open app instance.
- Complete first task.
Pass criteria:
- Under 2 hours with docs only.
- No undocumented environment variables.
- Public exposure limited to auth portal and router proxy.
Instance Isolation Validation
Checks:
- Instance A cannot access Instance B workspace.
- User file roots stay scoped.
- Router sends each host to the correct container.
- Provider config is instance-specific.
- Deleting one instance does not affect another.
Pass criteria:
- No cross-instance reads/writes.
- Registry state remains consistent.
Runtime Validation
Checks:
- Chat API.
- WebSocket/runtime status.
- Task creation and deletion.
- Task detail events.
- File upload/preview/download/delete.
- Tool test.
- Skill candidate/draft/review/publish.
- Cron create/toggle/run/delete.
- Settings provider save.
- Runtime restart.
Pass criteria:
- Critical user flows pass on desktop and mobile viewport.
- Failure states have visible recovery.
7. Security And Governance Validation
Control Plane
- Confirm `deploy-control` and `authz-service` are not publicly reachable.
- Confirm tokens are required for control-plane calls.
- Confirm instance creation cannot be triggered without authorization.
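The public-exposure check can be scripted once the published ports of each service are known. The sketch below is a minimal illustration, not Beaver's actual tooling: the service names `auth-portal` and `router-proxy`, the function name, and the input mapping (service name to published host ports, gathered however the operator prefers) are all assumptions.

```python
# Assumption: only these two services may publish ports on the host.
# The exact container names are hypothetical.
PUBLIC_SERVICES = {"auth-portal", "router-proxy"}


def unexpected_public_services(published_ports: dict) -> list:
    """Return the names of services that publish host ports but should be private.

    `published_ports` maps a service name to the list of host ports it publishes.
    An empty list means the service is reachable only on the internal network.
    """
    return sorted(
        name
        for name, ports in published_ports.items()
        if ports and name not in PUBLIC_SERVICES
    )


# Example inventory (hypothetical values).
inventory = {
    "auth-portal": [443],
    "router-proxy": [80, 443],
    "deploy-control": [],
    "authz-service": [8081],  # misconfigured: should not publish a port
}
leaks = unexpected_public_services(inventory)
```

An empty result means the control-plane services stay private. Any name returned is a finding to fix before the pilot.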
Files
- Confirm only allowed user roots are visible.
- Confirm absolute-style or cross-prefix paths are rejected.
- Confirm delete operations require explicit user action.
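The path checks above can be exercised with a small reference validator. This is a sketch under stated assumptions: the root layout in `ALLOWED_ROOTS` and the function name `is_path_allowed` are hypothetical, and Beaver's real enforcement may differ. It shows one way to reject absolute-style paths, traversal segments, and prefixes that only look like an allowed root.

```python
from pathlib import PurePosixPath

# Hypothetical allowed roots; the real layout is not specified in this plan.
ALLOWED_ROOTS = ("users/alice/files", "users/alice/artifacts")


def is_path_allowed(requested: str, allowed_roots=ALLOWED_ROOTS) -> bool:
    """Return True only if `requested` is a relative path inside an allowed root."""
    # Reject empty input and absolute-style paths, including drive letters
    # and backslash-separated forms.
    if not requested or requested.startswith("/"):
        return False
    if ":" in requested or "\\" in requested:
        return False

    parts = PurePosixPath(requested).parts
    # Reject traversal segments outright instead of normalizing them away.
    if ".." in parts:
        return False

    for root in allowed_roots:
        root_parts = PurePosixPath(root).parts
        # Compare whole segments so "users/alice/files2" does not match
        # the root "users/alice/files" (the cross-prefix case).
        if parts[: len(root_parts)] == root_parts:
            return True
    return False
```

Comparing path segments rather than raw string prefixes is the key design choice here, because a plain `startswith` check would accept `users/alice/files2/...`.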
Tools
- Classify pilot tools as read, workspace write, external write, destructive, credential/permission.
- Record tool calls in task evidence.
- Block or require review for dangerous actions.
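One way to make the tool classification testable is a small policy table keyed by class. The sketch below is illustrative only: the five class names come from the list above, but the `Decision` values, the policy mapping, and the `gate_tool_call` function are assumptions about how a pilot might gate calls and record them as evidence.

```python
from enum import Enum


class ToolClass(Enum):
    READ = "read"
    WORKSPACE_WRITE = "workspace write"
    EXTERNAL_WRITE = "external write"
    DESTRUCTIVE = "destructive"
    CREDENTIAL = "credential/permission"


class Decision(Enum):
    ALLOW = "allow"
    REVIEW = "require review"
    BLOCK = "block"


# Assumed pilot policy; the plan only says dangerous actions are blocked or reviewed.
POLICY = {
    ToolClass.READ: Decision.ALLOW,
    ToolClass.WORKSPACE_WRITE: Decision.ALLOW,
    ToolClass.EXTERNAL_WRITE: Decision.REVIEW,
    ToolClass.DESTRUCTIVE: Decision.BLOCK,
    ToolClass.CREDENTIAL: Decision.BLOCK,
}


def gate_tool_call(tool_name: str, tool_class: ToolClass, evidence_log: list) -> Decision:
    """Decide whether a tool call may run and record the decision as task evidence."""
    # Anything missing from the policy table is blocked by default.
    decision = POLICY.get(tool_class, Decision.BLOCK)
    evidence_log.append(
        {"tool": tool_name, "class": tool_class.value, "decision": decision.value}
    )
    return decision


# Usage with hypothetical tool names.
evidence = []
first = gate_tool_call("read_file", ToolClass.READ, evidence)
second = gate_tool_call("send_email", ToolClass.EXTERNAL_WRITE, evidence)
```

Recording the decision even for blocked calls keeps the evidence trail complete, which is what the reviewability check depends on.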
Connectors
- Use sandbox/test accounts for pilot when possible.
- Confirm callback base URL is per-instance.
- Confirm disconnect/reconnect path.
Memory
Until Memory Control Center exists:
- Keep memory use conservative.
- Document what is stored.
- Avoid enabling opaque long-term memory for sensitive pilots.
8. Usability Validation
Viewports:
- 320px.
- 375px.
- 390px.
- 768px.
- 1024px.
- 1365px.
- One mobile landscape viewport.
Screens:
- Auth portal login/register/provider onboarding.
- Chat workbench.
- Task list/detail.
- Files.
- Skills.
- Marketplace.
- Tools.
- Notifications/cron.
- Outlook/connectors if in pilot.
- Settings/status/logs.
Pass criteria:
- No horizontal overflow.
- No inaccessible critical controls.
- Touch targets are usable.
- Loading, empty, error, success, and disabled states are visible.
9. Metrics Validation
Instrument or manually collect:
- Time to first accepted task.
- Accepted tasks per user/team/week.
- Acceptance rate.
- Revision rate.
- Task run failure rate.
- Evidence coverage.
- Skill candidates.
- Skill drafts.
- Published skills.
- Skill reuse.
- Scheduled run success.
- Provider setup failure.
- Instance creation failure.
- Connector setup failure.
Minimum pilot dashboard:
Accepted tasks
Acceptance rate
Revision rate
Task failures
Skill reuse
Scheduled runs
Deployment/provider errors
Critical incidents
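If the metrics are collected manually, the rate calculations can be kept in a short script. The sketch below assumes definitions that this plan does not state: every rate uses all recorded tasks as the denominator, and a task counts toward the revision rate if it was revised at least once. The `TaskRecord` fields and `pilot_rates` name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    accepted: bool   # the user accepted the final output
    revisions: int   # number of revision requests on this task
    failed: bool     # the task run failed


def pilot_rates(tasks: list) -> dict:
    """Compute acceptance, revision, and failure rates over all recorded tasks."""
    total = len(tasks)
    if total == 0:
        # Avoid division by zero before any tasks have been recorded.
        return {"acceptance_rate": 0.0, "revision_rate": 0.0, "failure_rate": 0.0}
    return {
        "acceptance_rate": sum(t.accepted for t in tasks) / total,
        "revision_rate": sum(t.revisions > 0 for t in tasks) / total,
        "failure_rate": sum(t.failed for t in tasks) / total,
    }


# Hypothetical sample data.
sample = [
    TaskRecord(accepted=True, revisions=0, failed=False),
    TaskRecord(accepted=True, revisions=1, failed=False),
    TaskRecord(accepted=False, revisions=2, failed=False),
    TaskRecord(accepted=False, revisions=0, failed=True),
]
rates = pilot_rates(sample)
```

Whatever definitions the pilot settles on, fixing the denominators before data collection starts keeps week-over-week numbers comparable.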
10. Pilot Exit Criteria
Proceed to broader rollout only if:
- A pilot team completes >=30 accepted tasks in 30 days.
- At least 2 recurring workflows are active.
- At least 5 skills are created and at least 3 are reused twice.
- Task acceptance rate is >=60%.
- No critical security or deployment incidents occur.
- Fresh deployment can be repeated from docs.
- Admin can diagnose common failures from status/logs/runbook.
- Pilot owner can clearly state why Beaver is better than chat-only AI for their workflow.
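The exit criteria above are concrete enough to encode as a checklist. The following is a minimal sketch, assuming a hypothetical `PilotSummary` record filled in by hand or from the pilot dashboard. The thresholds are taken from the list above, and the three qualitative criteria are represented as booleans that a reviewer sets.

```python
from dataclasses import dataclass


@dataclass
class PilotSummary:
    accepted_tasks_30d: int
    active_recurring_workflows: int
    skills_created: int
    skills_reused_twice: int
    acceptance_rate: float          # fraction between 0.0 and 1.0
    critical_incidents: int
    fresh_deploy_repeatable: bool   # reviewer judgment
    admin_can_diagnose: bool        # reviewer judgment
    owner_states_advantage: bool    # reviewer judgment


def unmet_exit_criteria(s: PilotSummary) -> list:
    """Return the exit criteria that are not yet met, in plan order."""
    checks = {
        ">=30 accepted tasks in 30 days": s.accepted_tasks_30d >= 30,
        ">=2 recurring workflows active": s.active_recurring_workflows >= 2,
        ">=5 skills created": s.skills_created >= 5,
        ">=3 skills reused twice": s.skills_reused_twice >= 3,
        "acceptance rate >=60%": s.acceptance_rate >= 0.60,
        "no critical incidents": s.critical_incidents == 0,
        "fresh deployment repeatable from docs": s.fresh_deploy_repeatable,
        "admin can diagnose common failures": s.admin_can_diagnose,
        "owner can state advantage over chat-only AI": s.owner_states_advantage,
    }
    return [name for name, ok in checks.items() if not ok]


# Hypothetical pilot result that misses two thresholds.
summary = PilotSummary(34, 2, 6, 2, 0.55, 0, True, True, True)
gaps = unmet_exit_criteria(summary)
```

An empty list means the pilot can proceed to broader rollout. Otherwise the returned names point at the rows of the decision matrix worth reviewing.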
11. Decision Matrix
| Result | Decision |
|---|---|
| High task acceptance, low skill reuse | Improve skill learning and workflow templates |
| High interest, deployment friction | Invest in deploy runbook and health console |
| Good demos, low recurring use | Revisit target segment and workflow selection |
| High usage, trust concerns | Prioritize evidence narrative, policy, memory controls |
| Connector demand dominates | Narrow connector roadmap to one high-value system |
| Memory concerns dominate | Build Memory Control Center before expansion |