
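steven_li fc9fd93c36 feat: support multilingual prompt localization and UI improvements

- Add a prompt_locale parameter to support prompt localization in Simplified Chinese, Traditional Chinese, and English
- Remove the built-in agents configuration to simplify the system architecture
- Update ContextBuilder to use dynamic prompt templates instead of hard-coded content
- Pass the locale parameter through AgentLoop, the Web interface, and AgentService
- Add an output-language instruction so that user-facing content is generated in the specified language
- Extend the frontend LanguageSwitcher component to support three language options
- Improve responsive layout and text truncation in the Header and sidebar components
- Update test cases to verify prompt correctness across locales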

2026-06-10 16:11:05 +08:00


Beaver Validation Plan

Date: 2026-06-09

Purpose: validate Beaver as a whole product before broader rollout.

1. Validation Strategy

Beaver should be validated through real workflows, not through opinions about AI.

The validation sequence:

customer problem
  -> workflow fit
  -> first-run onboarding
  -> task execution
  -> evidence comprehension
  -> acceptance/revision
  -> skill reuse
  -> deployment and operations
  -> security/governance

2. Validation Questions

Product Value

Does Beaver solve a painful enough workflow problem?
Does task acceptance make AI work feel more reliable?
Do users complete more usable work than with chat-only AI?
Does skill reuse save time after repeated workflows?

Usability

Can users understand when chat becomes a task?
Can users find task evidence and artifacts?
Can users accept, revise, or abandon without confusion?
Can admins configure providers and connectors without engineering help?

Technical Feasibility

Can fresh deployments be created repeatably?
Can app instances stay isolated?
Can Agent tasks run reliably with files, tools, skills, and scheduled jobs?
Can failures be diagnosed from status/logs/events?

Governance And Security

Are control-plane services private?
Are file and workspace boundaries enforced?
Are tool calls recorded and reviewable?
Are external connector writes controlled?
Is memory inspectable and controllable before broad use?

Business Viability

Does a pilot team have enough recurring workflows?
Can the product produce measurable weekly value?
Can an admin operate it with acceptable support load?
Can the buyer justify expansion?

3. Pilot Workflow Candidates

Workflow	Why It Fits	Required Capabilities	Success Signal
Weekly project report	Recurring, evidence-sensitive, review-heavy	scheduled work, files, task acceptance, artifacts	Report accepted weekly
Project brief / proposal	Multi-step, document-heavy, revision-heavy	chat, files, tools, task timeline, revisions	Brief accepted after fewer rounds
Document review	Clear deliverable and evidence need	files, task timeline, artifacts, acceptance	Review output accepted
Support triage	Tool/context-heavy and repeatable	tasks, tools, memory, optionally a connector	Triage summary accepted
Research synthesis	Agent team fit, artifact-heavy	multi-agent, web/search, files, evidence	Synthesis accepted and reused

Recommended first pilot:

Project brief or document review for the manual task loop.
Weekly project report for the scheduled workflow.
Skill reuse built from the accepted outputs of both.

4. Customer Discovery Validation

Participants

5 end users.
3 workflow owners.
3 admins/platform owners.
2 security reviewers.
2 operators/engineers.

Method

45-minute interviews using past-behavior questions.
60-minute workflow walkthrough with Beaver.
Follow-up after one week of usage.

Evidence To Collect

Current workflow steps.
Time spent today.
Existing tools/files/systems involved.
Review/approval requirements.
Trust blockers.
Repeat frequency.
What would count as a successful pilot.

Pass Criteria

At least 3 workflows are repeated weekly or more.
At least 2 workflows involve files or external tools.
At least 2 stakeholders require evidence/auditability.
At least 1 team lead agrees to a real pilot workflow.

5. Product Workflow Validation

Test 1: First Accepted Task

Goal: user reaches first accepted task.

Steps:

Register or log in.
Configure provider.
Start from a suggested workflow or freeform chat.
Upload or reference a file if needed.
Let Beaver create/continue a task.
Inspect output and evidence.
Accept or request revision.

Pass criteria:

User completes without developer assistance.
First accepted task occurs in one session.
User can explain what Beaver did.

Test 2: Revision Loop

Goal: prove Beaver handles "not good enough yet."

Steps:

Run a task.
Ask for a specific revision.
Confirm the same task context continues.
Accept revised output.

Pass criteria:

Revision feedback is preserved.
Task timeline shows revision.
User does not need to restate full context.

Test 3: Evidence Review

Goal: verify trust and auditability.

Steps:

Give reviewer a completed task detail page.
Ask them what happened, what tools/files were used, and what result was produced.
Ask whether they would approve the output.

Pass criteria:

>=80% of reviewers identify the key actions and artifacts.
Reviewers can state at least one risk or confidence reason.

Test 4: Skill Reuse

Goal: prove accepted work can compound.

Steps:

Accept a task.
Generate skill candidate/draft.
Review and publish skill.
Run a similar task.
Check whether skill activates and improves work.

Pass criteria:

At least 3 pilot skills are reused twice.
Users report lower effort on repeated task.

Test 5: Scheduled Workflow

Goal: validate proactive work.

Steps:

Create scheduled job.
Trigger or wait for scheduled run.
Review notification/output.
Accept or revise.

Pass criteria:

Scheduled run is visible.
Output enters review flow.
Failed run has clear recovery path.

6. Technical Validation

Deployment Validation

Run on a fresh Linux/WSL2 host:

Build images.
Create Docker network.
Start router proxy.
Start authz service.
Start deploy control.
Start auth portal.
Register user.
Configure provider.
Open app instance.
Complete first task.

Pass criteria:

Deployment completes in under 2 hours using only the docs.
No undocumented environment variables.
Public exposure limited to auth portal and router proxy.

Instance Isolation Validation

Checks:

Instance A cannot access Instance B workspace.
User file roots stay scoped.
Router routes each host to the correct container.
Provider config is instance-specific.
Deleting one instance does not affect another.

Pass criteria:

No cross-instance reads/writes.
Registry state remains consistent.

Runtime Validation

Checks:

Chat API.
WebSocket/runtime status.
Task creation and deletion.
Task detail events.
File upload/preview/download/delete.
Tool test.
Skill candidate/draft/review/publish.
Cron create/toggle/run/delete.
Settings provider save.
Runtime restart.

Pass criteria:

Critical user flows pass on desktop and mobile viewport.
Failure states have visible recovery.

7. Security And Governance Validation

Control Plane

Confirm deploy-control and authz-service are not publicly reachable.
Confirm tokens are required for control-plane calls.
Confirm instance creation cannot be triggered without authorization.

Files

Confirm only allowed user roots are visible.
Confirm absolute-style or cross-prefix paths are rejected.
Confirm delete operations require explicit user action.
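
The path checks above can be exercised with a small validator. The following is a minimal sketch, not Beaver's actual implementation: the root names in `ALLOWED_ROOTS` and the function `is_allowed_path` are hypothetical, and "cross-prefix" is read here as a path that only shares a string prefix with an allowed root or escapes it with `..`.

```python
from pathlib import PurePosixPath

# Hypothetical allowed roots; the real root names are deployment-specific.
ALLOWED_ROOTS = ("uploads", "workspace")


def is_allowed_path(user_path, allowed_roots=ALLOWED_ROOTS):
    """Return True only for relative paths that stay inside an allowed root."""
    # Reject empty input, Windows-style separators, and NUL bytes up front.
    if not user_path or "\\" in user_path or "\x00" in user_path:
        return False
    path = PurePosixPath(user_path)
    # Reject absolute-style paths such as "/etc/passwd".
    if path.is_absolute():
        return False
    parts = path.parts
    # Reject traversal segments outright instead of trying to normalise them.
    if not parts or ".." in parts:
        return False
    # Compare whole components, so "uploads_evil/x" does not match "uploads".
    return parts[0] in allowed_roots
```

Comparing the first path component, not a string prefix, is the design choice that matters: a `startswith("uploads")` check would accept `uploads_evil/x.txt`.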

Tools

Classify pilot tools as read, workspace write, external write, destructive, credential/permission.
Record tool calls in task evidence.
Block or require review for dangerous actions.
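
One way to make the classification above testable is a lookup from tool class to a decision, with every call appended to the task evidence. This is a sketch under stated assumptions: the five class names come from the list above, but the `POLICY` mapping, the `Decision` values, and the `decide` helper are illustrative and should be replaced by whatever the pilot team agrees on.

```python
from enum import Enum


class ToolClass(Enum):
    READ = "read"
    WORKSPACE_WRITE = "workspace write"
    EXTERNAL_WRITE = "external write"
    DESTRUCTIVE = "destructive"
    CREDENTIAL_PERMISSION = "credential/permission"


class Decision(Enum):
    ALLOW = "allow"
    REQUIRE_REVIEW = "require review"
    BLOCK = "block"


# Assumed pilot policy, not taken from the plan: tighten or relax per pilot.
POLICY = {
    ToolClass.READ: Decision.ALLOW,
    ToolClass.WORKSPACE_WRITE: Decision.ALLOW,
    ToolClass.EXTERNAL_WRITE: Decision.REQUIRE_REVIEW,
    ToolClass.DESTRUCTIVE: Decision.BLOCK,
    ToolClass.CREDENTIAL_PERMISSION: Decision.BLOCK,
}


def decide(tool_name, tool_class, evidence_log):
    """Look up the decision and record the call so it is reviewable later."""
    decision = POLICY[tool_class]
    evidence_log.append(
        {"tool": tool_name, "class": tool_class.value, "decision": decision.value}
    )
    return decision
```

Recording the entry before returning means blocked and review-gated calls also appear in the evidence, not only the ones that ran.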

Connectors

Use sandbox/test accounts for pilot when possible.
Confirm callback base URL is per-instance.
Confirm disconnect/reconnect path.

Memory

Until Memory Control Center exists:

Keep memory use conservative.
Document what is stored.
Avoid enabling opaque long-term memory for sensitive pilots.

8. Usability Validation

Viewports:

320px.
375px.
390px.
768px.
1024px.
1365px.
One mobile landscape viewport.

Screens:

Auth portal login/register/provider onboarding.
Chat workbench.
Task list/detail.
Files.
Skills.
Marketplace.
Tools.
Notifications/cron.
Outlook/connectors if in pilot.
Settings/status/logs.

Pass criteria:

No horizontal overflow.
No inaccessible critical controls.
Touch targets are usable.
Loading, empty, error, success, and disabled states are visible.

9. Metrics Validation

Instrument or manually collect:

Time to first accepted task.
Accepted tasks per user/team/week.
Acceptance rate.
Revision rate.
Task run failure rate.
Evidence coverage.
Skill candidates.
Skill drafts.
Published skills.
Skill reuse.
Scheduled run success.
Provider setup failure.
Instance creation failure.
Connector setup failure.

Minimum pilot dashboard:

Accepted tasks
Acceptance rate
Revision rate
Task failures
Skill reuse
Scheduled runs
Deployment/provider errors
Critical incidents
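
If the metrics are collected manually at first, the rates can be derived from a flat list of task records. The sketch below assumes a hypothetical `TaskRecord` shape and assumes each rate is a fraction of all finished tasks in the reporting window; the plan does not define the denominators, so agree on them before comparing numbers across teams.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    # Hypothetical record shape; Beaver's real task schema may differ.
    status: str                 # assumed values: "accepted", "abandoned", "failed"
    revisions: int = 0          # number of revision requests on the task
    has_evidence: bool = False  # True if tool calls or artifacts were recorded


def pilot_metrics(tasks):
    """Compute dashboard rates as fractions of all finished tasks."""
    total = len(tasks)
    if total == 0:
        # Avoid division by zero for an empty reporting window.
        return {
            "accepted_tasks": 0,
            "acceptance_rate": 0.0,
            "revision_rate": 0.0,
            "failure_rate": 0.0,
            "evidence_coverage": 0.0,
        }
    accepted = sum(1 for t in tasks if t.status == "accepted")
    revised = sum(1 for t in tasks if t.revisions > 0)
    failed = sum(1 for t in tasks if t.status == "failed")
    with_evidence = sum(1 for t in tasks if t.has_evidence)
    return {
        "accepted_tasks": accepted,
        "acceptance_rate": accepted / total,
        "revision_rate": revised / total,
        "failure_rate": failed / total,
        "evidence_coverage": with_evidence / total,
    }
```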

10. Pilot Exit Criteria

Proceed to broader rollout only if:

A pilot team completes >=30 accepted tasks in 30 days.
At least 2 recurring workflows are active.
At least 5 skills are created and at least 3 of them are reused twice.
Task acceptance rate is >=60%.
No critical security or deployment incidents occur.
Fresh deployment can be repeated from docs.
Admin can diagnose common failures from status/logs/runbook.
Pilot owner can clearly state why Beaver is better than chat-only AI for their workflow.
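
The exit criteria above can be encoded as a checklist so that the rollout decision is reproducible. This is a minimal sketch: the thresholds are the ones listed above, while the `PilotSummary` field names and the `unmet_exit_criteria` function are hypothetical, and the three qualitative criteria are reduced to booleans that a human reviewer sets.

```python
from dataclasses import dataclass


@dataclass
class PilotSummary:
    # Hypothetical field names; fill these from the pilot dashboard and reviews.
    accepted_tasks_30d: int
    active_recurring_workflows: int
    skills_created: int
    skills_reused_twice: int
    acceptance_rate: float        # fraction between 0.0 and 1.0
    critical_incidents: int
    fresh_deploy_repeatable: bool
    admin_can_diagnose: bool
    owner_states_value: bool


def unmet_exit_criteria(s):
    """Return the labels of criteria that are not met; empty means proceed."""
    checks = [
        (s.accepted_tasks_30d >= 30, ">=30 accepted tasks in 30 days"),
        (s.active_recurring_workflows >= 2, "at least 2 recurring workflows"),
        (s.skills_created >= 5, "at least 5 skills created"),
        (s.skills_reused_twice >= 3, "at least 3 skills reused twice"),
        (s.acceptance_rate >= 0.60, "acceptance rate >=60%"),
        (s.critical_incidents == 0, "no critical incidents"),
        (s.fresh_deploy_repeatable, "fresh deployment repeatable from docs"),
        (s.admin_can_diagnose, "admin can diagnose common failures"),
        (s.owner_states_value, "pilot owner can state the value"),
    ]
    return [label for ok, label in checks if not ok]
```

Returning the list of unmet criteria, not a single pass/fail flag, keeps the output usable as input to the decision matrix that follows.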

11. Decision Matrix

Result	Decision
High task acceptance, low skill reuse	Improve skill learning and workflow templates
High interest, deployment friction	Invest in deploy runbook and health console
Good demos, low recurring use	Revisit target segment and workflow selection
High usage, trust concerns	Prioritize evidence narrative, policy, memory controls
Connector demand dominates	Narrow connector roadmap to one high-value system
Memory concerns dominate	Build Memory Control Center before expansion
