beaver_project/docs/product-discovery/beaver/validation-plan.md

# Beaver Validation Plan

Date: 2026-06-09

Purpose: validate Beaver as a whole product before broader rollout.

## 1. Validation Strategy

Beaver should be validated through real workflows, not through opinions about AI.

The validation sequence:

```text
customer problem
  -> workflow fit
  -> first-run onboarding
  -> task execution
  -> evidence comprehension
  -> acceptance/revision
  -> skill reuse
  -> deployment and operations
  -> security/governance
```

## 2. Validation Questions

### Product Value

- Does Beaver solve a painful enough workflow problem?
- Does task acceptance make AI work feel more reliable?
- Do users complete more usable work than with chat-only AI?
- Does skill reuse save time after repeated workflows?

### Usability

- Can users understand when chat becomes a task?
- Can users find task evidence and artifacts?
- Can users accept, revise, or abandon without confusion?
- Can admins configure providers and connectors without engineering help?

### Technical Feasibility

- Can fresh deployments be created repeatably?
- Can app instances stay isolated?
- Can Agent tasks run reliably with files, tools, skills, and scheduled jobs?
- Can failures be diagnosed from status/logs/events?

### Governance And Security

- Are control-plane services private?
- Are file and workspace boundaries enforced?
- Are tool calls recorded and reviewable?
- Are external connector writes controlled?
- Is memory inspectable and controllable before broad use?

### Business Viability

- Does a pilot team have enough recurring workflows?
- Can the product produce measurable weekly value?
- Can an admin operate it with acceptable support load?
- Can the buyer justify expansion?

## 3. Pilot Workflow Candidates

| Workflow | Why It Fits | Required Capabilities | Success Signal |
| --- | --- | --- | --- |
| Weekly project report | Recurring, evidence-sensitive, review-heavy | scheduled work, files, task acceptance, artifacts | Report accepted weekly |
| Project brief / proposal | Multi-step, document-heavy, revision-heavy | chat, files, tools, task timeline, revisions | Brief accepted after fewer rounds |
| Document review | Clear deliverable and evidence need | files, task timeline, artifacts, acceptance | Review output accepted |
| Support triage | Tool/context-heavy and repeatable | tasks, tools, memory, optionally a connector | Triage summary accepted |
| Research synthesis | Agent team fit, artifact-heavy | multi-agent, web/search, files, evidence | Synthesis accepted and reused |

Recommended first pilot:

1. Project brief or document review for the manual task loop.
2. Weekly project report for the scheduled workflow.
3. Skill reuse from the accepted outputs.

## 4. Customer Discovery Validation

### Participants

- 5 end users.
- 3 workflow owners.
- 3 admins/platform owners.
- 2 security reviewers.
- 2 operators/engineers.

### Method

- 45-minute interviews using past-behavior questions.
- 60-minute workflow walkthrough with Beaver.
- Follow-up after one week of usage.

### Evidence To Collect

- Current workflow steps.
- Time spent today.
- Existing tools/files/systems involved.
- Review/approval requirements.
- Trust blockers.
- Repeat frequency.
- What would count as a successful pilot.

### Pass Criteria

- At least 3 workflows are repeated weekly or more.
- At least 2 workflows involve files or external tools.
- At least 2 stakeholders require evidence/auditability.
- At least 1 team lead agrees to a real pilot workflow.

## 5. Product Workflow Validation

### Test 1: First Accepted Task

Goal: the user reaches a first accepted task.

Steps:

1. Register or log in.
2. Configure provider.
3. Start from a suggested workflow or freeform chat.
4. Upload or reference a file if needed.
5. Let Beaver create/continue a task.
6. Inspect output and evidence.
7. Accept or request revision.

Pass criteria:

- User completes without developer assistance.
- First accepted task occurs in one session.
- User can explain what Beaver did.

### Test 2: Revision Loop

Goal: prove Beaver handles "not good enough yet."

Steps:

1. Run a task.
2. Ask for a specific revision.
3. Confirm the same task context continues.
4. Accept revised output.

Pass criteria:

- Revision feedback is preserved.
- Task timeline shows revision.
- User does not need to restate full context.

### Test 3: Evidence Review

Goal: verify trust and auditability.

Steps:

1. Give reviewer a completed task detail page.
2. Ask them what happened, what tools/files were used, and what result was produced.
3. Ask whether they would approve the output.

Pass criteria:

- >=80% of reviewers identify the key actions and artifacts.
- Reviewers can state at least one risk or confidence reason.
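The two criteria above can be tallied from a simple review sheet. The sketch below is illustrative only: the record fields (`reviewer`, `identified_key_items`, `gave_reason`) and the function name are assumptions, not part of Beaver.

```python
def evidence_review_summary(reviews, threshold=0.80):
    """Summarize Test 3 results from a list of reviewer records.

    Each record is a dict with hypothetical keys:
      reviewer              - reviewer name
      identified_key_items  - True if they named the key actions and artifacts
      gave_reason           - True if they stated a risk or confidence reason
    """
    if not reviews:
        raise ValueError("at least one review is required")

    identified = sum(1 for r in reviews if r["identified_key_items"])
    share = identified / len(reviews)

    return {
        "share_identified": share,
        "meets_threshold": share >= threshold,
        # Reviewers listed here need a follow-up question.
        "reviewers_without_reason": [
            r["reviewer"] for r in reviews if not r["gave_reason"]
        ],
    }
```

Listing the reviewers who gave no reason, rather than returning a single pass/fail flag, keeps the second criterion open to judgment during the debrief.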

### Test 4: Skill Reuse

Goal: prove accepted work can compound.

Steps:

1. Accept a task.
2. Generate skill candidate/draft.
3. Review and publish skill.
4. Run a similar task.
5. Check whether skill activates and improves work.

Pass criteria:

- At least 3 pilot skills are each reused at least twice.
- Users report lower effort on repeated task.
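If skill activations are logged per task run, the reuse criterion can be counted directly. This is a minimal sketch that assumes a flat list of skill names, one entry per run in which a published skill activated; the log shape and the skill names are hypothetical.

```python
from collections import Counter


def skills_reused_at_least(activations, min_reuses=2):
    """Return the names of skills that activated at least `min_reuses` times.

    activations: iterable of skill names, one entry per task run in which
    the skill activated after it was published (assumed log format).
    """
    counts = Counter(activations)
    return sorted(name for name, n in counts.items() if n >= min_reuses)


# Example with made-up skill names.
log = ["weekly-report", "brief-outline", "weekly-report", "doc-review",
       "brief-outline", "doc-review", "triage-summary"]
reused = skills_reused_at_least(log)
criterion_met = len(reused) >= 3
```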

### Test 5: Scheduled Workflow

Goal: validate proactive work.

Steps:

1. Create scheduled job.
2. Trigger or wait for scheduled run.
3. Review notification/output.
4. Accept or revise.

Pass criteria:

- Scheduled run is visible.
- Output enters review flow.
- Failed run has clear recovery path.

## 6. Technical Validation

### Deployment Validation

Run on a fresh Linux/WSL2 host:

1. Build images.
2. Create Docker network.
3. Start router proxy.
4. Start authz service.
5. Start deploy control.
6. Start auth portal.
7. Register user.
8. Configure provider.
9. Open app instance.
10. Complete first task.

Pass criteria:

- Deployment completes in under 2 hours using the docs only.
- No undocumented environment variables.
- Public exposure limited to auth portal and router proxy.
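The last two criteria reduce to set comparisons once the inputs are gathered by hand or by script. The sketch below assumes the tester has a list of environment variables the deployment actually needed, a list of those mentioned in the docs, and a map of which services publish host ports. The service names `auth-portal` and `router-proxy` are assumed spellings of the two components the plan allows to be public.

```python
# Assumed names for the two services the plan allows to be publicly exposed.
ALLOWED_PUBLIC = {"auth-portal", "router-proxy"}


def undocumented_variables(used, documented):
    """Environment variable names the deployment needed but the docs never mention."""
    return sorted(set(used) - set(documented))


def unexpected_public_services(published_ports):
    """Services that publish at least one host port but are not allowed to.

    published_ports: mapping of service name -> list of published host ports.
    """
    return sorted(
        name
        for name, ports in published_ports.items()
        if ports and name not in ALLOWED_PUBLIC
    )
```

Both functions return an empty list when the criterion holds, so a non-empty result is the finding to record.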

### Instance Isolation Validation

Checks:

- Instance A cannot access Instance B workspace.
- User file roots stay scoped.
- Router sends each host to the correct container.
- Provider config is instance-specific.
- Deleting one instance does not affect another.

Pass criteria:

- No cross-instance reads/writes.
- Registry state remains consistent.

### Runtime Validation

Checks:

- Chat API.
- WebSocket/runtime status.
- Task creation and deletion.
- Task detail events.
- File upload/preview/download/delete.
- Tool test.
- Skill candidate/draft/review/publish.
- Cron create/toggle/run/delete.
- Settings provider save.
- Runtime restart.

Pass criteria:

- Critical user flows pass on desktop and mobile viewports.
- Failure states have visible recovery.

## 7. Security And Governance Validation

### Control Plane

- Confirm `deploy-control` and `authz-service` are not publicly reachable.
- Confirm tokens are required for control-plane calls.
- Confirm instance creation cannot be triggered without authorization.

### Files

- Confirm only allowed user roots are visible.
- Confirm absolute-style or cross-prefix paths are rejected.
- Confirm delete operations require explicit user action.
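A test harness for the path checks might use a reference predicate like the one below to decide which requests should be rejected. It is a lexical sketch only, not Beaver's implementation: it assumes POSIX-style paths, an example root such as `/data/users/alice`, and it does not resolve symlinks. Comparing path components rather than string prefixes matters because `/data/users/alice2` starts with the string `/data/users/alice` but lies outside that root.

```python
import posixpath
from pathlib import PurePosixPath


def is_within_root(root: str, candidate: str) -> bool:
    """True if `candidate`, joined to `root` and normalized, stays inside `root`."""
    # Reject absolute-style input and Windows-style separators outright.
    if candidate.startswith("/") or "\\" in candidate:
        return False

    root_path = PurePosixPath(posixpath.normpath(root))
    full = PurePosixPath(posixpath.normpath(posixpath.join(root, candidate)))

    # Component-wise containment: root must be the path itself or an ancestor.
    return full == root_path or root_path in full.parents
```

Expected behavior for a few probe paths: `notes/a.txt` is accepted, while `/etc/passwd`, `../bob/a.txt`, and `../alice2/a.txt` are all rejected.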

### Tools

- Classify each pilot tool as one of: read, workspace write, external write, destructive, or credential/permission.
- Record tool calls in task evidence.
- Block or require review for dangerous actions.
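One way to make the classification testable is a small policy table. The five classes come from the list above; the mapping from class to action, the tool names, and the block-by-default rule for unclassified tools are assumptions for illustration and should be set by the pilot's security reviewers.

```python
from enum import Enum


class ToolClass(Enum):
    READ = "read"
    WORKSPACE_WRITE = "workspace write"
    EXTERNAL_WRITE = "external write"
    DESTRUCTIVE = "destructive"
    CREDENTIAL = "credential/permission"


class Action(Enum):
    ALLOW = "allow"
    REVIEW = "require review"
    BLOCK = "block"


# Assumed pilot policy; adjust per security review.
POLICY = {
    ToolClass.READ: Action.ALLOW,
    ToolClass.WORKSPACE_WRITE: Action.ALLOW,
    ToolClass.EXTERNAL_WRITE: Action.REVIEW,
    ToolClass.DESTRUCTIVE: Action.BLOCK,
    ToolClass.CREDENTIAL: Action.BLOCK,
}


def decide(tool_name, registry):
    """Return the action for a tool call; unclassified tools are blocked."""
    tool_class = registry.get(tool_name)
    if tool_class is None:
        return Action.BLOCK
    return POLICY[tool_class]


# Hypothetical tool names, for illustration only.
registry = {
    "read_file": ToolClass.READ,
    "send_email": ToolClass.EXTERNAL_WRITE,
    "delete_workspace": ToolClass.DESTRUCTIVE,
}
```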

### Connectors

- Use sandbox/test accounts for pilot when possible.
- Confirm callback base URL is per-instance.
- Confirm the disconnect/reconnect path works.

### Memory

Until Memory Control Center exists:

- Keep memory use conservative.
- Document what is stored.
- Avoid enabling opaque long-term memory for sensitive pilots.

## 8. Usability Validation

Viewports:

- 320px.
- 375px.
- 390px.
- 768px.
- 1024px.
- 1365px.
- One mobile landscape viewport.

Screens:

- Auth portal login/register/provider onboarding.
- Chat workbench.
- Task list/detail.
- Files.
- Skills.
- Marketplace.
- Tools.
- Notifications/cron.
- Outlook/connectors if in pilot.
- Settings/status/logs.

Pass criteria:

- No horizontal overflow.
- No inaccessible critical controls.
- Touch targets are usable.
- Loading, empty, error, success, and disabled states are visible.

## 9. Metrics Validation

Instrument or manually collect:

- Time to first accepted task.
- Accepted tasks per user/team/week.
- Acceptance rate.
- Revision rate.
- Task run failure rate.
- Evidence coverage.
- Skill candidates.
- Skill drafts.
- Published skills.
- Skill reuse.
- Scheduled run success.
- Provider setup failure.
- Instance creation failure.
- Connector setup failure.

Minimum pilot dashboard:

```text
Accepted tasks
Acceptance rate
Revision rate
Task failures
Skill reuse
Scheduled runs
Deployment/provider errors
Critical incidents
```
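If the metrics are collected manually, the dashboard rates can be derived from a list of task records. The plan does not define the rates, so the definitions here are assumptions: each rate is computed over all tasks in the period, and the record fields `status` and `revisions` are hypothetical.

```python
def rate(numerator, denominator):
    """Return a ratio, or None when there is nothing to divide by."""
    return numerator / denominator if denominator else None


def pilot_metrics(tasks):
    """Compute headline dashboard numbers from task records.

    Each task is a dict with assumed keys:
      status    - "accepted", "abandoned", "failed", or "in_review"
      revisions - number of revision requests on the task
    """
    total = len(tasks)
    accepted = sum(1 for t in tasks if t["status"] == "accepted")
    failed = sum(1 for t in tasks if t["status"] == "failed")
    revised = sum(1 for t in tasks if t["revisions"] > 0)

    return {
        "accepted_tasks": accepted,
        "acceptance_rate": rate(accepted, total),
        "revision_rate": rate(revised, total),
        "task_failure_rate": rate(failed, total),
    }
```

Whichever denominators the pilot settles on, they should be fixed before data collection starts so that weekly numbers stay comparable.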

## 10. Pilot Exit Criteria

Proceed to broader rollout only if:

- A pilot team completes >=30 accepted tasks in 30 days.
- At least 2 recurring workflows are active.
- At least 5 skills are created and at least 3 are reused twice.
- Task acceptance rate is >=60%.
- No critical security or deployment incidents occur.
- Fresh deployment can be repeated from docs.
- Admin can diagnose common failures from status/logs/runbook.
- Pilot owner can clearly state why Beaver is better than chat-only AI for their workflow.
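The exit criteria can be written down as an explicit checklist so the rollout decision is reproducible. The thresholds below are taken from the list above; the `PilotResults` field names are hypothetical, and the three qualitative criteria are recorded as booleans judged by the pilot owner.

```python
from dataclasses import dataclass


@dataclass
class PilotResults:
    accepted_tasks_30d: int
    active_recurring_workflows: int
    skills_created: int
    skills_reused_twice: int
    acceptance_rate: float          # 0.0 to 1.0
    critical_incidents: int         # security or deployment
    fresh_deploy_repeatable: bool   # judged from a docs-only redeploy
    admin_can_diagnose: bool        # judged from status/logs/runbook drills
    owner_states_advantage: bool    # judged in the pilot debrief


def unmet_exit_criteria(r: PilotResults) -> list:
    """Return the names of exit criteria that are not yet satisfied."""
    checks = {
        "accepted_tasks": r.accepted_tasks_30d >= 30,
        "recurring_workflows": r.active_recurring_workflows >= 2,
        "skills_created": r.skills_created >= 5,
        "skills_reused": r.skills_reused_twice >= 3,
        "acceptance_rate": r.acceptance_rate >= 0.60,
        "no_critical_incidents": r.critical_incidents == 0,
        "repeatable_deployment": r.fresh_deploy_repeatable,
        "admin_diagnosis": r.admin_can_diagnose,
        "owner_rationale": r.owner_states_advantage,
    }
    return [name for name, ok in checks.items() if not ok]
```

An empty list means every criterion is met; anything else names the gaps to close before broader rollout.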

## 11. Decision Matrix

| Result | Decision |
| --- | --- |
| High task acceptance, low skill reuse | Improve skill learning and workflow templates |
| High interest, deployment friction | Invest in deploy runbook and health console |
| Good demos, low recurring use | Revisit target segment and workflow selection |
| High usage, trust concerns | Prioritize evidence narrative, policy, memory controls |
| Connector demand dominates | Narrow connector roadmap to one high-value system |
| Memory concerns dominate | Build Memory Control Center before expansion |