Refactor app instance to Keycloak SSO

2026-06-15 15:54:39 +08:00
parent fc9fd93c36
commit 461d1300ad
246 changed files with 1350 additions and 52721 deletions
--- a/docs/product-discovery/beaver/PRD-beaver-agent-sandbox.md
+++ b/docs/product-discovery/beaver/PRD-beaver-agent-sandbox.md
@@ -1,489 +0,0 @@
-# PRD: Beaver Agent Sandbox
-
-Date: 2026-06-09
-
-Status: Product discovery draft for whole Beaver product
-
-## 1. Summary
-
-Beaver Agent Sandbox is a private-deployable workspace for enterprise Agent work. It lets users move from chat to managed tasks, execute work with files and tools, track evidence, accept or revise outputs, and turn successful work into reusable skills and memory.
-
-The first product goal is to prove that Beaver can help a pilot team complete repeatable knowledge work with more control, traceability, and reuse than chat-only AI tools.
-
-## 2. Contacts
-
-| Role | Owner | Comment |
-| --- | --- | --- |
-| Product owner | TBD | Owns positioning, roadmap, pilot metrics, research |
-| Engineering owner | TBD | Owns platform architecture and implementation quality |
-| Design owner | TBD | Owns workspace, task, review, admin, and onboarding UX |
-| Deployment owner | TBD | Owns Docker deployment, routing, instance lifecycle |
-| Security/review owner | TBD | Owns tool policy, data boundaries, connector safety |
-| Pilot owner | TBD | Owns customer/team selection and feedback loop |
-
-## 3. Background
-
-Most enterprise AI experiments start with chat. Chat is useful, but it is weak at real work:
-
- There is no durable task lifecycle.
- It is hard to see what happened.
- File and tool work is scattered.
- Results are not formally accepted or rejected.
- Successful workflows are not turned into reusable team capability.
- Admins cannot easily control deployment, tools, memory, and connectors.
-
-Beaver addresses this gap by acting as an Agent execution and governance layer. It combines a user workspace, task runtime, evidence timeline, file and tool operations, skill learning, scheduled work, connectors, and private multi-instance deployment.
-
-Why now:
-
- Teams are moving from AI demos to operational AI workflows.
- Enterprise buyers need governance, not only model access.
- Beaver already has enough implementation to support pilot workflows.
- The next step is product packaging, validation, and operational hardening.
-
-## 4. Objective
-
-### Objective
-
-Prove Beaver can deliver trusted, repeatable Agent work for pilot teams.
-
-### Key Results
-
-| Key Result | Target |
-| --- | --- |
-| Time to first accepted task | Pilot user reaches first accepted task within first session |
-| Accepted Agent Workflows | >=30 accepted tasks across pilot team within 30 days |
-| Acceptance Rate | >=60% of completed task runs accepted |
-| Evidence Coverage | >=90% of task runs show useful timeline/tool/artifact evidence |
-| Skill Reuse | >=5 reusable skills created, >=3 reused at least twice |
-| Deployment Repeatability | Fresh pilot deployment under 2 hours with documented steps |
-| Critical Incidents | 0 control-plane exposure, data leakage, or unintended external-write incidents |
-
-## 5. Market Segments
-
-### Primary Segment: Enterprise Teams Doing Repeatable Knowledge Work
-
-Examples:
-
- Project delivery teams.
- Operations teams.
- Internal strategy/research teams.
- Technical support and engineering teams.
- Customer success and sales operations teams.
-
-Their work is a good fit when it is:
-
- Repeated often.
- Multi-step.
- File-heavy.
- Tool-heavy.
- Needs review or approval.
- Benefits from a traceable process.
-
-### Buyer Segment: AI Platform Owner / IT Leader
-
-They need to provide AI capability without losing control over deployment, data, tools, and governance.
-
-### Admin Segment: Operator / Implementation Owner
-
-They set up Beaver, manage model providers, monitor health, handle connectors, and support users.
-
-### Maintainer Segment: Skill Owner
-
-They curate reusable skills and make sure published skills are safe, useful, and reviewable.
-
-## 6. Value Propositions
-
-### For Workflow Teams
-
-Beaver turns AI conversations into managed work. A request can become a task, produce artifacts, show evidence, and continue through revision until accepted.
-
-### For Platform Owners
-
-Beaver offers a private Agent sandbox with instance boundaries, tool governance, skills, and operational controls.
-
-### For Admins
-
-Beaver makes onboarding and operations more repeatable through auth portal, deploy control, routing, settings, status, and logs.
-
-### For Skill Maintainers
-
-Beaver turns accepted work into reusable skills through candidate, draft, safety/eval, review, and publish flow.
-
-### For End Users
-
-Beaver gives one place to chat, upload files, run tasks, preview outputs, review results, and reuse proven methods.
-
-## 7. Solution
-
-### 7.1 User Experience
-
-#### First-Run Experience
-
-```text
-User registers
-  -> app instance is created
-  -> user configures model provider
-  -> user enters Beaver workspace
-  -> user starts from a workflow template or chat
-  -> Beaver creates or continues a task
-  -> user accepts or revises the result
-```
-
-Requirements:
-
- Registration and instance provisioning must show clear progress and errors.
- Provider setup must be understandable and recoverable.
- If provider setup is skipped, the app must clearly explain why Agent calls cannot run.
-
-#### Daily User Workspace
-
-Primary screens:
-
- Chat workbench.
- Task list and task details.
- Files.
- Notifications and scheduled work.
- Skills and marketplace.
- Tool management.
- Settings/status/logs.
-
-Core user loop:
-
-```text
-Ask
-  -> execute
-  -> inspect evidence
-  -> accept/revise
-  -> reuse
-```
-
-#### Admin Experience
-
-Admin needs:
-
- See instance health.
- Configure providers.
- Configure channels/connectors.
- Restart safely.
- Inspect logs.
- Manage tools and skills.
- Understand failures.
-
-### 7.2 Key Features
-
-#### Authentication And Instance Provisioning
-
-Requirements:
-
- Users register or log in through auth portal.
- Registration triggers an app-instance container.
- Router maps instance host to container.
- Provider onboarding can configure model provider after instance creation.
-
-Acceptance criteria:
-
- New user can reach a working instance.
- Failed provisioning shows a recoverable error.
- `deploy-control` and `authz-service` are not public surfaces.
-
-#### Chat Workbench
-
-Requirements:
-
- Users can create/select sessions.
- Users can send text and attachments.
- Users can see Assistant messages, task cards, Agent run progress, and acceptance controls.
- Users can jump from chat to task detail.
-
-Acceptance criteria:
-
- User can complete one full chat-to-task-to-accept flow.
- Attachments can be uploaded and used.
- Current task status is visible.
-
-#### Task Lifecycle
-
-Requirements:
-
- System can distinguish ordinary chat from task requests.
- Task can be created, run, continued, revised, accepted, abandoned, or deleted.
- Task detail shows timeline, runs, tools, artifacts, result, and acceptance controls.
-
-Acceptance criteria:
-
- Task list and detail remain useful on mobile and desktop.
- Acceptance actions are persisted.
- Revision feedback continues the same task context.
-
-#### Agent Team Execution
-
-Requirements:
-
- Complex tasks can be planned as sequence, parallel, or DAG execution.
- Subtasks can bind skills or ephemeral guidance.
- Main Agent synthesizes final answer from evidence.
-
-Acceptance criteria:
-
- Subtask results are visible and debuggable.
- Failed team execution is shown without hiding partial evidence.
-
-#### Files Workspace
-
-Requirements:
-
- Users can upload, create folders, browse, preview, download, and delete files.
- Workspace roots stay understandable.
- File operations are safe within instance boundaries.
-
-Acceptance criteria:
-
- Root and nested directories work.
- Text/Markdown/image preview works.
- Long file names do not break layout.
-
-#### Tools And MCP
-
-Requirements:
-
- Admins can view, test, add, edit, delete, and refresh tools where supported.
- Agent runtime can expose tools based on task and skill context.
- Tool calls are recorded as evidence.
-
-Acceptance criteria:
-
- Tool detail and test flows work.
- Dangerous tools are governed by policy before broad rollout.
-
-#### Skills And Marketplace
-
-Requirements:
-
- Published skills can be listed, inspected, installed, uploaded, disabled, rolled back, or deleted where supported.
- Accepted work can create skill candidates.
- Candidates can become drafts.
- Drafts require safety/eval/review gates before publish.
- Marketplace supports discovery and install.
-
-Acceptance criteria:
-
- Candidate and draft flows do not reset UI state unexpectedly.
- Publish requires review gates.
- Published skill can be reused by later tasks.
-
-#### Memory
-
-Requirements:
-
- Beaver can store long-term preferences, business knowledge, historical task learning, file/artifact memory, tool experience, and reusable workflows.
- Before broad product use, users/admins need memory inspect/edit/delete/freeze controls.
-
-Acceptance criteria for Memory Control Center MVP:
-
- User can see what is remembered.
- User can see source and last-used context.
- User can edit, delete, or freeze memory.
- Task detail can show when memory affected execution.
-
-#### Scheduled Work And Notifications
-
-Requirements:
-
- Users can create scheduled jobs.
- Scheduled runs can produce notifications or tasks.
- Users can review, revise, or accept scheduled outputs.
-
-Acceptance criteria:
-
- Scheduled job can be created, toggled, run now, deleted.
- Scheduled output can enter normal task review flow.
-
-#### Connectors
-
-Requirements:
-
- Beaver can connect to external systems such as Outlook and selected IM/channel connectors.
- Connector status, setup, errors, and reconnect path must be visible.
- External writes require clear policy and safety boundary.
-
-Acceptance criteria:
-
- Pilot-safe connector list is documented.
- External connector callbacks route correctly in multi-instance deployment.
- Failed connector auth or setup is recoverable.
-
-#### Settings, Status, Logs
-
-Requirements:
-
- Users/admins can configure provider, Agent settings, channels, and runtime.
- Status page shows current app health.
- Logs help operators diagnose failures.
- Restart is confirmed before execution.
-
-Acceptance criteria:
-
- Provider save flow works.
- Runtime restart flow is protected by confirmation.
- Long config values do not break UI.
-
-### 7.3 Technology
-
-Frontend:
-
- Next.js app inside `app-instance/frontend`.
- App shell with chat, tasks, files, skills, marketplace, tools, connectors, settings, status, logs.
-
-Backend:
-
- Python Beaver backend inside `app-instance/backend`.
- Unified `beaver.engine` for Agent runtime.
- `beaver.coordinator` for multi-agent execution.
- `beaver.services` for task, cron, process, and application orchestration.
- `beaver.tools` for built-in/MCP tool execution.
- `beaver.skills` for skill loading, learning, review, publishing.
- `beaver.memory` for run memory, skills memory, long-term memory foundation.
- `beaver.interfaces` for web, MCP, channels, CLI/gateway surfaces.
-
-Deployment:
-
- `auth-portal`.
- `authz-service`.
- `deploy-control`.
- `router-proxy`.
- `app-instance`.
- Docker network and per-instance mounted runtime directories.
-
-### 7.4 Data And Evidence
-
-Important product data:
-
- Users and auth handoff.
- Instance registry.
- Provider configuration.
- Conversations and messages.
- Tasks, task runs, run events, timeline events.
- Tool calls and results.
- Files and artifacts.
- Skill receipts, candidates, drafts, safety/eval reports, reviews, published versions.
- Memory records.
- Scheduled jobs and scheduled runs.
- Connector state and events.
-
-Evidence principle:
-
-Every meaningful Agent action should become explainable later.
-
-### 7.5 Assumptions
-
- The best first customers are teams with repeatable knowledge workflows.
- Task acceptance is the right primary quality signal.
- Private deployment is a benefit, not a barrier, for early enterprise pilots.
- Teams will value skill/memory reuse after enough accepted tasks.
- Admins can operate a Docker-based deployment with a clear runbook.
- Memory must be controllable before it can be trusted.
-
-### 7.6 Non-Goals For First Pilot
-
- Broad public SaaS launch.
- Full multi-tenant organization management.
- Fully autonomous skill publishing.
- Production external writes without clear review.
- Complete enterprise RBAC.
- Unlimited connector support.
- Perfect long-term memory automation.
- Replacing human review for high-risk work.
-
-## 8. Release
-
-### Release 0: Internal Demo Readiness
-
-Scope:
-
- Clean local deployment.
- Auth portal registration/login.
- Provider onboarding.
- Chat-to-task demo.
- Task detail evidence.
- File upload/preview.
- Skills and marketplace demo.
- Settings/status/logs.
-
-Exit criteria:
-
- Demo flow works on fresh environment.
- Known limitations are documented.
- No critical security/deployment issue.
-
-### Release 1: Pilot Workflow Release
-
-Scope:
-
- 2-3 packaged workflows.
- Task acceptance and evidence as main flow.
- Files and selected tools.
- Basic scheduled workflow.
- One pilot-safe connector if stable.
- Skill candidate/draft/review/publish.
- Deployment runbook and support checklist.
-
-Exit criteria:
-
- Pilot team reaches >=30 accepted tasks in 30 days.
- >=5 reusable skills created.
- 0 critical incidents.
- Deployment under 2 hours on fresh host.
-
-### Release 2: Governance And Reuse Release
-
-Scope:
-
- Evidence narrative.
- Memory Control Center.
- Skill replay/eval governance.
- Admin health console.
- Connector policy hardening.
- Pilot scorecard.
-
-Exit criteria:
-
- Reviewers understand evidence.
- Users can inspect and control memory.
- Admins can diagnose provider/connector/runtime issues.
- Skill reuse is visible in metrics.
-
-### Release 3: Expansion Release
-
-Scope:
-
- Team/workspace concepts if validated.
- More connectors.
- Audit export.
- Cross-instance analytics.
- Policy profiles.
- Instance lifecycle automation.
-
-Exit criteria:
-
- Multiple teams can run without high support load.
- Governance story supports enterprise buying process.
-
-## Open Questions
-
- Is the first paying segment project teams, operations teams, engineering/support, or internal AI platform teams?
- Should Beaver optimize for single-user instances first or team workspaces sooner?
- Which connector is the safest and most valuable pilot connector?
- What exact tool policy should apply in customer pilots?
- What memory behavior should be on by default?
- How much raw evidence should normal users see versus admins?
- What is the backup/restore SLA for app instances?
-
-## Success Review Checklist
-
- Can a new user get to first accepted task quickly?
- Can a reviewer understand what the Agent did?
- Can an admin recover from provider or connector errors?
- Can a successful task become a reusable skill?
- Can a pilot owner prove value with metrics?
- Can security explain the deployment and tool boundaries?
--- a/docs/product-discovery/beaver/README.md
+++ b/docs/product-discovery/beaver/README.md
@@ -1,30 +1,50 @@
-# Beaver Product Discovery
+# Beaver Standalone App Instance

-This folder covers Beaver as the whole product, not only one feature.
+This branch narrows Beaver to a clean standalone app instance that an external orchestrator can deploy.

-Beaver is an enterprise Agent sandbox and execution platform. It combines private deployment, per-user app instances, chat-to-task execution, task evidence, user acceptance, files, tools, skills, memory, connectors, scheduled work, and governance.
+## Product Boundary

-## Documents
+The app instance provides:

-- [Business Strategy HTML](./index.html): business-style product discovery, strategy canvas, target users, segmentation, and competitors.
-- [Product PRD HTML](./product-prd.html): product PRD, outcome roadmap, module job stories, WWA backlog items, and test scenarios.
-- [Product Discovery Report](./product-discovery-report.md): product understanding, users, JTBD, opportunities, assumptions, experiments, priorities, metrics, and 30/90 day recommendations.
-- [Product Architecture Brief](./product-architecture-brief.md): product-facing architecture across auth, deployment control, routing, app instances, frontend, backend, Agent runtime, tools, skills, memory, files, connectors, and operations.
-- [PRD](./PRD-beaver-agent-sandbox.md): full-product PRD for the Beaver Agent Sandbox.
-- [Validation Plan](./validation-plan.md): customer, product, technical, security, usability, and business validation plan.
-- [Launch And Maintenance Runbook](./launch-maintenance-runbook.md): launch phases, readiness checks, monitoring, incident response, maintenance cadence, and rollback.
+- Chat and task workspace
+- Files, tools, skills, memory, schedules, and runtime pages
+- Backend API and WebSocket access behind the same origin
+- Keycloak SSO login with Authorization Code Flow + PKCE
+- JWT-based user identity using Keycloak `sub`
+
+The app instance does not provide:
+
+- Local registration or password login
+- User ID lifecycle management
+- Per-user instance creation
+- Hostname routing
+- Deployment control-plane APIs
+- Keycloak client provisioning
+
+## External Responsibilities
+
+The external orchestrator owns:
+
+- Container lifecycle
+- Public URL, TLS, reverse proxy, and port mapping
+- Data volume provisioning
+- `config.json` provisioning
+- Keycloak redirect URI and web origin registration
+- Multi-instance or tenant mapping, if needed later
+
+## Current SSO Values
+
+```text
+issuer:       https://keycloak.bwgdi.com/realms/beaver
+client_id:    beaver-agnet
+web_origin:   http://172.19.0.245:18080
+redirect_uri: http://172.19.0.245:18080/auth/callback
+post_logout_redirect_uri: http://172.19.0.245:18080/logout/callback
+```

 ## Source Material

 - [Project README](../../../README.md)
-- [Deployment Guide](../../../部署指南.md)
-- [Domain Guide](../../../域名配置指引.md)
 - [App Instance README](../../../app-instance/README.md)
 - [Backend README](../../../app-instance/backend/README.md)
-- [Recent Backend Features](../../../projcet_review/backend_recent_completed_features.md)
 - [UI/UX Page Docs](../../ui-ux/README.md)
-- [Customer Presentation](../../presentations/skill-replay-eval/index.html)
-
-## Related Feature Discovery
-
-- [Skill Replay Eval Discovery](../skill-replay-eval/README.md)
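
Note on the README change above: the new README states that the app instance logs in through Keycloak with Authorization Code Flow + PKCE. The sketch below illustrates what the PKCE half of that flow could look like, using the issuer, client ID, and redirect URI from the "Current SSO Values" block. It is not taken from this commit. The function names are hypothetical, and the authorization endpoint path is an assumption based on Keycloak's usual `/protocol/openid-connect/auth` layout; a real client would more likely read `authorization_endpoint` from the issuer's discovery document.

```python
import base64
import hashlib
import secrets
from urllib.parse import urlencode

# Values copied from the "Current SSO Values" block in the README diff.
ISSUER = "https://keycloak.bwgdi.com/realms/beaver"
CLIENT_ID = "beaver-agnet"
REDIRECT_URI = "http://172.19.0.245:18080/auth/callback"


def new_code_verifier() -> str:
    """Return a random PKCE code verifier made of URL-safe characters."""
    # 64 random bytes encode to 86 characters, inside the 43..128 range
    # that RFC 7636 allows for a verifier.
    return secrets.token_urlsafe(64)


def code_challenge_s256(verifier: str) -> str:
    """Derive the S256 challenge: BASE64URL(SHA256(verifier)), no padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def build_authorize_url(challenge: str, state: str) -> str:
    """Build the authorization request URL (hypothetical helper).

    Assumption: the endpoint path below follows Keycloak's usual layout
    and is not confirmed by this commit.
    """
    params = {
        "response_type": "code",
        "client_id": CLIENT_ID,
        "redirect_uri": REDIRECT_URI,
        "scope": "openid",
        "state": state,
        "code_challenge": challenge,
        "code_challenge_method": "S256",
    }
    return f"{ISSUER}/protocol/openid-connect/auth?{urlencode(params)}"


if __name__ == "__main__":
    # The verifier stays with the client; only the challenge is sent now.
    verifier = new_code_verifier()
    state = secrets.token_urlsafe(16)
    print(build_authorize_url(code_challenge_s256(verifier), state))
```

The verifier is sent later, together with the authorization code, when the client exchanges the code for tokens at the token endpoint. The user identity would then come from the `sub` claim of the returned JWT, after the token's signature and issuer have been validated.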
--- a/docs/product-discovery/beaver/index.html
+++ b/docs/product-discovery/beaver/index.html
--- a/docs/product-discovery/beaver/launch-maintenance-runbook.md
+++ b/docs/product-discovery/beaver/launch-maintenance-runbook.md
@@ -1,455 +0,0 @@
-# Beaver Launch And Maintenance Runbook
-
-Date: 2026-06-09
-
-Scope: whole Beaver product.
-
-## 1. Launch Principle
-
-Launch Beaver through controlled pilots before broad rollout.
-
-The product has a wide operational surface: auth, deployment control, routing, per-instance app containers, model providers, Agent runtime, tools, files, skills, memory, scheduled work, and connectors. A successful launch depends as much on reliability and trust as on feature completeness.
-
-## 2. Launch Roles
-
-| Role | Responsibility |
-| --- | --- |
-| Launch owner | Owns readiness, go/no-go, rollout phases |
-| Deployment owner | Owns Docker images, network, router, instance lifecycle |
-| Backend owner | Owns Agent runtime, tasks, tools, skills, cron, APIs |
-| Frontend owner | Owns user-facing flows and UI verification |
-| Security owner | Owns control-plane exposure, data boundaries, tool/connector policy |
-| Pilot owner | Owns user onboarding, workflow selection, feedback, metrics |
-| Support owner | Owns incident triage, runbook updates, user support |
-
-## 3. Launch Phases
-
-### Phase 0: Local Internal Readiness
-
-Audience: builders and internal testers.
-
-Goals:
-
- Full local deployment works.
- Core demo flows are stable.
- Known risks are documented.
-
-Required flows:
-
- Register/login.
- Provider onboarding.
- First chat response.
- Chat-to-task.
- Task acceptance/revision.
- File upload/preview/download/delete.
- Skill list/candidate/draft/review.
- Settings/status/restart.
-
-Exit criteria:
-
- Fresh deployment run completed from docs.
- No P0 or launch-blocking P1 issues.
- Demo script works end to end.
-
-### Phase 1: Controlled Pilot
-
-Audience: one internal team or one trusted customer team.
-
-Goals:
-
- Validate real workflow value.
- Validate deployment and support process.
- Validate trust, evidence, and governance story.
-
-Constraints:
-
- Narrow workflow scope.
- Narrow connector scope.
- Conservative tool policy.
- Human review for skill publishing.
- No opaque memory use for sensitive data.
-
-Exit criteria:
-
- >=30 accepted tasks in 30 days.
- >=2 recurring workflows.
- 0 critical incidents.
- Deployment/support issues documented and reduced.
-
-### Phase 2: Expanded Pilot
-
-Audience: more users in same team or a second pilot team.
-
-Goals:
-
- Test repeatability across workflows.
- Introduce Memory Control Center or stricter memory policy if ready.
- Strengthen skill reuse and scheduled work.
-
-Exit criteria:
-
- Skill reuse becomes visible.
- Admin can operate without developer pairing for common tasks.
- Evidence and report quality are accepted by workflow owner.
-
-### Phase 3: Production Candidate
-
-Audience: broader customer or department rollout.
-
-Goals:
-
- Stabilized deployment.
- Health monitoring.
- Incident response.
- Backup/restore process.
- Policy profiles.
-
-Exit criteria:
-
- Launch owner, security owner, and deployment owner approve.
- Support process has clear ownership.
- Rollback and restore are rehearsed.
-
-## 4. Pre-Launch Checklist
-
-### Deployment
-
- [ ] Images build successfully.
- [ ] Docker network exists.
- [ ] Router proxy starts.
- [ ] AuthZ service starts.
- [ ] Deploy control starts.
- [ ] Auth portal starts.
- [ ] App instance can be created.
- [ ] App instance route works through router proxy.
- [ ] Provider config can be written and instance restarted.
- [ ] Runtime directories are persistent.
- [ ] Public exposure limited to intended services.
-
-### Product Flows
-
- [ ] Register/login works.
- [ ] Provider onboarding works.
- [ ] Chat workbench loads.
- [ ] Task creation works.
- [ ] Task detail timeline works.
- [ ] Acceptance/revision/abandon works.
- [ ] Files page works.
- [ ] Tools page works for pilot tools.
- [ ] Skills page works.
- [ ] Marketplace install works if included.
- [ ] Cron/scheduled flow works if included.
- [ ] Connector flow works if included.
- [ ] Settings/status/logs work.
-
-### Governance
-
- [ ] Tool policy for pilot is documented.
- [ ] Connector side effects are understood.
- [ ] Skill publish gates are documented.
- [ ] Memory behavior is documented.
- [ ] Data retention expectations are documented.
- [ ] User-facing limitations are documented.
-
-### Support
-
- [ ] Pilot support channel exists.
- [ ] Incident owner assigned.
- [ ] Logs and health checks are accessible.
- [ ] Backup/restore expectations are clear.
- [ ] Known issues list exists.
-
-## 5. Monitoring
-
-### Product Metrics
-
-| Metric | Owner | Cadence |
-| --- | --- | --- |
-| Accepted tasks | Pilot owner | Weekly |
-| Acceptance rate | Product owner | Weekly |
-| Revision rate | Product owner | Weekly |
-| Active workflows | Pilot owner | Weekly |
-| Skill candidates and reuse | Product owner | Weekly |
-| Scheduled run success | Backend owner | Weekly |
-| Time to first accepted task | Product/design | Per onboarding |
-
-### Operational Metrics
-
-| Metric | Owner | Alert |
-| --- | --- | --- |
-| Instance creation failures | Deployment owner | >10% during pilot |
-| Router route failures | Deployment owner | Any repeated failure |
-| Provider setup failures | Support owner | >20% of onboarded users |
-| Task run failures | Backend owner | >20% for 2 days |
-| WebSocket/runtime disconnects | Backend/frontend | Repeated user-visible failures |
-| File operation failures | Backend owner | Any permission/path issue |
-| Tool execution failures | Backend owner | Repeated by tool category |
-| Cron failures | Backend owner | Any critical scheduled workflow missed |
-| Connector failures | Integration owner | Failed auth or unintended write |
-
-### Security Metrics
-
-| Metric | Alert |
-| --- | --- |
-| Control-plane public exposure | Immediate P0 |
-| Cross-instance data access | Immediate P0 |
-| Unintended external write | Immediate P0 |
-| Credential leak in logs/report | Immediate P0 |
-| Unsafe skill publish | P1, or P0 if external action risk |
-
-## 6. Health Checks
-
-### Control Plane
-
- Auth portal reachable.
- AuthZ service reachable internally.
- Deploy control reachable internally with token.
- Router proxy has generated routes.
- Instance registry is readable and consistent.
-
-### App Instance
-
- Frontend loads.
- Backend `/api/status` responds.
- WebSocket works.
- Provider config present.
- Workspace path mounted.
- Initial skills present.
- Logs accessible.
-
-### Product Runtime
-
- Chat request succeeds.
- Task run succeeds.
- File API succeeds.
- Tool registry loads.
- Skills list loads.
- Cron scheduler active if enabled.
- Connector status loads if enabled.
-
-## 7. Incident Response
-
-### P0: Control Plane Exposed
-
-Examples:
-
- `deploy-control` accessible from public internet.
- `authz-service` accessible from public internet.
- Internal token leaked.
-
-Actions:
-
-1. Remove public route/firewall exposure.
-2. Rotate affected tokens.
-3. Review access logs.
-4. Confirm no unauthorized instance operations.
-5. Update deployment checklist.
-
-### P0: Cross-Instance Data Leak
-
-Examples:
-
- Instance A reads Instance B workspace.
- Router sends user to wrong instance.
- Shared connector callback writes to wrong instance.
-
-Actions:
-
-1. Disable affected route or instance.
-2. Preserve logs and registry.
-3. Identify path/host/callback mapping failure.
-4. Patch and add regression test.
-5. Notify affected stakeholders.
-
-### P0: Unintended External Action
-
-Examples:
-
- Email or IM message sent unexpectedly.
- Calendar invite created unexpectedly.
- External system updated without user intent.
-
-Actions:
-
-1. Disable connector or tool.
-2. Preserve task/tool evidence.
-3. Identify initiating task, tool, arguments, user, connector account.
-4. Patch policy or confirmation gate.
-5. Add test case and update pilot policy.
-
-### P1: New User Cannot Reach Instance
-
-Actions:
-
-1. Check auth portal logs.
-2. Check authz register flow.
-3. Check deploy-control register/configure flow.
-4. Check instance registry.
-5. Check router route generation.
-6. Check container state and app logs.
-
-### P1: Provider Config Broken
-
-Actions:
-
-1. Check settings/status.
-2. Confirm config path and provider fields.
-3. Test provider credentials.
-4. Restart instance if config was changed.
-5. Improve onboarding copy if user error.
-
-### P1: Task Runtime Failing
-
-Actions:
-
-1. Check backend logs.
-2. Check provider availability.
-3. Check tool registry.
-4. Check task event timeline.
-5. Reproduce with minimal chat request.
-6. Mark affected pilot workflow as paused if repeated.
-
-### P2: UI Flow Confusing
-
-Actions:
-
-1. Record screen and user quote.
-2. Add to UX issue list.
-3. Determine whether it blocks pilot success.
-4. Fix copy/layout if low effort.
-
-## 8. Maintenance Cadence
-
-### Daily During Pilot
-
- Check critical incidents.
- Check instance health.
- Check failed task runs.
- Check support channel.
- Review provider/connector errors.
-
-### Weekly
-
- Review accepted tasks and acceptance rate.
- Review workflow success/failure.
- Review skill candidates and reuse.
- Review deployment issues.
- Review security/tool/connector events.
- Update known issues and runbook.
-
-### Monthly
-
- Rehearse fresh deployment.
- Review backup/restore approach.
- Review memory and skill governance.
- Review connector roadmap.
- Review pilot ROI and expansion decision.
-
-### Quarterly
-
- Revisit product positioning.
- Revisit architecture scaling assumptions.
- Decide team workspace / RBAC roadmap.
- Review security model and policy profiles.
-
-## 9. Backup And Restore
-
-Minimum data to preserve:
-
- `authz-service/runtime/data`
- `app-instance/runtime/instances`
- `app-instance/runtime/registry`
- `router-proxy/runtime/conf.d`
-
-Per instance:
-
- `beaver-home/config.json`
- `beaver-home/web_auth_users.json`
- `beaver-home/workspace/`
- skill and runtime state under instance data.
-
-Pilot requirements:
-
- Document manual backup command.
- Document manual restore procedure.
- Test restore for at least one non-production instance before expanded pilot.
-
-## 10. Change Management
-
-Before changing any of these, require launch owner review:
-
- Routing/proxy config.
- AuthZ issuer/internal URL.
- Deploy token names or values.
- Instance registry format.
- Workspace mount paths.
- Provider config schema.
- Tool execution policy.
- Connector callback routing.
- Skill publish gates.
- Memory default behavior.
-
-## 11. Rollback
-
-Rollback options:
-
- Roll back frontend/backend image for app instances.
- Disable specific connector.
- Disable scheduled job execution.
- Disable skill learning worker.
- Disable skill publish.
- Fall back to chat-only mode for affected workflow.
- Remove public route to affected instance.
- Restore instance data from backup.
-
-Rollback triggers:
-
- P0 incident.
- Repeated instance creation failure.
- Repeated task runtime failure blocking pilot work.
- Provider config issue affecting most users.
- Connector side-effect risk.
- UI issue blocking first accepted task.
-
-## 12. Launch Communication
-
-### Internal
-
-Beaver is launching as a controlled Agent execution pilot. The launch goal is not maximum feature breadth. The goal is to prove repeatable AI-assisted work with task acceptance, evidence, and reuse.
-
-### Pilot Users
-
-Use Beaver for selected workflows where you need a concrete output. Review each result. Accept it if usable, request revision if it is close, or abandon it if it is not worth continuing. Your feedback is the signal that helps Beaver improve and reuse work.
-
-### Admins
-
-Treat Beaver as an app platform with a control plane and per-instance runtime. Keep deploy-control and authz private. Monitor instance health, provider config, tool behavior, and connector side effects.
-
-## 13. Known Limitations To Disclose
-
- Memory is not yet fully productized with user controls.
- Connector maturity varies by provider.
- The first pilot should use a narrow set of workflows.
- Some operations may still require engineering support.
- Skill learning needs human review before publish.
- Multi-user organization features are not the first pilot focus.
-
-## 14. Go / No-Go Criteria
-
-Go if:
-
- Fresh deployment works.
- First accepted task flow works.
- Evidence timeline is readable enough for pilot.
- Tool and connector policy is documented.
- Support owner is assigned.
- No critical security issue is open.
-
-No-go if:
-
- Control-plane exposure risk is unresolved.
- Cross-instance isolation is unverified.
- Provider onboarding fails for most users.
- Task runtime is unreliable.
- Pilot workflow is not defined.
- No one owns incidents or support.
--- a/docs/product-discovery/beaver/product-architecture-brief.md
+++ b/docs/product-discovery/beaver/product-architecture-brief.md
@@ -1,439 +0,0 @@
-# Beaver Product Architecture Brief
-
-Date: 2026-06-09
-
-Audience: product, engineering, delivery, security, and pilot stakeholders.
-
-## 1. Architecture Summary
-
-Beaver is built as a private-deployable, multi-instance Agent workspace.
-
-At the top level, it has five deployment components:
-
-```text
-Browser
-  -> auth-portal
-  -> authz-service
-  -> deploy-control
-  -> router-proxy
-  -> app-instance
-```
-
-Each `app-instance` contains the user-facing product:
-
-```text
-app-instance container
-  -> Nginx
-  -> Next.js frontend
-  -> Beaver backend
-  -> mounted beaver-home
-       -> config
-       -> workspace
-       -> skills
-       -> runtime data
-```
-
-The key product architecture choice is instance-level sandboxing. Each user or team can receive a separate app instance with its own config, workspace, files, skills, and runtime data.
-
-## 2. Product-Level System Map
-
-```text
-Auth and onboarding
-  auth-portal
-    -> register/login
-    -> model provider onboarding
-  authz-service
-    -> account and backend identity
-  deploy-control
-    -> create/configure/remove app-instance
-  router-proxy
-    -> route instance host to app-instance container
-
-User workspace
-  app-instance/frontend
-    -> chat workbench
-    -> tasks
-    -> files
-    -> skills
-    -> marketplace
-    -> MCP/tools
-    -> notifications/cron
-    -> connectors
-    -> settings/status/logs
-
-Agent runtime
-  app-instance/backend
-    -> interfaces
-    -> services
-    -> engine
-    -> coordinator
-    -> tools
-    -> skills
-    -> memory
-    -> integrations
-```
-
-## 3. Deployment Components
-
-### Auth Portal
-
-Responsibility:
-
- User login and registration entry.
- Provider onboarding after registration.
- Handoff into the user app instance.
-
-Product value:
-
- Gives non-technical users a clean entry point.
- Separates account onboarding from the per-instance app.
-
-Key risk:
-
- Provider configuration must be understandable and recoverable for non-engineer users.
-
-### AuthZ Service
-
-Responsibility:
-
- Account and backend identity orchestration.
- Internal token-protected coordination.
-
-Product value:
-
- Centralizes identity relationships between portal and app backends.
-
-Key risk:
-
- Misconfigured issuer/internal URL can break new app instances.
-
-### Deploy Control
-
-Responsibility:
-
- Create, configure, and manage app instances.
- Call `app-instance/create-instance.sh`.
- Write provider config and restart instance when needed.
-
-Product value:
-
- Makes private instance provisioning repeatable.
-
-Key risk:
-
- Must not be exposed publicly.
- Needs health checks and lifecycle operations for pilot scale.
-
-### Router Proxy
-
-Responsibility:
-
- Route hostnames to the correct app instance container.
-
-Product value:
-
- Lets each instance have a stable public URL.
-
-Key risk:
-
- Domain, wildcard DNS, HTTPS, and route reload errors can block access.
-
-### App Instance
-
-Responsibility:
-
- The user-facing Beaver workspace.
- Runs frontend, backend, and Nginx in one container.
- Mounts the instance's `beaver-home` as the config and workspace boundary.
-
-Product value:
-
- Provides practical sandboxing for early private deployments.
-
-Key risk:
-
- Instance lifecycle, backup, restore, and resource limits need productized operations.
-
-## 4. App Instance Product Modules
-
-### Frontend Modules
-
-| Module | Route | Product Job |
-| --- | --- | --- |
-| Chat workbench | `/` | Main workspace for conversation, attachments, task cards, and acceptance |
-| Tasks | `/tasks`, `/tasks/[taskId]` | Track ordinary and scheduled task lifecycle, timeline, evidence, artifacts |
-| Notifications | `/notifications` | Review proactive or scheduled outputs |
-| Cron | `/cron` | Manage scheduled jobs |
-| Files | `/files` | Browse, upload, preview, download, delete workspace files |
-| Skills | `/skills` | Manage published skills, candidates, drafts, safety/eval, review, publish |
-| Marketplace | `/marketplace` | Discover and install skills |
-| MCP/tools | `/mcp` | Manage tool servers, tool details, test, add, edit, delete |
-| Agents | `/agents` | Manage Agent definitions and roles |
-| Outlook/connectors | `/outlook`, settings connector panels | Connect external systems |
-| Settings/status/logs | `/settings`, `/status`, `/logs` | Configure providers, runtime, channels, health, and debugging |
-
-### Backend Modules
-
-| Module | Responsibility |
-| --- | --- |
-| `foundation` | Shared config, errors, events, utilities, base models |
-| `engine` | Unified Agent runtime used by main Agent and sub-agents |
-| `coordinator` | Multi-agent sequence/parallel/DAG execution |
-| `tools` | Built-in and MCP tool registration/execution |
-| `skills` | Skill loading, resolution, drafts, learning, review, publish |
-| `memory` | Long-term memory and run/skill stores |
-| `permissions` | Governance and policy surface |
-| `services` | Application orchestration, tasks, cron, process projection |
-| `interfaces` | Web, CLI, Gateway, channels, MCP servers |
-| `integrations` | AuthZ, MCP, external protocols, connector clients |
-
-## 5. Core Product Flows
-
-### Flow A: New User Registration And First Workspace
-
-```text
-Browser
-  -> auth-portal register
-  -> authz-service /portal/register
-  -> deploy-control /api/instances/register
-  -> create app-instance container
-  -> app-instance backend registers user/backend
-  -> provider onboarding
-  -> deploy-control configures provider
-  -> user enters app-instance URL
-```
-
-Product requirements:
-
- Clear success/failure state during provisioning.
- Provider setup can be skipped, but the instance must later explain that model config is missing.
- Internal control-plane endpoints stay private.
-
-### Flow B: Chat To Managed Task
-
-```text
-User message
-  -> chat workbench
-  -> backend task router
-  -> ordinary chat or task mode
-  -> task created
-  -> Agent execution
-  -> tool calls and artifacts
-  -> task timeline
-  -> user accepts / asks revision / abandons
-```
-
-Product requirements:
-
- The user must understand when a message became a task.
- The task must be recoverable from chat, task list, and details page.
- Acceptance feedback must influence future learning.
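
The accept / revise / abandon step in Flow B implies a small task state machine. The sketch below is a minimal illustration under assumed names: `TaskState`, the action strings, and the transition table are hypothetical and are not taken from the Beaver backend.

```python
from enum import Enum


class TaskState(Enum):
    # Hypothetical states; the real task model may use different ones.
    CREATED = "created"
    RUNNING = "running"
    AWAITING_REVIEW = "awaiting_review"
    ACCEPTED = "accepted"
    ABANDONED = "abandoned"


# (current state, action) -> next state. Anything not listed is rejected.
TRANSITIONS = {
    (TaskState.CREATED, "start"): TaskState.RUNNING,
    (TaskState.RUNNING, "finish"): TaskState.AWAITING_REVIEW,
    (TaskState.AWAITING_REVIEW, "accept"): TaskState.ACCEPTED,
    (TaskState.AWAITING_REVIEW, "revise"): TaskState.RUNNING,
    (TaskState.AWAITING_REVIEW, "abandon"): TaskState.ABANDONED,
}


def apply_action(state: TaskState, action: str) -> TaskState:
    """Return the next state, or raise ValueError for a disallowed action."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"{action!r} is not allowed from {state.value}") from None
```

Keeping the allowed transitions in one table makes it easy to show the same lifecycle consistently in chat, the task list, and the details page.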
-
-### Flow C: Complex Task With Agent Team
-
-```text
-Task request
-  -> TaskExecutionPlanner
-  -> ExecutionGraph
-       -> sequence / parallel / DAG nodes
-  -> TaskSkillResolver binds skills or ephemeral guidance
-  -> LocalAgentRunner executes nodes
-  -> main Agent synthesizes final answer
-  -> evidence saved
-```
-
-Product requirements:
-
- Team execution should be visible without overwhelming users.
- Failed subtasks should be diagnosable.
- Final synthesis should cite or summarize subtask evidence.
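
To make the sequence / parallel / DAG idea in Flow C concrete, the sketch below groups graph nodes into waves that could run in parallel. It uses Python's standard `graphlib` module and does not reflect how `TaskExecutionPlanner` or `ExecutionGraph` are actually implemented; the function name and the node names in the usage example are assumptions.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group nodes into waves; nodes in one wave have no unmet dependencies.

    dependencies maps each node to the set of nodes it depends on.
    A cyclic graph raises graphlib.CycleError from prepare().
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    graph = {
        "research": set(),
        "draft": {"research"},
        "charts": {"research"},
        "synthesis": {"draft", "charts"},
    }
    print(execution_waves(graph))
```

A pure sequence is the special case where every wave holds one node, and a pure parallel step is a single wave with several nodes.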
-
-### Flow D: Skill Learning Loop
-
-```text
-Accepted task
-  -> skill learning candidate
-  -> draft synthesis
-  -> safety report
-  -> eval report
-  -> human review
-  -> publish
-  -> future skill retrieval
-```
-
-Product requirements:
-
- Only accepted or otherwise high-signal work should become skill candidates.
- Publishing requires review and gates.
- Skill quality must be traceable over versions.
-
-### Flow E: File And Tool Work
-
-```text
-User uploads file or Agent needs file/tool
-  -> workspace file API or tool registry
-  -> Agent tool execution
-  -> result returned to context
-  -> event/evidence saved
-  -> artifact available in task or files
-```
-
-Product requirements:
-
- User-visible file roots must stay simple.
- Tool calls must be recorded.
- Dangerous tools need policy and review.
-
-### Flow F: Scheduled Work And Notifications
-
-```text
-User creates scheduled job
-  -> cron service stores job
-  -> scheduled run triggers task/notification
-  -> user reviews output
-  -> output can become normal task continuation
-```
-
-Product requirements:
-
- Scheduled outputs need the same acceptance path as manual tasks.
- Failed scheduled runs need alerts and retry/recovery.
-
-### Flow G: External Connectors
-
-```text
-Connector setup
-  -> channel/connector config
-  -> sidecar or external provider
-  -> inbound event or outbound action
-  -> Beaver task/runtime
-  -> response or notification
-```
-
-Product requirements:
-
- External writes need clear user/admin control.
- Connector onboarding must show state, errors, and reconnect steps.
- Multi-instance callback routing must be explicit.
-
-## 6. Governance Boundaries
-
-### Instance Boundary
-
-Each app instance owns:
-
- `config.json`
- `web_auth_users.json`
- `workspace/`
- skills and runtime state
- provider configuration
-
-Risk:
-
- Cross-instance leakage would be a critical incident.
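
One way to test the instance boundary is to check that every file path resolves inside the instance's own home directory. The sketch below is an illustrative check, not Beaver's file API: `resolve_inside` and the example directory names are hypothetical, and it assumes Python 3.9+ for `Path.is_relative_to`.

```python
from pathlib import Path


def resolve_inside(instance_home: Path, user_path: str) -> Path:
    """Resolve user_path under instance_home, rejecting anything that escapes it.

    Both paths are resolved first so that ".." segments and symlinks
    cannot be used to step outside the instance directory.
    """
    root = instance_home.resolve()
    candidate = (root / user_path).resolve()
    if not candidate.is_relative_to(root):
        raise PermissionError(f"{user_path!r} escapes the instance boundary")
    return candidate


if __name__ == "__main__":
    home = Path("/tmp/beaver-home-a")  # hypothetical instance directory
    print(resolve_inside(home, "workspace/report.md"))
    try:
        resolve_inside(home, "../beaver-home-b/config.json")
    except PermissionError as err:
        print("rejected:", err)
```

An absolute `user_path` replaces the root when joined, so it fails the same check unless it already points inside the instance home.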
-
-### Control Plane Boundary
-
-Public exposure should be limited to:
-
- Auth portal.
- Router proxy for app instances.
-
-Do not expose:
-
- `deploy-control`.
- `authz-service`.
-
-### Tool Boundary
-
-Tools are the action surface. Policy should distinguish:
-
- Read-only tools.
- Workspace-scoped write tools.
- External write tools.
- Destructive tools.
- Credential/permission/payment tools.
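
The five tool classes above could be expressed as an ordered risk level with a review threshold. This is only a sketch: the ordering of the last two classes, the threshold, the decision strings, and the tool names in the usage example are all assumptions, and the real policy surface in `permissions` may look quite different.

```python
from enum import IntEnum


class ToolRisk(IntEnum):
    # Order follows the list above; treating it as a strict ranking is an assumption.
    READ_ONLY = 1
    WORKSPACE_WRITE = 2
    EXTERNAL_WRITE = 3
    DESTRUCTIVE = 4
    CREDENTIAL_OR_PAYMENT = 5


# Assumed threshold: external writes and anything riskier need human review.
REVIEW_THRESHOLD = ToolRisk.EXTERNAL_WRITE


def decide(tool_name: str, risk: ToolRisk, pilot_allowlist: set[str]) -> str:
    """Return "deny", "needs_review", or "allow" for one tool call."""
    if tool_name not in pilot_allowlist:
        return "deny"
    if risk >= REVIEW_THRESHOLD:
        return "needs_review"
    return "allow"


if __name__ == "__main__":
    allowlist = {"read_file", "send_email"}  # hypothetical pilot-safe tools
    print(decide("read_file", ToolRisk.READ_ONLY, allowlist))
    print(decide("send_email", ToolRisk.EXTERNAL_WRITE, allowlist))
    print(decide("drop_table", ToolRisk.DESTRUCTIVE, allowlist))
```

Checking the allowlist before the risk level means a tool that was never approved for the pilot is denied regardless of how harmless it looks.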
-
-### Skill Boundary
-
-Skills guide Agent behavior and tool use. Publishing a bad skill can create repeated bad behavior. Skill publishing therefore needs:
-
- Candidate quality signal.
- Safety report.
- Eval/replay evidence where possible.
- Human review.
- Version rollback.
-
-### Memory Boundary
-
-Memory creates long-term product value but also trust risk. Productization should include:
-
- Source.
- Confidence.
- Last used.
- Edit/delete/freeze controls.
- Task evidence showing when memory was used.
-
-## 7. Architecture Maturity
-
-| Area | Maturity | Notes |
-| --- | --- | --- |
-| Multi-instance deployment | Pilot-ready | Needs lifecycle and health automation |
-| Chat workbench | Pilot-ready | UI docs show tested states |
-| Task lifecycle | Strong | Core product loop exists |
-| Task evidence | Strong foundation | Needs narrative/summary layer |
-| Agent team | Functional | Needs product explanation and failure UX |
-| Files | Pilot-ready | UI docs show tested workflows |
-| Tools/MCP | Functional | Needs policy hardening and admin clarity |
-| Skills | Functional | Needs stronger quality gates and reuse metrics |
-| Memory | Backend foundation | Needs visible product controls |
-| Scheduled work | Basic product capability | Needs stability and clearer run handling |
-| Connectors | Mixed maturity | Needs a pilot-safe connector list |
-| Operations | Basic | Needs health console, backup/restore, runbook |
-
-## 8. Architecture Risks
-
-| Risk | Severity | Mitigation |
-| --- | --- | --- |
-| Control-plane service exposed publicly | Critical | Deployment checks and docs; firewall/proxy validation |
-| Instance data leakage | Critical | Path isolation tests, authz tests, MinIO/user-files policy checks |
-| Tool side effects without review | High | Tool policy profiles, evidence logs, connector sandbox |
-| Provider misconfiguration blocks first use | High | Onboarding checks and settings diagnostics |
-| Product surface becomes hard to operate | High | Admin health console and staged pilot scope |
-| Memory trust gap | High | Memory control center before broad memory activation |
-| Skill quality drift | High | Safety/eval/replay and publish gates |
-
-## 9. Recommended Architecture Roadmap
-
-### Next 30 Days
-
- Rehearse clean deployment and record missing steps.
- Add pilot health checklist for auth portal, authz, deploy control, router, and app instance.
- Define pilot-safe tools and connectors.
- Add task evidence narrative summary.
- Track accepted task, skill candidate, and skill reuse events.
-
-### Next 90 Days
-
- Memory Control Center MVP.
- Admin Health Console MVP.
- Instance suspend/resume/backup/restore runbook or tooling.
- Connector sandboxing and side-effect policy.
- Skill replay/eval as part of skill governance.
- Organization/team-level roadmap decision.
-
-## 10. Product Architecture Principle
-
-Beaver should keep its product architecture centered on controlled Agent work:
-
-```text
-private workspace
-  + task lifecycle
-  + tool/file execution
-  + evidence
-  + acceptance
-  + skill/memory reuse
-  + operational governance
-```
-
-New features should strengthen this loop. Features that do not improve execution, evidence, acceptance, reuse, or governance should be treated as secondary until the pilot motion is proven.
--- a/docs/product-discovery/beaver/product-discovery-report.md
+++ b/docs/product-discovery/beaver/product-discovery-report.md
@ -1,494 +0,0 @@
-# Beaver Product Discovery Report
-
-Date: 2026-06-09
-
-Product stage: existing product
-
-Scope: the whole Beaver product, including deployment, runtime, UI, Agent execution, tasks, files, tools, skills, memory, connectors, scheduled work, governance, validation, launch, and maintenance.
-
-## Executive Summary
-
-Beaver is an enterprise Agent sandbox and execution platform. Its product promise is to move AI from "chat that gives answers" to "controlled Agent work that creates deliverables, records evidence, asks for acceptance, and turns accepted work into reusable capability."
-
-The strongest product wedge is not another chatbot UI. It is the full execution loop:
-
-```text
-user request
-  -> task recognition
-  -> Agent/team execution
-  -> tool and file work
-  -> evidence timeline
-  -> user acceptance or revision
-  -> skill and memory learning
-  -> future reuse
-```
-
-The current codebase already supports major parts of this loop: multi-instance Docker deployment, auth portal, app instances, chat workbench, task center, task details, user acceptance, files, tools, skills, skill learning, marketplace, settings, connectors, scheduled jobs, and backend Agent team orchestration. The next product challenge is packaging these capabilities into a clear buyer story, validating the highest-value use cases, hardening operational reliability, and making governance understandable to non-engineer stakeholders.
-
-Recommended product strategy:
-
-1. Position Beaver as "enterprise Agent execution and governance," not as a general AI chat app.
-2. Focus first on repeatable knowledge work that is high-frequency, cross-tool, evidence-sensitive, and review-heavy.
-3. Treat task acceptance, evidence, skills, and memory as the core product loop.
-4. Productize deployment and operations enough for pilots before broad feature expansion.
-5. Validate value through real workflows, not opinions about AI.
-
-## Product Summary
-
-### Product Description
-
-Beaver is a private-deployable Agent workspace for teams that need AI to perform work, not only answer questions. A user can chat, upload files, trigger tasks, review execution evidence, accept or revise results, manage tools, install or publish skills, configure model providers, connect external systems, and run scheduled work.
-
-### Target Users
-
-| Segment | Primary Need | Why Beaver Fits |
-| --- | --- | --- |
-| Enterprise AI platform owner | Provide controlled Agent capability to teams | Private deployment, per-instance boundaries, tools, skills, governance |
-| Knowledge workflow team | Finish recurring multi-step work faster | Task execution, files, tools, acceptance, scheduled work |
-| Project / delivery team | Produce and revise deliverables with traceability | Task timeline, artifacts, evidence, revision loop |
-| Engineering / support team | Use AI with files, commands, logs, and review | Tool execution, task evidence, multi-agent planning |
-| Operations / admin | Configure models, users, connectors, and instances | Auth portal, deploy control, settings, status, logs |
-| Skill owner / reviewer | Turn successful work into reusable methods | Skill candidates, drafts, safety/eval reports, review, publish |
-
-### Current Feature Map
-
-| Domain | Current State | Product Meaning |
-| --- | --- | --- |
-| Auth and onboarding | Auth portal, register/login, model provider onboarding | Users can enter a controlled workspace |
-| Multi-instance deployment | Deploy control creates isolated app-instance containers; router proxy routes by host | Enables per-user or per-team sandboxing |
-| Chat workbench | Conversations, attachments, task cards, current task progress, acceptance controls | Main user workspace |
-| Task runtime | Auto task recognition, task creation, runs, timeline, status, acceptance | Converts chat into managed work |
-| Agent execution | Unified engine, main agent, sub-agent/team execution, sequence/parallel/DAG coordinator | Handles complex work beyond one response |
-| Tools | Built-in tools, MCP tools, tool management UI | Lets Agents act on files, web, terminal, integrations |
-| Files | Workspace file browser, upload, preview, download, delete | Gives AI and users a shared working surface |
-| Skills | Published skills, candidates, drafts, safety/eval, review, publish | Turns accepted work into reusable methods |
-| Marketplace | Skill discovery/install flow | Foundation for capability distribution |
-| Memory | Backend long-term memory foundation exists; product integration is still incomplete | Future compounding personalization and organization knowledge |
-| Scheduled work | Cron jobs, notifications, scheduled task flows | Moves from reactive chat to proactive work |
-| Connectors | Outlook and external connector architecture; Feishu/Weixin-related sidecar paths | Brings Agent into real business channels |
-| Settings/status/logs | Provider config, agent config, channel config, runtime status, restart | Admin control and troubleshooting |
-
-### Current Value Proposition
-
-For enterprise teams:
-
-> Beaver provides a private Agent workspace where AI work is executed, tracked, reviewed, and reused. It gives teams the speed of AI assistance with the control needed for real business workflows.
-
-For product pilots:
-
-> Beaver is strongest when a team has recurring knowledge work that crosses files, tools, systems, and reviews.
-
-### Current Challenges
-
-| Challenge | Why It Matters |
-| --- | --- |
-| Product breadth is large | Buyers may not understand what to adopt first |
-| Memory is partly backend-ready but not fully productized | The "the more you use it, the better it understands you" story needs visible user control |
-| Connector maturity varies by channel | Customer demos must avoid overpromising |
-| Multi-instance deployment is powerful but operationally sensitive | Pilot success depends on stable setup and clear runbooks |
-| Skill learning needs strong governance | Reuse can become risk if publishing is weak |
-| Metrics are not yet productized | Hard to prove pilot value without baseline and target |
-| Customer research is not yet captured | Current roadmap is inferred from implementation and product judgment |
-
-## User Segments
-
-### Segment 1: Enterprise AI Platform Owner
-
-They need to safely introduce Agent capability into an organization. Their concern is not whether an LLM can answer a question; it is whether teams can use it without losing control of data, tools, cost, and quality.
-
-### Segment 2: Workflow Owner
-
-They own a recurring process such as weekly reporting, project status, proposal drafting, research, operations follow-up, support triage, or document review. They want less manual coordination and more repeatable output.
-
-### Segment 3: Individual Knowledge Worker
-
-They want one workspace where they can chat, upload files, run tools, generate artifacts, and continue a task until the output is usable.
-
-### Segment 4: Admin / Operator
-
-They need to create instances, configure models, monitor status, debug logs, manage connectors, and keep deployment safe.
-
-### Segment 5: Skill Maintainer
-
-They curate reusable skills, review drafts, evaluate safety, publish stable versions, and prevent low-quality automation from spreading.
-
-## JTBD
-
-| User | Job Story | Current Alternative | Beaver Outcome |
-| --- | --- | --- | --- |
-| Platform owner | When teams ask for AI tools, I want a controlled Agent workspace so they can experiment without unmanaged SaaS sprawl | ChatGPT accounts, custom scripts, internal demos | Private, governed Agent workspace |
-| Workflow owner | When a recurring process takes many manual steps, I want AI to execute and track it so my team can review the result | Manual docs, spreadsheets, Slack/email coordination | Task with timeline, artifacts, acceptance |
-| Knowledge worker | When I ask AI to produce something, I want to revise and accept it as work, not just receive a message | Chat thread and copy/paste | Task lifecycle and deliverable loop |
-| Admin | When a user registers, I want a workspace created and routed automatically so onboarding is repeatable | Manual container setup | Auth portal + deploy control + router proxy |
-| Skill maintainer | When a task succeeds, I want to turn its method into a reusable skill so future tasks improve | Prompt docs, tribal knowledge | Skill candidate/draft/review/publish |
-| Security reviewer | When Agents use tools, I want evidence and boundaries so I can audit behavior | Opaque model/tool calls | Tool traces, task evidence, instance sandbox |
-
-## Opportunity Areas
-
-Opportunity scores are qualitative estimates from current docs and product context. They need validation with customer interviews and pilot data.
-
-| Opportunity | Importance | Current Satisfaction | Opportunity Score | Notes |
-| --- | ---: | ---: | ---: | --- |
-| I need AI outputs to become reviewable tasks, not loose chat replies | 0.95 | 0.30 | 0.67 | Core wedge |
-| I need evidence of what the Agent did | 0.90 | 0.35 | 0.59 | Governance driver |
-| I need repeatable workflows to become reusable skills | 0.85 | 0.40 | 0.51 | Learning moat |
-| I need private deployment and instance boundaries | 0.90 | 0.45 | 0.50 | Enterprise adoption |
-| I need AI to work across files, tools, and external systems | 0.85 | 0.45 | 0.47 | Workflow depth |
-| I need proactive scheduled work, not only reactive chat | 0.70 | 0.45 | 0.39 | Expansion opportunity |
-| I need memory that I can inspect and control | 0.80 | 0.25 | 0.60 | High future leverage |
-
-Top opportunities:
-
-1. Make AI work reviewable and acceptable.
-2. Make process evidence and governance visible.
-3. Turn accepted work into reusable skills and memory.
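
The report does not state how the Opportunity Score column is derived. The values in the table are consistent with `importance × (1 − satisfaction)`, rounded half-up to two decimals; treat that formula as an inference from the numbers, not as the authors' stated method. A minimal sketch that reproduces one row, using `Decimal` to avoid binary floating-point rounding surprises:

```python
from decimal import Decimal, ROUND_HALF_UP


def opportunity_score(importance: str, satisfaction: str) -> Decimal:
    """Inferred formula: importance * (1 - satisfaction), rounded half-up to 0.01."""
    raw = Decimal(importance) * (1 - Decimal(satisfaction))
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    # First row of the table: importance 0.95, current satisfaction 0.30.
    print(opportunity_score("0.95", "0.30"))
```

Under this reading, a need scores high when it matters a lot and is poorly served today, which is why the inspectable-memory row (0.80, 0.25) scores 0.60 despite its lower importance.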
-
-## Product Positioning
-
-Recommended primary positioning:
-
-> Beaver is an enterprise Agent execution and governance platform for repeatable knowledge work.
-
-Supporting message:
-
-> It gives teams a private Agent sandbox where AI can use tools, manage files, execute tasks, record evidence, ask for acceptance, and learn reusable skills from approved work.
-
-Avoid positioning Beaver as:
-
- A generic chatbot.
- A pure model gateway.
- A standalone RPA replacement.
- A developer-only Agent framework.
- A marketplace-only skill product.
-
-## Competitive Frame
-
-| Category | Strength | Gap Beaver Addresses |
-| --- | --- | --- |
-| AI chat apps | Fast answers and content generation | Weak task lifecycle, evidence, acceptance, and reuse |
-| RPA / automation | Repeatable process execution | Rigid flows, harder natural-language adaptation |
-| Agent frameworks | Developer flexibility | Missing complete user workspace and governance surface |
-| Internal scripts | Fast local automation | Poor product UX, auditability, onboarding, and scaling |
-| Enterprise AI platforms | Governance and admin | Often weak on task-level execution and skill learning loop |
-
-## Product Ideas
-
-Generated from PM, design, and engineering perspectives.
-
-### PM Ideas
-
-1. Pilot Workflow Templates: package 3-5 high-value workflows such as weekly report, project brief, support triage, document review.
-2. Team Workspace Mode: group multiple users under one organization workspace with shared skills and controlled memory.
-3. Governance Scorecard: show evidence coverage, accepted tasks, skill reuse, failed runs, and tool risk.
-4. Skill Quality Lifecycle: strengthen candidate -> draft -> safety -> eval -> review -> publish -> version rollback.
-5. ROI Dashboard: measure time saved, accepted tasks, revision rounds, reusable skill adoption.
-
-### Design Ideas
-
-1. Work Inbox: unify tasks, scheduled runs, notifications, and pending reviews.
-2. Task Evidence Narrative: convert raw events into readable "what happened" timeline.
-3. Memory Control Center: show what Beaver remembers, why, source, confidence, and edit/delete controls.
-4. First-Run Product Tour: guide a new user from provider setup to first accepted task.
-5. Admin Health Console: one page for instance, provider, connector, queue, and runtime health.
-
-### Engineering Ideas
-
-1. Tenant/Workspace Policy Profiles: control allowed tools, connectors, memory behavior, and publish gates per deployment.
-2. Connector Sandbox Layer: test external channel actions without touching production systems.
-3. Unified Evidence Schema: normalize task, tool, artifact, skill, memory, and connector events.
-4. Replay-Based Skill Evaluation: evaluate skill drafts against historical accepted runs.
-5. Instance Lifecycle Automation: suspend, resume, backup, restore, rotate secrets, inspect health.
-
-Top 5 product ideas to pursue:
-
-| Rank | Idea | Why Selected | Assumptions |
-| ---: | --- | --- | --- |
-| 1 | Pilot Workflow Templates | Gives customers a concrete starting point | Initial buyers share common workflows |
-| 2 | Task Evidence Narrative | Makes governance understandable | Reviewers value readable evidence |
-| 3 | Memory Control Center | Unlocks long-term differentiation | Users trust memory if they can inspect/control it |
-| 4 | Governance Scorecard | Helps buyers justify adoption | Platform owners need measurable proof |
-| 5 | Instance Lifecycle Automation | Reduces pilot operational risk | Deployments will grow beyond a few instances |
-
-## Key Assumptions
-
-| Assumption | Category | Impact | Uncertainty |
-| --- | --- | ---: | ---: |
-| Enterprise teams feel enough pain with chat-only AI to adopt an Agent workspace | Value | High | Medium |
-| Task acceptance is a meaningful quality signal | Value | High | Medium |
-| Users will tolerate a task workflow instead of expecting instant chat only | Usability | High | Medium |
-| Per-instance deployment is operationally acceptable for early customers | Feasibility | High | Medium |
-| Workflow owners can identify repeatable tasks worth piloting | Value | High | Low |
-| Skill reuse creates visible productivity gains | Business Viability | High | High |
-| Memory control is required before customers trust long-term memory | Trust | High | Medium |
-| Connectors are necessary for customer stickiness | Value | Medium | Medium |
-| Admins can manage model provider configuration without heavy support | Usability | Medium | Medium |
-| The team can maintain broad product surface without quality drift | Team Capability | High | High |
-
-## Prioritized Assumptions
-
-### P0 Validate Immediately
-
-| Assumption | Why It Matters | What Could Go Wrong | Validation |
-| --- | --- | --- | --- |
-| Customers prefer task-based AI execution over chat-only for real work | Core product wedge | Users see tasks as overhead | Run 3 workflow pilots and compare chat-only vs task loop |
-| Evidence timeline increases trust | Governance story depends on it | Evidence is too technical or noisy | Reviewer usability test with task timelines |
-| Private multi-instance deployment is acceptable | Adoption depends on ops fit | Setup too fragile or expensive | Deploy pilot on fresh Linux host and measure time/errors |
-| Accepted tasks can generate reusable skills that users value | Learning loop depends on this | Skills are low quality or unused | Track reuse of skills from accepted pilot tasks |
-
-### P1 Important
-
-| Assumption | Why It Matters | Validation |
-| --- | --- | --- |
-| Memory control center is required before broad rollout | Trust and differentiation | Interview pilot admins and users |
-| Connectors drive retention | External systems make workflows real | Compare pilot workflows with and without Outlook/IM connectors |
-| Scheduled work creates high-value usage | Moves Beaver from reactive to proactive | Test weekly report and reminder workflows |
-| Marketplace/skill distribution is a buyer requirement | Scaling reuse across teams | Ask platform owners during procurement |
-
-### P2 Later
-
-| Assumption | Why It Matters | Validation |
-| --- | --- | --- |
-| Multi-user team workspace is required for first paid pilots | Could reshape architecture | Validate with buyer interviews |
-| Fine-grained per-tool policies are needed in UI | Admin complexity | Observe support requests |
-| Cross-instance organization analytics is needed early | Enterprise reporting | Validate after 2-3 pilots |
-
-## Opportunity Solution Tree
-
-Desired outcome:
-
-> Within 90 days, prove that a pilot team can complete repeatable AI-assisted work with acceptance, evidence, and reuse: at least 30 accepted tasks, 5 reusable skills, 2 recurring workflows, and 0 critical deployment/security incidents.
-
-```text
-Outcome: Trusted repeatable Agent work in pilot teams
-
-Opportunity 1: I need AI outputs to become reviewable deliverables.
-  Solution 1.1: Task lifecycle with acceptance and revision.
-    Experiment: Run a project brief workflow and measure accepted output rate.
-  Solution 1.2: Task details page with evidence narrative.
-    Experiment: Ask reviewers to reconstruct what happened from timeline.
-  Solution 1.3: Work Inbox for pending reviews and scheduled outputs.
-    Experiment: Fake-door navigation item and measure clicks/asks.
-
-Opportunity 2: I need confidence that Agent tool use is controlled.
-  Solution 2.1: Tool traces and artifact timeline.
-    Experiment: Security review of 5 real tasks.
-  Solution 2.2: Admin health and policy console.
-    Experiment: Operator performs setup/debug checklist on fresh instance.
-  Solution 2.3: Connector sandbox and side-effect journals.
-    Experiment: Test external send/reply flows in sandbox mode.
-
-Opportunity 3: I need successful work to become reusable.
-  Solution 3.1: Skill candidate -> draft -> review -> publish.
-    Experiment: Convert 5 accepted tasks into skills and track reuse.
-  Solution 3.2: Memory Control Center.
-    Experiment: Prototype memory review UI and test trust/comprehension.
-  Solution 3.3: Pilot workflow templates.
-    Experiment: Package 3 templates and measure first-task success rate.
-```
-
-## Validation Experiments
-
-| Assumption | Hypothesis | Experiment | Duration | Success Criteria |
-| --- | --- | --- | --- | --- |
-| Task loop beats chat-only | Users complete more usable work with task acceptance than plain chat | Same workflow performed in chat-only and Beaver task loop | 1 week | Beaver output is accepted in fewer revision rounds than chat-only output |
-| Evidence creates trust | Reviewers can understand and audit what happened | Give 5 timelines to reviewers | 2 days | >=80% identify tools, artifacts, result, and risk |
-| Deployment is pilot-ready | Fresh host setup is repeatable | Deploy on clean Linux/WSL2 machine using docs | 1 day | Setup under 2 hours with no undocumented step |
-| Skills create reuse | Accepted tasks can become useful skills | Convert 5 pilot tasks into skills | 2 weeks | 3 skills reused at least twice |
-| Memory needs control UI | Users trust memory more with inspect/edit/delete | Clickable prototype or simple page | 3 days | >=80% say they would enable memory with controls |
-| Scheduled work matters | Recurring workflows create repeat usage | Weekly report or reminder pilot | 2-4 weeks | At least 2 recurring jobs run and get accepted outputs |
-
-## Feature Prioritization
-
-### Must Have
-
-| Feature | Impact | Effort | Risk | Reason |
-| --- | --- | --- | --- | --- |
-| Auth portal and instance onboarding | High | High | Medium | Required for any user to start |
-| Provider configuration flow | High | Medium | Medium | Model access is prerequisite |
-| Chat workbench | High | High | Medium | Primary user surface |
-| Task lifecycle and acceptance | High | High | Medium | Core differentiation |
-| Task timeline/evidence | High | High | Medium | Governance and review |
-| Files workspace | High | Medium | Medium | Most real workflows need files |
-| Tool management | High | Medium | High | Agents need controlled action surface |
-| Skills review/publish | High | High | High | Reuse loop |
-| Settings/status/logs | High | Medium | Medium | Operational support |
-| Basic deployment guide/runbook | High | Medium | Medium | Pilot readiness |
-
-### Should Have
-
-| Feature | Impact | Effort | Risk | Reason |
-| --- | --- | --- | --- | --- |
-| Pilot workflow templates | High | Medium | Low | Creates adoption path |
-| Evidence narrative layer | High | Medium | Medium | Makes audit readable |
-| Memory Control Center | High | High | Medium | Unlocks long-term trust |
-| Skill replay/eval hardening | High | High | High | Makes learning safer |
-| Scheduled workflow polish | Medium | Medium | Medium | Supports proactive use cases |
-| Connector onboarding polish | Medium | High | High | Needed for real systems |
-| Admin health console | Medium | Medium | Medium | Reduces support load |
-
-### Could Have
-
-| Feature | Reason |
-| --- | --- |
-| Multi-user organization workspace | Valuable, but changes scope and permissions |
-| Cross-instance analytics | Useful after multiple deployments |
-| Fine-grained policy UI | Need policy demand before UI complexity |
-| Audit export | Strong sales support, not first pilot blocker |
-| Cost/quality model router | Useful after usage volume grows |
-
-### Not Yet
-
-| Feature | Reason |
-| --- | --- |
-| Broad public SaaS launch | Product and ops need pilot hardening first |
-| Fully autonomous publish of skills | Human review should remain mandatory |
-| Production writes through connectors without review | Trust risk |
-| Complex enterprise RBAC before pilot validation | May overbuild before segment clarity |
-
-## Metrics Dashboard
-
-### North Star Metric
-
-Accepted Agent Workflows:
-
-> Number of AI-assisted tasks or scheduled workflows accepted by users per active pilot team per week.
-
-Why this metric: it captures real delivered value better than messages sent, tokens used, or model calls.
-
-### Input Metrics
-
-| Metric | Definition | Target For Pilot |
-| --- | --- | --- |
-| Task Creation Rate | Tasks created / active users / week | Increasing weekly |
-| Acceptance Rate | Accepted task runs / completed task runs | >=60% in pilot |
-| Revision Rate | Runs needing revision / completed runs | Trending down over time |
-| Evidence Coverage | Task runs with timeline/tool/artifact evidence / task runs | >=90% |
-| Skill Candidate Rate | Accepted tasks producing candidates / accepted tasks | >=20% after week 2 |
-| Skill Reuse Rate | Runs activating published pilot skills / task runs | >=15% after skills exist |
-| Scheduled Success Rate | Accepted scheduled outputs / scheduled runs | >=50% for selected workflows |
-| Deployment Success Time | Fresh deployment time to first working user | <2 hours for pilot |
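
The ratio definitions in the table above can be sketched in a few lines. This is a minimal illustration, not Beaver's instrumentation: the `TaskRun` record and its field names are hypothetical, and a real pilot would derive them from whatever task and timeline data the runtime actually stores.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    # Hypothetical record shape; field names are illustrative, not Beaver's schema.
    completed: bool
    accepted: bool
    needed_revision: bool
    has_evidence: bool


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def input_metrics(runs: list[TaskRun]) -> dict[str, float]:
    """Compute three of the input metrics from a list of task runs."""
    completed = [r for r in runs if r.completed]
    return {
        # Accepted task runs / completed task runs
        "acceptance_rate": ratio(sum(r.accepted for r in completed), len(completed)),
        # Runs needing revision / completed runs
        "revision_rate": ratio(sum(r.needed_revision for r in completed), len(completed)),
        # Task runs with evidence / all task runs
        "evidence_coverage": ratio(sum(r.has_evidence for r in runs), len(runs)),
    }


runs = [
    TaskRun(completed=True, accepted=True, needed_revision=False, has_evidence=True),
    TaskRun(completed=True, accepted=True, needed_revision=True, has_evidence=True),
    TaskRun(completed=True, accepted=False, needed_revision=True, has_evidence=True),
    TaskRun(completed=False, accepted=False, needed_revision=False, has_evidence=False),
]
metrics = input_metrics(runs)
```

With the sample data, two of three completed runs are accepted and three of four runs carry evidence, so the acceptance rate would pass the >=60% target while evidence coverage (75%) would miss the >=90% target.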
-
-### Guardrail Metrics
-
-| Metric | Alert |
-| --- | --- |
-| Critical tool/security incident | Any occurrence |
-| Instance creation failure rate | >10% in pilot |
-| Provider configuration failure rate | >20% |
-| Task run failure rate | >20% for 2 consecutive days |
-| Connector side-effect incident | Any unintended external write |
-| User file permission/storage incident | Any cross-user or cross-instance leak |
-| p95 task completion latency | Exceeds pilot workflow tolerance |
-
-### Business Metrics
-
-- Pilot activation: teams reaching first accepted task.
-- Time to first accepted task.
-- Weekly active task users.
-- Repeated workflow count.
-- Skill reuse per team.
-- Customer-reported time saved.
-- Pilot conversion intent.
-
-## Customer Research Plan
-
-No direct interview transcripts were provided. Research should start immediately, before the roadmap is locked.
-
-### Participants
-
-- 5 knowledge workers with recurring document/report/research workflows.
-- 3 workflow owners or team leads.
-- 3 enterprise AI platform/admin stakeholders.
-- 2 security or IT reviewers.
-- 2 engineers/operators who would deploy and maintain Beaver.
-
-### Questions
-
-- What recurring work is painful enough to delegate to an Agent?
-- What would make an AI output "acceptable" instead of just "interesting"?
-- What evidence do you need to trust Agent work?
-- What systems must the Agent connect to for the workflow to matter?
-- What would make you stop a pilot?
-- What memory or reuse behavior feels helpful vs risky?
-- What does a successful 30-day pilot need to prove?
-
-## Interview Guide
-
-### Opening
-
-We are studying how teams move AI from chat into real work. We are not asking whether you like an idea. We want examples of work you recently did.
-
-### Current Behavior
-
-- Walk me through the last time you used AI for a real work deliverable.
-- What happened after the AI gave an answer?
-- What did you copy, edit, verify, or redo manually?
-- Who reviewed the result?
-
-### Pain
-
-- What was the slowest or most annoying part?
-- What made the output hard to trust?
-- What tools or files were involved?
-- What evidence did you need but did not have?
-
-### Reuse
-
-- Have you repeated a similar workflow since then?
-- Did you reuse prompts, templates, scripts, or notes?
-- What would make that reuse safe for a team?
-
-### Governance
-
-- What AI actions would need approval?
-- What data or tools should be off limits?
-- Who needs to see the history of what happened?
-
-### Pilot
-
-- Which one workflow would you test first?
-- What result would make you expand usage?
-- What failure would make you stop?
-
-## Recommended Next 30 Days
-
-1. Pick 2-3 pilot workflows: project brief, weekly report, document review, support triage, or file processing.
-2. Run a fresh deployment rehearsal from the README/deployment guide and record gaps.
-3. Define pilot metrics and instrument accepted tasks, revisions, skill candidates, skill reuse, and run failures.
-4. Create a task evidence narrative prototype on top of existing timeline data.
-5. Package pilot workflow templates as skills or documented demos.
-6. Validate provider onboarding with 3 non-engineer users.
-7. Run security review for file boundaries, tool execution, connectors, and deploy-control exposure.
-8. Decide which connector(s) are pilot-safe.
-
-## Recommended Next 90 Days
-
-1. Complete Memory Control Center MVP.
-2. Harden skill learning with replay/eval and publish gates.
-3. Add Admin Health Console for provider, instance, connector, task queue, and runtime status.
-4. Improve instance lifecycle: suspend, resume, backup, restore, rotate secrets.
-5. Add customer-facing pilot scorecard.
-6. Formalize tool/connector policy profiles.
-7. Expand pilot from one workflow to one department.
-8. Build audit export after evidence narrative stabilizes.
-
-## Biggest Risks
-
-| Risk | Severity | Mitigation |
-| --- | --- | --- |
-| Product appears too broad and hard to adopt | High | Lead with pilot workflows and task loop |
-| Deployment complexity blocks pilots | High | Rehearsed runbook, health checks, support checklist |
-| Agent actions cause unintended side effects | Critical | Conservative tool policy, explicit connector sandboxing, evidence logs |
-| Task evidence is too technical | High | Evidence narrative and reviewer testing |
-| Skill learning publishes poor methods | High | Human review, safety/eval, replay validation |
-| Memory feels creepy or uncontrollable | High | Memory control UI before broad enablement |
-| Team spreads effort across too many modules | High | Prioritize task loop, evidence, skills, deployment reliability |
-
-## Recommended Immediate Actions
-
-1. Reframe all main product docs around Beaver as an Agent execution and governance platform.
-2. Treat Skill Replay Eval as a subfeature under the skill governance loop.
-3. Build the next roadmap around pilot workflows, not isolated modules.
-4. Make accepted tasks the main success metric.
-5. Productize memory and evidence before adding many new connectors.
-6. Prove deployment repeatability before selling broad private deployments.
--- a/docs/product-discovery/beaver/product-prd.html
+++ b/docs/product-discovery/beaver/product-prd.html
--- a/docs/product-discovery/beaver/validation-plan.md
+++ b/docs/product-discovery/beaver/validation-plan.md
@ -1,378 +0,0 @@
-# Beaver Validation Plan
-
-Date: 2026-06-09
-
-Purpose: validate Beaver as a whole product before broader rollout.
-
-## 1. Validation Strategy
-
-Beaver should be validated through real workflows, not through opinions about AI.
-
-The validation sequence:
-
-```text
-customer problem
-  -> workflow fit
-  -> first-run onboarding
-  -> task execution
-  -> evidence comprehension
-  -> acceptance/revision
-  -> skill reuse
-  -> deployment and operations
-  -> security/governance
-```
-
-## 2. Validation Questions
-
-### Product Value
-
-- Does Beaver solve a painful enough workflow problem?
-- Does task acceptance make AI work feel more reliable?
-- Do users complete more usable work than with chat-only AI?
-- Does skill reuse save time after repeated workflows?
-
-### Usability
-
-- Can users understand when chat becomes a task?
-- Can users find task evidence and artifacts?
-- Can users accept, revise, or abandon without confusion?
-- Can admins configure providers and connectors without engineering help?
-
-### Technical Feasibility
-
-- Can fresh deployments be created repeatably?
-- Can app instances stay isolated?
-- Can Agent tasks run reliably with files, tools, skills, and scheduled jobs?
-- Can failures be diagnosed from status/logs/events?
-
-### Governance And Security
-
-- Are control-plane services private?
-- Are file and workspace boundaries enforced?
-- Are tool calls recorded and reviewable?
-- Are external connector writes controlled?
-- Is memory inspectable and controllable before broad use?
-
-### Business Viability
-
-- Does a pilot team have enough recurring workflows?
-- Can the product produce measurable weekly value?
-- Can an admin operate it with acceptable support load?
-- Can the buyer justify expansion?
-
-## 3. Pilot Workflow Candidates
-
-| Workflow | Why It Fits | Required Capabilities | Success Signal |
-| --- | --- | --- | --- |
-| Weekly project report | Recurring, evidence-sensitive, review-heavy | scheduled work, files, task acceptance, artifacts | Report accepted weekly |
-| Project brief / proposal | Multi-step, document-heavy, revision-heavy | chat, files, tools, task timeline, revisions | Brief accepted after fewer rounds |
-| Document review | Clear deliverable and evidence need | files, task timeline, artifacts, acceptance | Review output accepted |
-| Support triage | Tool/context-heavy and repeatable | tasks, tools, memory, maybe connector | Triage summary accepted |
-| Research synthesis | Agent team fit, artifact-heavy | multi-agent, web/search, files, evidence | Synthesis accepted and reused |
-
-Recommended first pilot:
-
-1. Project brief or document review for manual task loop.
-2. Weekly project report for scheduled workflow.
-3. Skill reuse from the accepted outputs.
-
-## 4. Customer Discovery Validation
-
-### Participants
-
-- 5 end users.
-- 3 workflow owners.
-- 3 admins/platform owners.
-- 2 security reviewers.
-- 2 operators/engineers.
-
-### Method
-
-- 45-minute interviews using past-behavior questions.
-- 60-minute workflow walkthrough with Beaver.
-- Follow-up after one week of usage.
-
-### Evidence To Collect
-
-- Current workflow steps.
-- Time spent today.
-- Existing tools/files/systems involved.
-- Review/approval requirements.
-- Trust blockers.
-- Repeat frequency.
-- What would count as a successful pilot.
-
-### Pass Criteria
-
-- At least 3 workflows are repeated weekly or more.
-- At least 2 workflows involve files or external tools.
-- At least 2 stakeholders require evidence/auditability.
-- At least 1 team lead agrees to a real pilot workflow.
-
-## 5. Product Workflow Validation
-
-### Test 1: First Accepted Task
-
-Goal: user reaches first accepted task.
-
-Steps:
-
-1. Register or log in.
-2. Configure provider.
-3. Start from a suggested workflow or freeform chat.
-4. Upload or reference a file if needed.
-5. Let Beaver create/continue a task.
-6. Inspect output and evidence.
-7. Accept or request revision.
-
-Pass criteria:
-
-- User completes without developer assistance.
-- First accepted task occurs in one session.
-- User can explain what Beaver did.
-
-### Test 2: Revision Loop
-
-Goal: prove Beaver handles "not good enough yet."
-
-Steps:
-
-1. Run a task.
-2. Ask for a specific revision.
-3. Confirm the same task context continues.
-4. Accept revised output.
-
-Pass criteria:
-
-- Revision feedback is preserved.
-- Task timeline shows revision.
-- User does not need to restate full context.
-
-### Test 3: Evidence Review
-
-Goal: verify trust and auditability.
-
-Steps:
-
-1. Give reviewer a completed task detail page.
-2. Ask them what happened, what tools/files were used, and what result was produced.
-3. Ask whether they would approve the output.
-
-Pass criteria:
-
-- >=80% of reviewers identify the key actions and artifacts.
-- Reviewers can state at least one risk or confidence reason.
-
-### Test 4: Skill Reuse
-
-Goal: prove accepted work can compound.
-
-Steps:
-
-1. Accept a task.
-2. Generate skill candidate/draft.
-3. Review and publish skill.
-4. Run a similar task.
-5. Check whether skill activates and improves work.
-
-Pass criteria:
-
-- At least 3 pilot skills are reused at least twice.
-- Users report lower effort on the repeated task.
-
-### Test 5: Scheduled Workflow
-
-Goal: validate proactive work.
-
-Steps:
-
-1. Create scheduled job.
-2. Trigger or wait for scheduled run.
-3. Review notification/output.
-4. Accept or revise.
-
-Pass criteria:
-
-- Scheduled run is visible.
-- Output enters review flow.
-- Failed run has clear recovery path.
-
-## 6. Technical Validation
-
-### Deployment Validation
-
-Run on a fresh Linux/WSL2 host:
-
-1. Build images.
-2. Create Docker network.
-3. Start router proxy.
-4. Start authz service.
-5. Start deploy control.
-6. Start auth portal.
-7. Register user.
-8. Configure provider.
-9. Open app instance.
-10. Complete first task.
-
-Pass criteria:
-
-- Under 2 hours with docs only.
-- No undocumented environment variables.
-- Public exposure limited to auth portal and router proxy.
-
-### Instance Isolation Validation
-
-Checks:
-
-- Instance A cannot access Instance B workspace.
-- User file roots stay scoped.
-- Router routes each host to the correct container.
-- Provider config is instance-specific.
-- Deleting one instance does not affect another.
-
-Pass criteria:
-
-- No cross-instance reads/writes.
-- Registry state remains consistent.
-
-### Runtime Validation
-
-Checks:
-
-- Chat API.
-- WebSocket/runtime status.
-- Task creation and deletion.
-- Task detail events.
-- File upload/preview/download/delete.
-- Tool test.
-- Skill candidate/draft/review/publish.
-- Cron create/toggle/run/delete.
-- Settings provider save.
-- Runtime restart.
-
-Pass criteria:
-
-- Critical user flows pass on desktop and mobile viewport.
-- Failure states have visible recovery.
-
-## 7. Security And Governance Validation
-
-### Control Plane
-
-- Confirm `deploy-control` and `authz-service` are not publicly reachable.
-- Confirm tokens are required for control-plane calls.
-- Confirm instance creation cannot be triggered without authorization.
-
-### Files
-
-- Confirm only allowed user roots are visible.
-- Confirm absolute-style or cross-prefix paths are rejected.
-- Confirm delete operations require explicit user action.
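
The path checks above could be exercised with a helper along these lines. It is a sketch under stated assumptions: `resolve_in_root`, `is_within_root`, and the `/workspace/user-a` root are hypothetical names, the check is purely lexical (it does not resolve symlinks), and Beaver's actual file API may enforce boundaries differently.

```python
from pathlib import PurePosixPath


def resolve_in_root(root: str, requested: str) -> PurePosixPath:
    """Join `requested` onto `root`, rejecting paths that could escape the root."""
    candidate = PurePosixPath(requested)
    if candidate.is_absolute():
        # Absolute-style paths would replace the root entirely when joined.
        raise ValueError(f"absolute path rejected: {requested}")
    if ".." in candidate.parts:
        # Parent traversal could climb out of the user's root.
        raise ValueError(f"parent traversal rejected: {requested}")
    return PurePosixPath(root) / candidate


def is_within_root(root: str, path: str) -> bool:
    """Component-wise containment check (requires Python 3.9+).

    A plain string-prefix test would wrongly accept a sibling such as
    /workspace/user-a-evil for the root /workspace/user-a.
    """
    return PurePosixPath(path).is_relative_to(PurePosixPath(root))


ok_path = resolve_in_root("/workspace/user-a", "reports/q1.md")
```

A validation run could feed the same inputs through the real upload and preview endpoints and confirm that each rejected case here is also rejected there.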
-
-### Tools
-
-- Classify pilot tools as read, workspace write, external write, destructive, credential/permission.
-- Record tool calls in task evidence.
-- Block or require review for dangerous actions.
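
One way to picture the classification above is a small policy table keyed by tool class. The class names come from the list; the `Decision` values, the specific mapping, and the evidence record shape are assumptions for illustration, not a decided pilot policy.

```python
from enum import Enum


class ToolClass(Enum):
    READ = "read"
    WORKSPACE_WRITE = "workspace write"
    EXTERNAL_WRITE = "external write"
    DESTRUCTIVE = "destructive"
    CREDENTIAL = "credential/permission"


class Decision(Enum):
    ALLOW = "allow"
    REVIEW = "review"
    BLOCK = "block"


# Assumed conservative mapping; a real pilot would set this per tool profile.
POLICY = {
    ToolClass.READ: Decision.ALLOW,
    ToolClass.WORKSPACE_WRITE: Decision.ALLOW,
    ToolClass.EXTERNAL_WRITE: Decision.REVIEW,
    ToolClass.DESTRUCTIVE: Decision.BLOCK,
    ToolClass.CREDENTIAL: Decision.BLOCK,
}


def decide(tool_name: str, tool_class: ToolClass, evidence: list[dict]) -> Decision:
    """Look up the decision for a tool call and record it as task evidence."""
    # Unknown classes fall back to BLOCK so that gaps fail closed.
    decision = POLICY.get(tool_class, Decision.BLOCK)
    evidence.append(
        {"tool": tool_name, "class": tool_class.value, "decision": decision.value}
    )
    return decision


evidence_log: list[dict] = []
first = decide("read_file", ToolClass.READ, evidence_log)
second = decide("send_email", ToolClass.EXTERNAL_WRITE, evidence_log)
```

Recording the decision alongside the call keeps the "recorded and reviewable" check testable: a reviewer can confirm that every entry in the log maps to a class and an outcome.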
-
-### Connectors
-
-- Use sandbox/test accounts for pilot when possible.
-- Confirm callback base URL is per-instance.
-- Confirm the disconnect/reconnect path works.
-
-### Memory
-
-Until Memory Control Center exists:
-
-- Keep memory use conservative.
-- Document what is stored.
-- Avoid enabling opaque long-term memory for sensitive pilots.
-
-## 8. Usability Validation
-
-Viewports:
-
-- 320px.
-- 375px.
-- 390px.
-- 768px.
-- 1024px.
-- 1365px.
-- One mobile landscape viewport.
-
-Screens:
-
-- Auth portal login/register/provider onboarding.
-- Chat workbench.
-- Task list/detail.
-- Files.
-- Skills.
-- Marketplace.
-- Tools.
-- Notifications/cron.
-- Outlook/connectors if in pilot.
-- Settings/status/logs.
-
-Pass criteria:
-
-- No horizontal overflow.
-- No inaccessible critical controls.
-- Touch targets are usable.
-- Loading, empty, error, success, and disabled states are visible.
-
-## 9. Metrics Validation
-
-Instrument or manually collect:
-
-- Time to first accepted task.
-- Accepted tasks per user/team/week.
-- Acceptance rate.
-- Revision rate.
-- Task run failure rate.
-- Evidence coverage.
-- Skill candidates.
-- Skill drafts.
-- Published skills.
-- Skill reuse.
-- Scheduled run success.
-- Provider setup failure.
-- Instance creation failure.
-- Connector setup failure.
-
-Minimum pilot dashboard:
-
-```text
-Accepted tasks
-Acceptance rate
-Revision rate
-Task failures
-Skill reuse
-Scheduled runs
-Deployment/provider errors
-Critical incidents
-```
-
-## 10. Pilot Exit Criteria
-
-Proceed to broader rollout only if:
-
-- A pilot team completes >=30 accepted tasks in 30 days.
-- At least 2 recurring workflows are active.
-- At least 5 skills are created and 3 are reused at least twice.
-- Task acceptance rate is >=60%.
-- No critical security or deployment incidents occur.
-- Fresh deployment can be repeated from docs.
-- Admin can diagnose common failures from status/logs/runbook.
-- Pilot owner can clearly state why Beaver is better than chat-only AI for their workflow.
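
The quantitative exit criteria above lend themselves to a simple gate check. The sketch below assumes a hypothetical `PilotSummary` record; the thresholds are taken from the list, while the last three criteria (repeatable deployment, diagnosable failures, pilot owner's statement) are qualitative and still need a human sign-off.

```python
from dataclasses import dataclass


@dataclass
class PilotSummary:
    # Hypothetical summary fields; names are illustrative only.
    accepted_tasks_30d: int
    active_recurring_workflows: int
    skills_created: int
    skills_reused_twice: int
    acceptance_rate: float
    critical_incidents: int


def unmet_exit_criteria(p: PilotSummary) -> list[str]:
    """Return the quantitative exit criteria that the pilot has not met."""
    checks = {
        ">=30 accepted tasks in 30 days": p.accepted_tasks_30d >= 30,
        ">=2 recurring workflows active": p.active_recurring_workflows >= 2,
        ">=5 skills created": p.skills_created >= 5,
        ">=3 skills reused at least twice": p.skills_reused_twice >= 3,
        "acceptance rate >=60%": p.acceptance_rate >= 0.60,
        "no critical incidents": p.critical_incidents == 0,
    }
    return [name for name, passed in checks.items() if not passed]


summary = PilotSummary(
    accepted_tasks_30d=34,
    active_recurring_workflows=2,
    skills_created=6,
    skills_reused_twice=2,
    acceptance_rate=0.64,
    critical_incidents=0,
)
gaps = unmet_exit_criteria(summary)
```

Returning the list of unmet criteria, rather than a single pass/fail flag, makes it easier to map the outcome onto a follow-up decision.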
-
-## 11. Decision Matrix
-
-| Result | Decision |
-| --- | --- |
-| High task acceptance, low skill reuse | Improve skill learning and workflow templates |
-| High interest, deployment friction | Invest in deploy runbook and health console |
-| Good demos, low recurring use | Revisit target segment and workflow selection |
-| High usage, trust concerns | Prioritize evidence narrative, policy, memory controls |
-| Connector demand dominates | Narrow connector roadmap to one high-value system |
-| Memory concerns dominate | Build Memory Control Center before expansion |