Refactor app instance to Keycloak SSO

2026-06-15 15:54:39 +08:00
parent fc9fd93c36
commit 461d1300ad
246 changed files with 1350 additions and 52721 deletions
--- a/docs/product-discovery/beaver/PRD-beaver-agent-sandbox.md
+++ b/docs/product-discovery/beaver/PRD-beaver-agent-sandbox.md
@@ -1,489 +0,0 @@
-# PRD: Beaver Agent Sandbox
-
-Date: 2026-06-09
-
-Status: Product discovery draft for whole Beaver product
-
-## 1. Summary
-
-Beaver Agent Sandbox is a private-deployable workspace for enterprise Agent work. It lets users move from chat to managed tasks, execute work with files and tools, track evidence, accept or revise outputs, and turn successful work into reusable skills and memory.
-
-The first product goal is to prove that Beaver can help a pilot team complete repeatable knowledge work with more control, traceability, and reuse than chat-only AI tools.
-
-## 2. Contacts
-
-| Role | Owner | Comment |
-| --- | --- | --- |
-| Product owner | TBD | Owns positioning, roadmap, pilot metrics, research |
-| Engineering owner | TBD | Owns platform architecture and implementation quality |
-| Design owner | TBD | Owns workspace, task, review, admin, and onboarding UX |
-| Deployment owner | TBD | Owns Docker deployment, routing, instance lifecycle |
-| Security/review owner | TBD | Owns tool policy, data boundaries, connector safety |
-| Pilot owner | TBD | Owns customer/team selection and feedback loop |
-
-## 3. Background
-
-Most enterprise AI experiments start with chat. Chat is useful, but it is weak at real work:
-
- There is no durable task lifecycle.
- It is hard to see what happened.
- File and tool work is scattered.
- Results are not formally accepted or rejected.
- Successful workflows are not turned into reusable team capability.
- Admins cannot easily control deployment, tools, memory, and connectors.
-
-Beaver addresses this gap by acting as an Agent execution and governance layer. It combines a user workspace, task runtime, evidence timeline, file and tool operations, skill learning, scheduled work, connectors, and private multi-instance deployment.
-
-Why now:
-
- Teams are moving from AI demos to operational AI workflows.
- Enterprise buyers need governance, not only model access.
- Beaver already has enough implementation to support pilot workflows.
- The next step is product packaging, validation, and operational hardening.
-
-## 4. Objective
-
-### Objective
-
-Prove Beaver can deliver trusted, repeatable Agent work for pilot teams.
-
-### Key Results
-
-| Key Result | Target |
-| --- | --- |
-| Time to first accepted task | Pilot user reaches first accepted task within first session |
-| Accepted Agent Workflows | >=30 accepted tasks across pilot team within 30 days |
-| Acceptance Rate | >=60% of completed task runs accepted |
-| Evidence Coverage | >=90% of task runs show useful timeline/tool/artifact evidence |
-| Skill Reuse | >=5 reusable skills created, >=3 reused at least twice |
-| Deployment Repeatability | Fresh pilot deployment under 2 hours with documented steps |
-| Critical Incidents | 0 control-plane exposure, data leakage, or unintended external-write incidents |
-
-## 5. Market Segments
-
-### Primary Segment: Enterprise Teams Doing Repeatable Knowledge Work
-
-Examples:
-
- Project delivery teams.
- Operations teams.
- Internal strategy/research teams.
- Technical support and engineering teams.
- Customer success and sales operations teams.
-
-Their work is a good fit when it is:
-
- Repeated often.
- Multi-step.
- File-heavy.
- Tool-heavy.
- Needs review or approval.
- Benefits from a traceable process.
-
-### Buyer Segment: AI Platform Owner / IT Leader
-
-They need to provide AI capability without losing control over deployment, data, tools, and governance.
-
-### Admin Segment: Operator / Implementation Owner
-
-They set up Beaver, manage model providers, monitor health, handle connectors, and support users.
-
-### Maintainer Segment: Skill Owner
-
-They curate reusable skills and make sure published skills are safe, useful, and reviewable.
-
-## 6. Value Propositions
-
-### For Workflow Teams
-
-Beaver turns AI conversations into managed work. A request can become a task, produce artifacts, show evidence, and continue through revision until accepted.
-
-### For Platform Owners
-
-Beaver offers a private Agent sandbox with instance boundaries, tool governance, skills, and operational controls.
-
-### For Admins
-
-Beaver makes onboarding and operations more repeatable through auth portal, deploy control, routing, settings, status, and logs.
-
-### For Skill Maintainers
-
-Beaver turns accepted work into reusable skills through candidate, draft, safety/eval, review, and publish flow.
-
-### For End Users
-
-Beaver gives one place to chat, upload files, run tasks, preview outputs, review results, and reuse proven methods.
-
-## 7. Solution
-
-### 7.1 User Experience
-
-#### First-Run Experience
-
-```text
-User registers
-  -> app instance is created
-  -> user configures model provider
-  -> user enters Beaver workspace
-  -> user starts from a workflow template or chat
-  -> Beaver creates or continues a task
-  -> user accepts or revises the result
-```
-
-Requirements:
-
- Registration and instance provisioning must show clear progress and errors.
- Provider setup must be understandable and recoverable.
- If provider setup is skipped, the app must clearly explain why Agent calls cannot run.
-
-#### Daily User Workspace
-
-Primary screens:
-
- Chat workbench.
- Task list and task details.
- Files.
- Notifications and scheduled work.
- Skills and marketplace.
- Tool management.
- Settings/status/logs.
-
-Core user loop:
-
-```text
-Ask
-  -> execute
-  -> inspect evidence
-  -> accept/revise
-  -> reuse
-```
-
-#### Admin Experience
-
-Admin needs:
-
- See instance health.
- Configure providers.
- Configure channels/connectors.
- Restart safely.
- Inspect logs.
- Manage tools and skills.
- Understand failures.
-
-### 7.2 Key Features
-
-#### Authentication And Instance Provisioning
-
-Requirements:
-
- Users register or log in through auth portal.
- Registration triggers an app-instance container.
- Router maps instance host to container.
- Provider onboarding can configure model provider after instance creation.
-
-Acceptance criteria:
-
- New user can reach a working instance.
- Failed provisioning shows a recoverable error.
- `deploy-control` and `authz-service` are not public surfaces.
-
-#### Chat Workbench
-
-Requirements:
-
- Users can create/select sessions.
- Users can send text and attachments.
- Users can see Assistant messages, task cards, Agent run progress, and acceptance controls.
- Users can jump from chat to task detail.
-
-Acceptance criteria:
-
- User can complete one full chat-to-task-to-accept flow.
- Attachments can be uploaded and used.
- Current task status is visible.
-
-#### Task Lifecycle
-
-Requirements:
-
- System can distinguish ordinary chat from task requests.
- Task can be created, run, continued, revised, accepted, abandoned, or deleted.
- Task detail shows timeline, runs, tools, artifacts, result, and acceptance controls.
-
-Acceptance criteria:
-
- Task list and detail remain useful on mobile and desktop.
- Acceptance actions are persisted.
- Revision feedback continues the same task context.
-
-#### Agent Team Execution
-
-Requirements:
-
- Complex tasks can be planned as sequence, parallel, or DAG execution.
- Subtasks can bind skills or ephemeral guidance.
- Main Agent synthesizes final answer from evidence.
-
-Acceptance criteria:
-
- Subtask results are visible and debuggable.
- Failed team execution is shown without hiding partial evidence.
-
-#### Files Workspace
-
-Requirements:
-
- Users can upload, create folders, browse, preview, download, and delete files.
- Workspace roots stay understandable.
- File operations are safe within instance boundaries.
-
-Acceptance criteria:
-
- Root and nested directories work.
- Text/Markdown/image preview works.
- Long file names do not break layout.
-
-#### Tools And MCP
-
-Requirements:
-
- Admins can view, test, add, edit, delete, and refresh tools where supported.
- Agent runtime can expose tools based on task and skill context.
- Tool calls are recorded as evidence.
-
-Acceptance criteria:
-
- Tool detail and test flows work.
- Dangerous tools are governed by policy before broad rollout.
-
-#### Skills And Marketplace
-
-Requirements:
-
- Published skills can be listed, inspected, installed, uploaded, disabled, rolled back, or deleted where supported.
- Accepted work can create skill candidates.
- Candidates can become drafts.
- Drafts require safety/eval/review gates before publish.
- Marketplace supports discovery and install.
-
-Acceptance criteria:
-
- Candidate and draft flows do not reset UI state unexpectedly.
- Publish requires review gates.
- Published skill can be reused by later tasks.
-
-#### Memory
-
-Requirements:
-
- Beaver can store long-term preferences, business knowledge, historical task learning, file/artifact memory, tool experience, and reusable workflows.
- Before broad product use, users/admins need memory inspect/edit/delete/freeze controls.
-
-Acceptance criteria for Memory Control Center MVP:
-
- User can see what is remembered.
- User can see source and last-used context.
- User can edit, delete, or freeze memory.
- Task detail can show when memory affected execution.
-
-#### Scheduled Work And Notifications
-
-Requirements:
-
- Users can create scheduled jobs.
- Scheduled runs can produce notifications or tasks.
- Users can review, revise, or accept scheduled outputs.
-
-Acceptance criteria:
-
- Scheduled job can be created, toggled, run now, deleted.
- Scheduled output can enter normal task review flow.
-
-#### Connectors
-
-Requirements:
-
- Beaver can connect to external systems such as Outlook and selected IM/channel connectors.
- Connector status, setup, errors, and reconnect path must be visible.
- External writes require clear policy and safety boundary.
-
-Acceptance criteria:
-
- Pilot-safe connector list is documented.
- External connector callbacks route correctly in multi-instance deployment.
- Failed connector auth or setup is recoverable.
-
-#### Settings, Status, Logs
-
-Requirements:
-
- Users/admins can configure provider, Agent settings, channels, and runtime.
- Status page shows current app health.
- Logs help operators diagnose failures.
- Restart is confirmed before execution.
-
-Acceptance criteria:
-
- Provider save flow works.
- Runtime restart flow is protected by confirmation.
- Long config values do not break UI.
-
-### 7.3 Technology
-
-Frontend:
-
- Next.js app inside `app-instance/frontend`.
- App shell with chat, tasks, files, skills, marketplace, tools, connectors, settings, status, logs.
-
-Backend:
-
- Python Beaver backend inside `app-instance/backend`.
- Unified `beaver.engine` for Agent runtime.
- `beaver.coordinator` for multi-agent execution.
- `beaver.services` for task, cron, process, and application orchestration.
- `beaver.tools` for built-in/MCP tool execution.
- `beaver.skills` for skill loading, learning, review, publishing.
- `beaver.memory` for run memory, skills memory, long-term memory foundation.
- `beaver.interfaces` for web, MCP, channels, CLI/gateway surfaces.
-
-Deployment:
-
- `auth-portal`.
- `authz-service`.
- `deploy-control`.
- `router-proxy`.
- `app-instance`.
- Docker network and per-instance mounted runtime directories.
-
-### 7.4 Data And Evidence
-
-Important product data:
-
- Users and auth handoff.
- Instance registry.
- Provider configuration.
- Conversations and messages.
- Tasks, task runs, run events, timeline events.
- Tool calls and results.
- Files and artifacts.
- Skill receipts, candidates, drafts, safety/eval reports, reviews, published versions.
- Memory records.
- Scheduled jobs and scheduled runs.
- Connector state and events.
-
-Evidence principle:
-
-Every meaningful Agent action should become explainable later.
-
-### 7.5 Assumptions
-
- The best first customers are teams with repeatable knowledge workflows.
- Task acceptance is the right primary quality signal.
- Private deployment is a benefit, not a barrier, for early enterprise pilots.
- Teams will value skill/memory reuse after enough accepted tasks.
- Admins can operate a Docker-based deployment with a clear runbook.
- Memory must be controllable before it can be trusted.
-
-### 7.6 Non-Goals For First Pilot
-
- Broad public SaaS launch.
- Full multi-tenant organization management.
- Fully autonomous skill publishing.
- Production external writes without clear review.
- Complete enterprise RBAC.
- Unlimited connector support.
- Perfect long-term memory automation.
- Replacing human review for high-risk work.
-
-## 8. Release
-
-### Release 0: Internal Demo Readiness
-
-Scope:
-
- Clean local deployment.
- Auth portal registration/login.
- Provider onboarding.
- Chat-to-task demo.
- Task detail evidence.
- File upload/preview.
- Skills and marketplace demo.
- Settings/status/logs.
-
-Exit criteria:
-
- Demo flow works on fresh environment.
- Known limitations are documented.
- No critical security/deployment issue.
-
-### Release 1: Pilot Workflow Release
-
-Scope:
-
- 2-3 packaged workflows.
- Task acceptance and evidence as main flow.
- Files and selected tools.
- Basic scheduled workflow.
- One pilot-safe connector if stable.
- Skill candidate/draft/review/publish.
- Deployment runbook and support checklist.
-
-Exit criteria:
-
- Pilot team reaches >=30 accepted tasks in 30 days.
- >=5 reusable skills created.
- 0 critical incidents.
- Deployment under 2 hours on fresh host.
-
-### Release 2: Governance And Reuse Release
-
-Scope:
-
- Evidence narrative.
- Memory Control Center.
- Skill replay/eval governance.
- Admin health console.
- Connector policy hardening.
- Pilot scorecard.
-
-Exit criteria:
-
- Reviewers understand evidence.
- Users can inspect and control memory.
- Admins can diagnose provider/connector/runtime issues.
- Skill reuse is visible in metrics.
-
-### Release 3: Expansion Release
-
-Scope:
-
- Team/workspace concepts if validated.
- More connectors.
- Audit export.
- Cross-instance analytics.
- Policy profiles.
- Instance lifecycle automation.
-
-Exit criteria:
-
- Multiple teams can run without high support load.
- Governance story supports enterprise buying process.
-
-## Open Questions
-
- Is the first paying segment project teams, operations teams, engineering/support, or internal AI platform teams?
- Should Beaver optimize for single-user instances first or team workspaces sooner?
- Which connector is the safest and most valuable pilot connector?
- What exact tool policy should apply in customer pilots?
- What memory behavior should be on by default?
- How much raw evidence should normal users see versus admins?
- What is the backup/restore SLA for app instances?
-
-## Success Review Checklist
-
- Can a new user get to first accepted task quickly?
- Can a reviewer understand what the Agent did?
- Can an admin recover from provider or connector errors?
- Can a successful task become a reusable skill?
- Can a pilot owner prove value with metrics?
- Can security explain the deployment and tool boundaries?
--- a/docs/product-discovery/beaver/README.md
+++ b/docs/product-discovery/beaver/README.md
@@ -1,30 +1,50 @@
-# Beaver Product Discovery
+# Beaver Standalone App Instance

-This folder covers Beaver as the whole product, not only one feature.
+This branch narrows Beaver to a clean standalone app instance that an external orchestrator can deploy.

-Beaver is an enterprise Agent sandbox and execution platform. It combines private deployment, per-user app instances, chat-to-task execution, task evidence, user acceptance, files, tools, skills, memory, connectors, scheduled work, and governance.
+## Product Boundary

-## Documents
+The app instance provides:

-- [Business Strategy HTML](./index.html): business-style product discovery, strategy canvas, target users, segmentation, and competitors.
-- [Product PRD HTML](./product-prd.html): product PRD, outcome roadmap, module job stories, WWA backlog items, and test scenarios.
-- [Product Discovery Report](./product-discovery-report.md): product understanding, users, JTBD, opportunities, assumptions, experiments, priorities, metrics, and 30/90 day recommendations.
-- [Product Architecture Brief](./product-architecture-brief.md): product-facing architecture across auth, deployment control, routing, app instances, frontend, backend, Agent runtime, tools, skills, memory, files, connectors, and operations.
-- [PRD](./PRD-beaver-agent-sandbox.md): full-product PRD for the Beaver Agent Sandbox.
-- [Validation Plan](./validation-plan.md): customer, product, technical, security, usability, and business validation plan.
-- [Launch And Maintenance Runbook](./launch-maintenance-runbook.md): launch phases, readiness checks, monitoring, incident response, maintenance cadence, and rollback.
+- Chat and task workspace
+- Files, tools, skills, memory, schedules, and runtime pages
+- Backend API and WebSocket access behind the same origin
+- Keycloak SSO login with Authorization Code Flow + PKCE
+- JWT-based user identity using Keycloak `sub`
+
+The app instance does not provide:
+
+- Local registration or password login
+- User ID lifecycle management
+- Per-user instance creation
+- Hostname routing
+- Deployment control-plane APIs
+- Keycloak client provisioning
+
+## External Responsibilities
+
+The external orchestrator owns:
+
+- Container lifecycle
+- Public URL, TLS, reverse proxy, and port mapping
+- Data volume provisioning
+- `config.json` provisioning
+- Keycloak redirect URI and web origin registration
+- Multi-instance or tenant mapping, if needed later
+
+## Current SSO Values
+
+```text
+issuer:       https://keycloak.bwgdi.com/realms/beaver
+client_id:    beaver-agnet
+web_origin:   http://172.19.0.245:18080
+redirect_uri: http://172.19.0.245:18080/auth/callback
+post_logout_redirect_uri: http://172.19.0.245:18080/logout/callback
+```

 ## Source Material

 - [Project README](../../../README.md)
-- [Deployment Guide](../../../部署指南.md)
-- [Domain Guide](../../../域名配置指引.md)
 - [App Instance README](../../../app-instance/README.md)
 - [Backend README](../../../app-instance/backend/README.md)
-- [Recent Backend Features](../../../projcet_review/backend_recent_completed_features.md)
 - [UI/UX Page Docs](../../ui-ux/README.md)
-- [Customer Presentation](../../presentations/skill-replay-eval/index.html)
-
-## Related Feature Discovery
-
-- [Skill Replay Eval Discovery](../skill-replay-eval/README.md)
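Note on the README change above: the new README names Authorization Code Flow + PKCE against the listed issuer, but the diff shown here does not include the login code itself. The following is a minimal sketch, not the commit's implementation, of how a client could derive a PKCE pair and build the authorization URL from the "Current SSO Values". The function names, the `openid` scope, and the `state` handling are assumptions for illustration; the issuer, client ID, and redirect URI are copied verbatim from the README, and the `/protocol/openid-connect/auth` path is Keycloak's standard authorization endpoint under a realm issuer.

```python
import base64
import hashlib
import secrets
from urllib.parse import parse_qs, urlencode, urlparse

# Values copied verbatim from the README's "Current SSO Values" block.
ISSUER = "https://keycloak.bwgdi.com/realms/beaver"
CLIENT_ID = "beaver-agnet"
REDIRECT_URI = "http://172.19.0.245:18080/auth/callback"


def make_pkce_pair() -> tuple[str, str]:
    """Return (code_verifier, code_challenge) using the S256 method (RFC 7636)."""
    # 64 random bytes encode to 86 URL-safe characters, inside the 43..128 range.
    verifier = secrets.token_urlsafe(64)
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    # The challenge is base64url(SHA-256(verifier)) with padding removed.
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return verifier, challenge


def build_authorization_url(challenge: str, state: str) -> str:
    """Build the browser redirect URL for the authorization request."""
    params = {
        "response_type": "code",
        "client_id": CLIENT_ID,
        "redirect_uri": REDIRECT_URI,
        "scope": "openid",  # assumption: minimal OIDC scope
        "state": state,
        "code_challenge": challenge,
        "code_challenge_method": "S256",
    }
    return f"{ISSUER}/protocol/openid-connect/auth?{urlencode(params)}"


if __name__ == "__main__":
    code_verifier, code_challenge = make_pkce_pair()
    login_state = secrets.token_urlsafe(16)
    print(build_authorization_url(code_challenge, login_state))
```

In this sketch the caller keeps `code_verifier` and `state` until the callback at `/auth/callback`, then sends the verifier with the token request so Keycloak can check it against the challenge.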
--- a/docs/product-discovery/beaver/index.html
+++ b/docs/product-discovery/beaver/index.html
--- a/docs/product-discovery/beaver/launch-maintenance-runbook.md
+++ b/docs/product-discovery/beaver/launch-maintenance-runbook.md
@@ -1,455 +0,0 @@
-# Beaver Launch And Maintenance Runbook
-
-Date: 2026-06-09
-
-Scope: whole Beaver product.
-
-## 1. Launch Principle
-
-Launch Beaver through controlled pilots before broad rollout.
-
-The product has a wide operational surface: auth, deployment control, routing, per-instance app containers, model providers, Agent runtime, tools, files, skills, memory, scheduled work, and connectors. A successful launch depends as much on reliability and trust as on feature completeness.
-
-## 2. Launch Roles
-
-| Role | Responsibility |
-| --- | --- |
-| Launch owner | Owns readiness, go/no-go, rollout phases |
-| Deployment owner | Owns Docker images, network, router, instance lifecycle |
-| Backend owner | Owns Agent runtime, tasks, tools, skills, cron, APIs |
-| Frontend owner | Owns user-facing flows and UI verification |
-| Security owner | Owns control-plane exposure, data boundaries, tool/connector policy |
-| Pilot owner | Owns user onboarding, workflow selection, feedback, metrics |
-| Support owner | Owns incident triage, runbook updates, user support |
-
-## 3. Launch Phases
-
-### Phase 0: Local Internal Readiness
-
-Audience: builders and internal testers.
-
-Goals:
-
- Full local deployment works.
- Core demo flows are stable.
- Known risks are documented.
-
-Required flows:
-
- Register/login.
- Provider onboarding.
- First chat response.
- Chat-to-task.
- Task acceptance/revision.
- File upload/preview/download/delete.
- Skill list/candidate/draft/review.
- Settings/status/restart.
-
-Exit criteria:
-
- Fresh deployment run completed from docs.
- No P0 or launch-blocking P1 issues.
- Demo script works end to end.
-
-### Phase 1: Controlled Pilot
-
-Audience: one internal team or one trusted customer team.
-
-Goals:
-
- Validate real workflow value.
- Validate deployment and support process.
- Validate trust, evidence, and governance story.
-
-Constraints:
-
- Narrow workflow scope.
- Narrow connector scope.
- Conservative tool policy.
- Human review for skill publishing.
- No opaque memory use for sensitive data.
-
-Exit criteria:
-
- >=30 accepted tasks in 30 days.
- >=2 recurring workflows.
- 0 critical incidents.
- Deployment/support issues documented and reduced.
-
-### Phase 2: Expanded Pilot
-
-Audience: more users in same team or a second pilot team.
-
-Goals:
-
- Test repeatability across workflows.
- Introduce Memory Control Center or stricter memory policy if ready.
- Strengthen skill reuse and scheduled work.
-
-Exit criteria:
-
- Skill reuse becomes visible.
- Admin can operate without developer pairing for common tasks.
- Evidence and report quality are accepted by workflow owner.
-
-### Phase 3: Production Candidate
-
-Audience: broader customer or department rollout.
-
-Goals:
-
- Stabilized deployment.
- Health monitoring.
- Incident response.
- Backup/restore process.
- Policy profiles.
-
-Exit criteria:
-
- Launch owner, security owner, and deployment owner approve.
- Support process has clear ownership.
- Rollback and restore are rehearsed.
-
-## 4. Pre-Launch Checklist
-
-### Deployment
-
- [ ] Images build successfully.
- [ ] Docker network exists.
- [ ] Router proxy starts.
- [ ] AuthZ service starts.
- [ ] Deploy control starts.
- [ ] Auth portal starts.
- [ ] App instance can be created.
- [ ] App instance route works through router proxy.
- [ ] Provider config can be written and instance restarted.
- [ ] Runtime directories are persistent.
- [ ] Public exposure limited to intended services.
-
-### Product Flows
-
- [ ] Register/login works.
- [ ] Provider onboarding works.
- [ ] Chat workbench loads.
- [ ] Task creation works.
- [ ] Task detail timeline works.
- [ ] Acceptance/revision/abandon works.
- [ ] Files page works.
- [ ] Tools page works for pilot tools.
- [ ] Skills page works.
- [ ] Marketplace install works if included.
- [ ] Cron/scheduled flow works if included.
- [ ] Connector flow works if included.
- [ ] Settings/status/logs work.
-
-### Governance
-
- [ ] Tool policy for pilot is documented.
- [ ] Connector side effects are understood.
- [ ] Skill publish gates are documented.
- [ ] Memory behavior is documented.
- [ ] Data retention expectations are documented.
- [ ] User-facing limitations are documented.
-
-### Support
-
- [ ] Pilot support channel exists.
- [ ] Incident owner assigned.
- [ ] Logs and health checks are accessible.
- [ ] Backup/restore expectations are clear.
- [ ] Known issues list exists.
-
-## 5. Monitoring
-
-### Product Metrics
-
-| Metric | Owner | Cadence |
-| --- | --- | --- |
-| Accepted tasks | Pilot owner | Weekly |
-| Acceptance rate | Product owner | Weekly |
-| Revision rate | Product owner | Weekly |
-| Active workflows | Pilot owner | Weekly |
-| Skill candidates and reuse | Product owner | Weekly |
-| Scheduled run success | Backend owner | Weekly |
-| Time to first accepted task | Product/design | Per onboarding |
-
-### Operational Metrics
-
-| Metric | Owner | Alert |
-| --- | --- | --- |
-| Instance creation failures | Deployment owner | >10% during pilot |
-| Router route failures | Deployment owner | Any repeated failure |
-| Provider setup failures | Support owner | >20% of onboarded users |
-| Task run failures | Backend owner | >20% for 2 days |
-| WebSocket/runtime disconnects | Backend/frontend | Repeated user-visible failures |
-| File operation failures | Backend owner | Any permission/path issue |
-| Tool execution failures | Backend owner | Repeated by tool category |
-| Cron failures | Backend owner | Any critical scheduled workflow missed |
-| Connector failures | Integration owner | Failed auth or unintended write |
-
-### Security Metrics
-
-| Metric | Alert |
-| --- | --- |
-| Control-plane public exposure | Immediate P0 |
-| Cross-instance data access | Immediate P0 |
-| Unintended external write | Immediate P0 |
-| Credential leak in logs/report | Immediate P0 |
-| Unsafe skill publish | P1, or P0 if external action risk |
-
-## 6. Health Checks
-
-### Control Plane
-
- Auth portal reachable.
- AuthZ service reachable internally.
- Deploy control reachable internally with token.
- Router proxy has generated routes.
- Instance registry is readable and consistent.
-
-### App Instance
-
- Frontend loads.
- Backend `/api/status` responds.
- WebSocket works.
- Provider config present.
- Workspace path mounted.
- Initial skills present.
- Logs accessible.
-
-### Product Runtime
-
- Chat request succeeds.
- Task run succeeds.
- File API succeeds.
- Tool registry loads.
- Skills list loads.
- Cron scheduler active if enabled.
- Connector status loads if enabled.
-
-## 7. Incident Response
-
-### P0: Control Plane Exposed
-
-Examples:
-
- `deploy-control` accessible from public internet.
- `authz-service` accessible from public internet.
- Internal token leaked.
-
-Actions:
-
-1. Remove public route/firewall exposure.
-2. Rotate affected tokens.
-3. Review access logs.
-4. Confirm no unauthorized instance operations.
-5. Update deployment checklist.
-
-### P0: Cross-Instance Data Leak
-
-Examples:
-
- Instance A reads Instance B workspace.
- Router sends user to wrong instance.
- Shared connector callback writes to wrong instance.
-
-Actions:
-
-1. Disable affected route or instance.
-2. Preserve logs and registry.
-3. Identify path/host/callback mapping failure.
-4. Patch and add regression test.
-5. Notify affected stakeholders.
-
-### P0: Unintended External Action
-
-Examples:
-
- Email or IM message sent unexpectedly.
- Calendar invite created unexpectedly.
- External system updated without user intent.
-
-Actions:
-
-1. Disable connector or tool.
-2. Preserve task/tool evidence.
-3. Identify initiating task, tool, arguments, user, connector account.
-4. Patch policy or confirmation gate.
-5. Add test case and update pilot policy.
-
-### P1: New User Cannot Reach Instance
-
-Actions:
-
-1. Check auth portal logs.
-2. Check authz register flow.
-3. Check deploy-control register/configure flow.
-4. Check instance registry.
-5. Check router route generation.
-6. Check container state and app logs.
-
-### P1: Provider Config Broken
-
-Actions:
-
-1. Check settings/status.
-2. Confirm config path and provider fields.
-3. Test provider credentials.
-4. Restart instance if config was changed.
-5. Improve onboarding copy if user error.
-
-### P1: Task Runtime Failing
-
-Actions:
-
-1. Check backend logs.
-2. Check provider availability.
-3. Check tool registry.
-4. Check task event timeline.
-5. Reproduce with minimal chat request.
-6. Mark affected pilot workflow as paused if repeated.
-
-### P2: UI Flow Confusing
-
-Actions:
-
-1. Record screen and user quote.
-2. Add to UX issue list.
-3. Determine whether it blocks pilot success.
-4. Fix copy/layout if low effort.
-
-## 8. Maintenance Cadence
-
-### Daily During Pilot
-
- Check critical incidents.
- Check instance health.
- Check failed task runs.
- Check support channel.
- Review provider/connector errors.
-
-### Weekly
-
- Review accepted tasks and acceptance rate.
- Review workflow success/failure.
- Review skill candidates and reuse.
- Review deployment issues.
- Review security/tool/connector events.
- Update known issues and runbook.
-
-### Monthly
-
- Rehearse fresh deployment.
- Review backup/restore approach.
- Review memory and skill governance.
- Review connector roadmap.
- Review pilot ROI and expansion decision.
-
-### Quarterly
-
- Revisit product positioning.
- Revisit architecture scaling assumptions.
- Decide team workspace / RBAC roadmap.
- Review security model and policy profiles.
-
-## 9. Backup And Restore
-
-Minimum data to preserve:
-
- `authz-service/runtime/data`
- `app-instance/runtime/instances`
- `app-instance/runtime/registry`
- `router-proxy/runtime/conf.d`
-
-Per instance:
-
- `beaver-home/config.json`
- `beaver-home/web_auth_users.json`
- `beaver-home/workspace/`
- skill and runtime state under instance data.
-
-Pilot requirements:
-
- Document manual backup command.
- Document manual restore procedure.
- Test restore for at least one non-production instance before expanded pilot.
-
-## 10. Change Management
-
-Before changing any of these, require launch owner review:
-
- Routing/proxy config.
- AuthZ issuer/internal URL.
- Deploy token names or values.
- Instance registry format.
- Workspace mount paths.
- Provider config schema.
- Tool execution policy.
- Connector callback routing.
- Skill publish gates.
- Memory default behavior.
-
-## 11. Rollback
-
-Rollback options:
-
- Roll back frontend/backend image for app instances.
- Disable specific connector.
- Disable scheduled job execution.
- Disable skill learning worker.
- Disable skill publish.
- Fall back to chat-only mode for affected workflow.
- Remove public route to affected instance.
- Restore instance data from backup.
-
-Rollback triggers:
-
- P0 incident.
- Repeated instance creation failure.
- Repeated task runtime failure blocking pilot work.
- Provider config issue affecting most users.
- Connector side-effect risk.
- UI issue blocking first accepted task.
-
-## 12. Launch Communication
-
-### Internal
-
-Beaver is launching as a controlled Agent execution pilot. The launch goal is not maximum feature breadth. The goal is to prove repeatable AI-assisted work with task acceptance, evidence, and reuse.
-
-### Pilot Users
-
-Use Beaver for selected workflows where you need a concrete output. Review each result. Accept it if usable, request revision if it is close, or abandon it if it is not worth continuing. Your feedback is the signal that helps Beaver improve and reuse work.
-
-### Admins
-
-Treat Beaver as an app platform with a control plane and per-instance runtime. Keep deploy-control and authz private. Monitor instance health, provider config, tool behavior, and connector side effects.
-
-## 13. Known Limitations To Disclose
-
- Memory is not yet fully productized with user controls.
- Connector maturity varies by provider.
- The first pilot should use a narrow set of workflows.
- Some operations may still require engineering support.
- Skill learning needs human review before publish.
- Multi-user organization features are not the first pilot focus.
-
-## 14. Go / No-Go Criteria
-
-Go if:
-
- Fresh deployment works.
- First accepted task flow works.
- Evidence timeline is readable enough for pilot.
- Tool and connector policy is documented.
- Support owner is assigned.
- No critical security issue is open.
-
-No-go if:
-
- Control-plane exposure risk is unresolved.
- Cross-instance isolation is unverified.
- Provider onboarding fails for most users.
- Task runtime is unreliable.
- Pilot workflow is not defined.
- No one owns incidents or support.
--- a/docs/product-discovery/beaver/product-architecture-brief.md
+++ b/docs/product-discovery/beaver/product-architecture-brief.md
@@ -1,439 +0,0 @@
-# Beaver Product Architecture Brief
-
-Date: 2026-06-09
-
-Audience: product, engineering, delivery, security, and pilot stakeholders.
-
-## 1. Architecture Summary
-
-Beaver is built as a private-deployable, multi-instance Agent workspace.
-
-At the top level, it has five deployment components:
-
-```text
-Browser
-  -> auth-portal
-  -> authz-service
-  -> deploy-control
-  -> router-proxy
-  -> app-instance
-```
-
-Each `app-instance` contains the user-facing product:
-
-```text
-app-instance container
-  -> Nginx
-  -> Next.js frontend
-  -> Beaver backend
-  -> mounted beaver-home
-       -> config
-       -> workspace
-       -> skills
-       -> runtime data
-```
-
-The key product architecture choice is instance-level sandboxing. Each user or team can receive a separate app instance with its own config, workspace, files, skills, and runtime data.
-
-## 2. Product-Level System Map
-
-```text
-Auth and onboarding
-  auth-portal
-    -> register/login
-    -> model provider onboarding
-  authz-service
-    -> account and backend identity
-  deploy-control
-    -> create/configure/remove app-instance
-  router-proxy
-    -> route instance host to app-instance container
-
-User workspace
-  app-instance/frontend
-    -> chat workbench
-    -> tasks
-    -> files
-    -> skills
-    -> marketplace
-    -> MCP/tools
-    -> notifications/cron
-    -> connectors
-    -> settings/status/logs
-
-Agent runtime
-  app-instance/backend
-    -> interfaces
-    -> services
-    -> engine
-    -> coordinator
-    -> tools
-    -> skills
-    -> memory
-    -> integrations
-```
-
-## 3. Deployment Components
-
-### Auth Portal
-
-Responsibility:
-
- User login and registration entry.
- Provider onboarding after registration.
- Handoff into the user app instance.
-
-Product value:
-
- Gives non-technical users a clean entry point.
- Separates account onboarding from the per-instance app.
-
-Key risk:
-
- Provider configuration must be understandable and recoverable for non-engineer users.
-
-### AuthZ Service
-
-Responsibility:
-
- Account and backend identity orchestration.
- Internal token-protected coordination.
-
-Product value:
-
- Centralizes identity relationships between portal and app backends.
-
-Key risk:
-
- A misconfigured issuer or internal URL can break new app instances.
-
-### Deploy Control
-
-Responsibility:
-
- Create, configure, and manage app instances.
- Call `app-instance/create-instance.sh`.
- Write provider config and restart the instance when needed.
-
-Product value:
-
- Makes private instance provisioning repeatable.
-
-Key risk:
-
- Must not be exposed publicly.
- Needs health checks and lifecycle operations for pilot scale.
-
-### Router Proxy
-
-Responsibility:
-
- Route hostnames to the correct app instance container.
-
-Product value:
-
- Lets each instance have a stable public URL.
-
-Key risk:
-
- Domain, wildcard DNS, HTTPS, and route reload errors can block access.
-
-### App Instance
-
-Responsibility:
-
- The user-facing Beaver workspace.
- Runs frontend, backend, and Nginx in one container.
- Mounts the instance's `beaver-home` as config and workspace boundary.
-
-Product value:
-
- Provides practical sandboxing for early private deployments.
-
-Key risk:
-
- Instance lifecycle, backup, restore, and resource limits need productized operations.
-
-## 4. App Instance Product Modules
-
-### Frontend Modules
-
-| Module | Route | Product Job |
-| --- | --- | --- |
-| Chat workbench | `/` | Main workspace for conversation, attachments, task cards, and acceptance |
-| Tasks | `/tasks`, `/tasks/[taskId]` | Track ordinary and scheduled task lifecycle, timeline, evidence, artifacts |
-| Notifications | `/notifications` | Review proactive or scheduled outputs |
-| Cron | `/cron` | Manage scheduled jobs |
-| Files | `/files` | Browse, upload, preview, download, delete workspace files |
-| Skills | `/skills` | Manage published skills, candidates, drafts, safety/eval, review, publish |
-| Marketplace | `/marketplace` | Discover and install skills |
-| MCP/tools | `/mcp` | Manage tool servers, tool details, test, add, edit, delete |
-| Agents | `/agents` | Manage Agent definitions and roles |
-| Outlook/connectors | `/outlook`, settings connector panels | Connect external systems |
-| Settings/status/logs | `/settings`, `/status`, `/logs` | Configure providers, runtime, channels, health, and debugging |
-
-### Backend Modules
-
-| Module | Responsibility |
-| --- | --- |
-| `foundation` | Shared config, errors, events, utilities, base models |
-| `engine` | Unified Agent runtime used by main Agent and sub-agents |
-| `coordinator` | Multi-agent sequence/parallel/DAG execution |
-| `tools` | Built-in and MCP tool registration/execution |
-| `skills` | Skill loading, resolution, drafts, learning, review, publish |
-| `memory` | Long-term memory and run/skill stores |
-| `permissions` | Governance and policy surface |
-| `services` | Application orchestration, tasks, cron, process projection |
-| `interfaces` | Web, CLI, Gateway, channels, MCP servers |
-| `integrations` | AuthZ, MCP, external protocols, connector clients |
-
-## 5. Core Product Flows
-
-### Flow A: New User Registration And First Workspace
-
-```text
-Browser
-  -> auth-portal register
-  -> authz-service /portal/register
-  -> deploy-control /api/instances/register
-  -> create app-instance container
-  -> app-instance backend registers user/backend
-  -> provider onboarding
-  -> deploy-control configures provider
-  -> user enters app-instance URL
-```
-
-Product requirements:
-
- Clear success/failure state during provisioning.
- Provider setup can be skipped, but the instance must later explain that model config is missing.
- Internal control-plane endpoints stay private.
-
-### Flow B: Chat To Managed Task
-
-```text
-User message
-  -> chat workbench
-  -> backend task router
-  -> ordinary chat or task mode
-  -> task created
-  -> Agent execution
-  -> tool calls and artifacts
-  -> task timeline
-  -> user accepts / requests revision / abandons
-```
-
-Product requirements:
-
- The user must understand when a message became a task.
- The task must be recoverable from chat, task list, and details page.
- Acceptance feedback must influence future learning.
-
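Flow B can be sketched as a small state machine. This is a hedged illustration only: the state names, event names, and the `apply_event` helper are assumptions made for the example, and the real task runtime may model the lifecycle differently.

```python
from enum import Enum


class TaskState(Enum):
    # Hypothetical states; the brief only names the user-facing outcomes.
    CREATED = "created"
    RUNNING = "running"
    AWAITING_REVIEW = "awaiting_review"
    ACCEPTED = "accepted"
    ABANDONED = "abandoned"


# Allowed transitions keyed by (current state, event).
TRANSITIONS = {
    (TaskState.CREATED, "start"): TaskState.RUNNING,
    (TaskState.RUNNING, "finish"): TaskState.AWAITING_REVIEW,
    (TaskState.AWAITING_REVIEW, "accept"): TaskState.ACCEPTED,
    # A revision request sends the task back to Agent execution.
    (TaskState.AWAITING_REVIEW, "request_revision"): TaskState.RUNNING,
    (TaskState.AWAITING_REVIEW, "abandon"): TaskState.ABANDONED,
}


def apply_event(state: TaskState, event: str) -> TaskState:
    """Return the next state, or raise ValueError if the event is not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} is not allowed in state {state.value!r}"
        ) from None
```

Keeping the transitions in one table makes the accept / revise / abandon paths easy to audit, and rejects actions such as accepting a task that is still running.
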
-### Flow C: Complex Task With Agent Team
-
-```text
-Task request
-  -> TaskExecutionPlanner
-  -> ExecutionGraph
-       -> sequence / parallel / DAG nodes
-  -> TaskSkillResolver binds skills or ephemeral guidance
-  -> LocalAgentRunner executes nodes
-  -> main Agent synthesizes final answer
-  -> evidence saved
-```
-
-Product requirements:
-
- Team execution should be visible without overwhelming users.
- Failed subtasks should be diagnosable.
- Final synthesis should cite or summarize subtask evidence.
-
-### Flow D: Skill Learning Loop
-
-```text
-Accepted task
-  -> skill learning candidate
-  -> draft synthesis
-  -> safety report
-  -> eval report
-  -> human review
-  -> publish
-  -> future skill retrieval
-```
-
-Product requirements:
-
- Only accepted or otherwise high-signal work should become skill candidates.
- Publishing requires review and gates.
- Skill quality must be traceable over versions.
-
-### Flow E: File And Tool Work
-
-```text
-User uploads file or Agent needs file/tool
-  -> workspace file API or tool registry
-  -> Agent tool execution
-  -> result returned to context
-  -> event/evidence saved
-  -> artifact available in task or files
-```
-
-Product requirements:
-
- User-visible file roots must stay simple.
- Tool calls must be recorded.
- Dangerous tools need policy and review.
-
-### Flow F: Scheduled Work And Notifications
-
-```text
-User creates scheduled job
-  -> cron service stores job
-  -> scheduled run triggers task/notification
-  -> user reviews output
-  -> output can become normal task continuation
-```
-
-Product requirements:
-
- Scheduled outputs need the same acceptance path as manual tasks.
- Failed scheduled runs need alerts and retry/recovery.
-
-### Flow G: External Connectors
-
-```text
-Connector setup
-  -> channel/connector config
-  -> sidecar or external provider
-  -> inbound event or outbound action
-  -> Beaver task/runtime
-  -> response or notification
-```
-
-Product requirements:
-
- External writes need clear user/admin control.
- Connector onboarding must show state, errors, and reconnect steps.
- Multi-instance callback routing must be explicit.
-
-## 6. Governance Boundaries
-
-### Instance Boundary
-
-Each app instance owns:
-
- `config.json`
- `web_auth_users.json`
- `workspace/`
- skills and runtime state
- provider configuration
-
-Risk:
-
- Cross-instance leakage would be a critical incident.
-
-### Control Plane Boundary
-
-Public exposure should be limited to:
-
- Auth portal.
- Router proxy for app instances.
-
-Do not expose:
-
- `deploy-control`.
- `authz-service`.
-
-### Tool Boundary
-
-Tools are the action surface. Policy should distinguish:
-
- Read-only tools.
- Workspace-scoped write tools.
- External write tools.
- Destructive tools.
- Credential/permission/payment tools.
-
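One way to make these tool classes actionable is a policy table that maps each class to a default decision. The sketch below is an assumption-laden example: the `ToolClass` and `Decision` names and the specific defaults in `PILOT_POLICY` are illustrative, not Beaver's actual policy.

```python
from enum import Enum


class ToolClass(Enum):
    # The five classes listed under Tool Boundary.
    READ_ONLY = "read_only"
    WORKSPACE_WRITE = "workspace_write"
    EXTERNAL_WRITE = "external_write"
    DESTRUCTIVE = "destructive"
    SENSITIVE = "credential_permission_payment"


class Decision(Enum):
    ALLOW = "allow"
    REQUIRE_REVIEW = "require_review"
    DENY = "deny"


# Hypothetical pilot defaults; a real deployment would set its own.
PILOT_POLICY = {
    ToolClass.READ_ONLY: Decision.ALLOW,
    ToolClass.WORKSPACE_WRITE: Decision.ALLOW,
    ToolClass.EXTERNAL_WRITE: Decision.REQUIRE_REVIEW,
    ToolClass.DESTRUCTIVE: Decision.REQUIRE_REVIEW,
    ToolClass.SENSITIVE: Decision.DENY,
}


def decide(tool_class: ToolClass, policy: dict = PILOT_POLICY) -> Decision:
    """Look up the decision for a tool class; unknown classes are denied."""
    return policy.get(tool_class, Decision.DENY)
```

Denying anything that is missing from the table is a deliberate fail-closed choice: a new tool class has to be classified before an Agent can use it.
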
-### Skill Boundary
-
-Skills guide Agent behavior and tool use. Publishing a bad skill can create repeated bad behavior. Skill publishing therefore needs:
-
- Candidate quality signal.
- Safety report.
- Eval/replay evidence where possible.
- Human review.
- Version rollback.
-
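The publishing requirements above can be expressed as a gate over a skill draft. This is a sketch under stated assumptions: `SkillDraft`, its fields, and `publish_blockers` are hypothetical names, and treating eval as optional reflects the "where possible" wording rather than a confirmed rule. Version rollback is left out because it concerns what happens after publish.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class SkillDraft:
    """Hypothetical record of the evidence attached to a skill draft."""
    has_quality_signal: bool            # e.g. it came from an accepted task
    safety_report_passed: bool
    human_review_approved: bool
    eval_passed: Optional[bool] = None  # None means no eval/replay was possible


def publish_blockers(draft: SkillDraft) -> list[str]:
    """Return the reasons a draft cannot be published; an empty list means publishable."""
    blockers = []
    if not draft.has_quality_signal:
        blockers.append("missing candidate quality signal")
    if not draft.safety_report_passed:
        blockers.append("safety report not passed")
    if draft.eval_passed is False:
        # Eval is optional, but a failed eval always blocks publishing.
        blockers.append("eval/replay failed")
    if not draft.human_review_approved:
        blockers.append("human review not approved")
    return blockers
```
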
-### Memory Boundary
-
-Memory creates long-term product value but also trust risk. Productization should include:
-
- Source.
- Confidence.
- Last used.
- Edit/delete/freeze controls.
- Task evidence showing when memory was used.
-
-## 7. Architecture Maturity
-
-| Area | Maturity | Notes |
-| --- | --- | --- |
-| Multi-instance deployment | Pilot-ready | Needs lifecycle and health automation |
-| Chat workbench | Pilot-ready | UI docs show tested states |
-| Task lifecycle | Strong | Core product loop exists |
-| Task evidence | Strong foundation | Needs narrative/summary layer |
-| Agent team | Functional | Needs product explanation and failure UX |
-| Files | Pilot-ready | UI docs show tested workflows |
-| Tools/MCP | Functional | Needs policy hardening and admin clarity |
-| Skills | Functional | Needs stronger quality gates and reuse metrics |
-| Memory | Backend foundation | Needs visible product controls |
-| Scheduled work | Basic product capability | Needs stability and clearer run handling |
-| Connectors | Mixed maturity | Need pilot-safe connector list |
-| Operations | Basic | Needs health console, backup/restore, runbook |
-
-## 8. Architecture Risks
-
-| Risk | Severity | Mitigation |
-| --- | --- | --- |
-| Control-plane service exposed publicly | Critical | Deployment checks and docs; firewall/proxy validation |
-| Instance data leakage | Critical | Path isolation tests, authz tests, MinIO/user-files policy checks |
-| Tool side effects without review | High | Tool policy profiles, evidence logs, connector sandbox |
-| Provider misconfiguration blocks first use | High | Onboarding checks and settings diagnostics |
-| Product surface becomes hard to operate | High | Admin health console and staged pilot scope |
-| Memory trust gap | High | Memory control center before broad memory activation |
-| Skill quality drift | High | Safety/eval/replay and publish gates |
-
-## 9. Recommended Architecture Roadmap
-
-### Next 30 Days
-
- Rehearse clean deployment and record missing steps.
- Add pilot health checklist for auth portal, authz, deploy control, router, and app instance.
- Define pilot-safe tools and connectors.
- Add task evidence narrative summary.
- Track accepted task, skill candidate, and skill reuse events.
-
-### Next 90 Days
-
- Memory Control Center MVP.
- Admin Health Console MVP.
- Instance suspend/resume/backup/restore runbook or tooling.
- Connector sandboxing and side-effect policy.
- Skill replay/eval as part of skill governance.
- Organization/team-level roadmap decision.
-
-## 10. Product Architecture Principle
-
-Beaver should keep its product architecture centered on controlled Agent work:
-
-```text
-private workspace
-  + task lifecycle
-  + tool/file execution
-  + evidence
-  + acceptance
-  + skill/memory reuse
-  + operational governance
-```
-
-New features should strengthen this loop. Features that do not improve execution, evidence, acceptance, reuse, or governance should be treated as secondary until the pilot motion is proven.
--- a/docs/product-discovery/beaver/product-discovery-report.md
+++ b/docs/product-discovery/beaver/product-discovery-report.md
@@ -1,494 +0,0 @@
-# Beaver Product Discovery Report
-
-Date: 2026-06-09
-
-Product stage: existing product
-
-Scope: the whole Beaver product, including deployment, runtime, UI, Agent execution, tasks, files, tools, skills, memory, connectors, scheduled work, governance, validation, launch, and maintenance.
-
-## Executive Summary
-
-Beaver is an enterprise Agent sandbox and execution platform. Its product promise is to move AI from "chat that gives answers" to "controlled Agent work that creates deliverables, records evidence, asks for acceptance, and turns accepted work into reusable capability."
-
-The strongest product wedge is not another chatbot UI. It is the full execution loop:
-
-```text
-user request
-  -> task recognition
-  -> Agent/team execution
-  -> tool and file work
-  -> evidence timeline
-  -> user acceptance or revision
-  -> skill and memory learning
-  -> future reuse
-```
-
-The current codebase already supports major parts of this loop: multi-instance Docker deployment, auth portal, app instances, chat workbench, task center, task details, user acceptance, files, tools, skills, skill learning, marketplace, settings, connectors, scheduled jobs, and backend Agent team orchestration. The next product challenge is packaging these capabilities into a clear buyer story, validating the highest-value use cases, hardening operational reliability, and making governance understandable to non-engineer stakeholders.
-
-Recommended product strategy:
-
-1. Position Beaver as "enterprise Agent execution and governance," not as a general AI chat app.
-2. Focus first on repeatable knowledge work that is high-frequency, cross-tool, evidence-sensitive, and review-heavy.
-3. Treat task acceptance, evidence, skills, and memory as the core product loop.
-4. Productize deployment and operations enough for pilots before broad feature expansion.
-5. Validate value through real workflows, not opinions about AI.
-
-## Product Summary
-
-### Product Description
-
-Beaver is a private-deployable Agent workspace for teams that need AI to perform work, not only answer questions. A user can chat, upload files, trigger tasks, review execution evidence, accept or revise results, manage tools, install or publish skills, configure model providers, connect external systems, and run scheduled work.
-
-### Target Users
-
-| Segment | Primary Need | Why Beaver Fits |
-| --- | --- | --- |
-| Enterprise AI platform owner | Provide controlled Agent capability to teams | Private deployment, per-instance boundaries, tools, skills, governance |
-| Knowledge workflow team | Finish recurring multi-step work faster | Task execution, files, tools, acceptance, scheduled work |
-| Project / delivery team | Produce and revise deliverables with traceability | Task timeline, artifacts, evidence, revision loop |
-| Engineering / support team | Use AI with files, commands, logs, and review | Tool execution, task evidence, multi-agent planning |
-| Operations / admin | Configure models, users, connectors, and instances | Auth portal, deploy control, settings, status, logs |
-| Skill owner / reviewer | Turn successful work into reusable methods | Skill candidates, drafts, safety/eval reports, review, publish |
-
-### Current Feature Map
-
-| Domain | Current State | Product Meaning |
-| --- | --- | --- |
-| Auth and onboarding | Auth portal, register/login, model provider onboarding | Users can enter a controlled workspace |
-| Multi-instance deployment | Deploy control creates isolated app-instance containers; router proxy routes by host | Enables per-user or per-team sandboxing |
-| Chat workbench | Conversations, attachments, task cards, current task progress, acceptance controls | Main user workspace |
-| Task runtime | Auto task recognition, task creation, runs, timeline, status, acceptance | Converts chat into managed work |
-| Agent execution | Unified engine, main agent, sub-agent/team execution, sequence/parallel/DAG coordinator | Handles complex work beyond one response |
-| Tools | Built-in tools, MCP tools, tool management UI | Lets Agents act on files, web, terminal, integrations |
-| Files | Workspace file browser, upload, preview, download, delete | Gives AI and users a shared working surface |
-| Skills | Published skills, candidates, drafts, safety/eval, review, publish | Turns accepted work into reusable methods |
-| Marketplace | Skill discovery/install flow | Foundation for capability distribution |
-| Memory | Backend long-term memory foundation exists, product integration still incomplete | Future compounding personalization and organization knowledge |
-| Scheduled work | Cron jobs, notifications, scheduled task flows | Moves from reactive chat to proactive work |
-| Connectors | Outlook and external connector architecture; Feishu/Weixin-related sidecar paths | Brings Agent into real business channels |
-| Settings/status/logs | Provider config, agent config, channel config, runtime status, restart | Admin control and troubleshooting |
-
-### Current Value Proposition
-
-For enterprise teams:
-
-> Beaver provides a private Agent workspace where AI work is executed, tracked, reviewed, and reused. It gives teams the speed of AI assistance with the control needed for real business workflows.
-
-For product pilots:
-
-> Beaver is strongest when a team has recurring knowledge work that crosses files, tools, systems, and reviews.
-
-### Current Challenges
-
-| Challenge | Why It Matters |
-| --- | --- |
-| Product breadth is large | Buyers may not understand what to adopt first |
-| Memory is partly backend-ready but not fully productized | The "the more you use it, the better it understands you" story needs visible control |
-| Connector maturity varies by channel | Customer demos must avoid overpromising |
-| Multi-instance deployment is powerful but operationally sensitive | Pilot success depends on stable setup and clear runbooks |
-| Skill learning needs strong governance | Reuse can become risk if publishing is weak |
-| Metrics are not yet productized | Hard to prove pilot value without baseline and target |
-| Customer research is not yet captured | Current roadmap is inferred from implementation and product judgment |
-
-## User Segments
-
-### Segment 1: Enterprise AI Platform Owner
-
-They need to safely introduce Agent capability into an organization. Their concern is not whether an LLM can answer a question; it is whether teams can use it without losing control of data, tools, cost, and quality.
-
-### Segment 2: Workflow Owner
-
-They own a recurring process such as weekly reporting, project status, proposal drafting, research, operations follow-up, support triage, or document review. They want less manual coordination and more repeatable output.
-
-### Segment 3: Individual Knowledge Worker
-
-They want one workspace where they can chat, upload files, run tools, generate artifacts, and continue a task until the output is usable.
-
-### Segment 4: Admin / Operator
-
-They need to create instances, configure models, monitor status, debug logs, manage connectors, and keep deployment safe.
-
-### Segment 5: Skill Maintainer
-
-They curate reusable skills, review drafts, evaluate safety, publish stable versions, and prevent low-quality automation from spreading.
-
-## JTBD
-
-| User | Job Story | Current Alternative | Beaver Outcome |
-| --- | --- | --- | --- |
-| Platform owner | When teams ask for AI tools, I want a controlled Agent workspace so they can experiment without unmanaged SaaS sprawl | ChatGPT accounts, custom scripts, internal demos | Private, governed Agent workspace |
-| Workflow owner | When a recurring process takes many manual steps, I want AI to execute and track it so my team can review the result | Manual docs, spreadsheets, Slack/email coordination | Task with timeline, artifacts, acceptance |
-| Knowledge worker | When I ask AI to produce something, I want to revise and accept it as work, not just receive a message | Chat thread and copy/paste | Task lifecycle and deliverable loop |
-| Admin | When a user registers, I want a workspace created and routed automatically so onboarding is repeatable | Manual container setup | Auth portal + deploy control + router proxy |
-| Skill maintainer | When a task succeeds, I want to turn its method into a reusable skill so future tasks improve | Prompt docs, tribal knowledge | Skill candidate/draft/review/publish |
-| Security reviewer | When Agents use tools, I want evidence and boundaries so I can audit behavior | Opaque model/tool calls | Tool traces, task evidence, instance sandbox |
-
-## Opportunity Areas
-
-Opportunity scores are qualitative estimates from current docs and product context. They need validation with customer interviews and pilot data.
-
-| Opportunity | Importance | Current Satisfaction | Opportunity Score | Notes |
-| --- | ---: | ---: | ---: | --- |
-| I need AI outputs to become reviewable tasks, not loose chat replies | 0.95 | 0.30 | 0.67 | Core wedge |
-| I need evidence of what the Agent did | 0.90 | 0.35 | 0.59 | Governance driver |
-| I need repeatable workflows to become reusable skills | 0.85 | 0.40 | 0.51 | Learning moat |
-| I need private deployment and instance boundaries | 0.90 | 0.45 | 0.50 | Enterprise adoption |
-| I need AI to work across files, tools, and external systems | 0.85 | 0.45 | 0.47 | Workflow depth |
-| I need proactive scheduled work, not only reactive chat | 0.70 | 0.45 | 0.39 | Expansion opportunity |
-| I need memory that I can inspect and control | 0.80 | 0.25 | 0.60 | High future leverage |
-
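The report does not state how the Opportunity Score column is derived. The values in the table above are consistent with `importance × (1 − satisfaction)`, rounded to two decimals. That formula is an inference from the numbers, not a documented method. The Python sketch below recomputes the column under that assumption; the row labels are shortened for the example.

```python
from decimal import Decimal, ROUND_HALF_UP


def opportunity_score(importance: str, satisfaction: str) -> Decimal:
    """Inferred formula: importance * (1 - satisfaction), rounded half-up to 2 places."""
    raw = Decimal(importance) * (Decimal(1) - Decimal(satisfaction))
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# (importance, current satisfaction) pairs copied from the table above.
ROWS = {
    "reviewable tasks": ("0.95", "0.30"),
    "evidence": ("0.90", "0.35"),
    "reusable skills": ("0.85", "0.40"),
    "private deployment": ("0.90", "0.45"),
    "files/tools/systems": ("0.85", "0.45"),
    "scheduled work": ("0.70", "0.45"),
    "inspectable memory": ("0.80", "0.25"),
}

SCORES = {name: opportunity_score(i, s) for name, (i, s) in ROWS.items()}

# Opportunities ordered from highest to lowest score.
RANKED = sorted(SCORES, key=SCORES.get, reverse=True)
```

Under this reading, sorting by score places inspectable memory (0.60) just ahead of evidence (0.59). The scores are qualitative estimates, so a gap that small should not be over-interpreted.
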
-Top opportunities:
-
-1. Make AI work reviewable and acceptable.
-2. Make process evidence and governance visible.
-3. Turn accepted work into reusable skills and memory.
-
-## Product Positioning
-
-Recommended primary positioning:
-
-> Beaver is an enterprise Agent execution and governance platform for repeatable knowledge work.
-
-Supporting message:
-
-> It gives teams a private Agent sandbox where AI can use tools, manage files, execute tasks, record evidence, ask for acceptance, and learn reusable skills from approved work.
-
-Avoid positioning Beaver as:
-
- A generic chatbot.
- A pure model gateway.
- A standalone RPA replacement.
- A developer-only Agent framework.
- A marketplace-only skill product.
-
-## Competitive Frame
-
-| Category | Strength | Gap Beaver Addresses |
-| --- | --- | --- |
-| AI chat apps | Fast answers and content generation | Weak task lifecycle, evidence, acceptance, and reuse |
-| RPA / automation | Repeatable process execution | Rigid flows, harder natural-language adaptation |
-| Agent frameworks | Developer flexibility | Missing complete user workspace and governance surface |
-| Internal scripts | Fast local automation | Poor product UX, auditability, onboarding, and scaling |
-| Enterprise AI platforms | Governance and admin | Often weak on task-level execution and skill learning loop |
-
-## Product Ideas
-
-Generated from PM, design, and engineering perspectives.
-
-### PM Ideas
-
-1. Pilot Workflow Templates: package 3-5 high-value workflows such as weekly report, project brief, support triage, document review.
-2. Team Workspace Mode: group multiple users under one organization workspace with shared skills and controlled memory.
-3. Governance Scorecard: show evidence coverage, accepted tasks, skill reuse, failed runs, and tool risk.
-4. Skill Quality Lifecycle: strengthen candidate -> draft -> safety -> eval -> review -> publish -> version rollback.
-5. ROI Dashboard: measure time saved, accepted tasks, revision rounds, reusable skill adoption.
-
-### Design Ideas
-
-1. Work Inbox: unify tasks, scheduled runs, notifications, and pending reviews.
-2. Task Evidence Narrative: convert raw events into readable "what happened" timeline.
-3. Memory Control Center: show what Beaver remembers, why, source, confidence, and edit/delete controls.
-4. First-Run Product Tour: guide a new user from provider setup to first accepted task.
-5. Admin Health Console: one page for instance, provider, connector, queue, and runtime health.
-
-### Engineering Ideas
-
-1. Tenant/Workspace Policy Profiles: control allowed tools, connectors, memory behavior, and publish gates per deployment.
-2. Connector Sandbox Layer: test external channel actions without touching production systems.
-3. Unified Evidence Schema: normalize task, tool, artifact, skill, memory, and connector events.
-4. Replay-Based Skill Evaluation: evaluate skill drafts against historical accepted runs.
-5. Instance Lifecycle Automation: suspend, resume, backup, restore, rotate secrets, inspect health.
-
-Top 5 product ideas to pursue:
-
-| Rank | Idea | Why Selected | Assumptions |
-| ---: | --- | --- | --- |
-| 1 | Pilot Workflow Templates | Gives customers a concrete starting point | Initial buyers share common workflows |
-| 2 | Task Evidence Narrative | Makes governance understandable | Reviewers value readable evidence |
-| 3 | Memory Control Center | Unlocks long-term differentiation | Users trust memory if they can inspect/control it |
-| 4 | Governance Scorecard | Helps buyers justify adoption | Platform owners need measurable proof |
-| 5 | Instance Lifecycle Automation | Reduces pilot operational risk | Deployments will grow beyond a few instances |
-
-## Key Assumptions
-
-| Assumption | Category | Impact | Uncertainty |
-| --- | --- | ---: | ---: |
-| Enterprise teams feel enough pain with chat-only AI to adopt an Agent workspace | Value | High | Medium |
-| Task acceptance is a meaningful quality signal | Value | High | Medium |
-| Users will tolerate a task workflow instead of expecting instant chat only | Usability | High | Medium |
-| Per-instance deployment is operationally acceptable for early customers | Feasibility | High | Medium |
-| Workflow owners can identify repeatable tasks worth piloting | Value | High | Low |
-| Skill reuse creates visible productivity gains | Business Viability | High | High |
-| Memory control is required before customers trust long-term memory | Trust | High | Medium |
-| Connectors are necessary for customer stickiness | Value | Medium | Medium |
-| Admins can manage model provider configuration without heavy support | Usability | Medium | Medium |
-| The team can maintain broad product surface without quality drift | Team Capability | High | High |
-
-## Prioritized Assumptions
-
-### P0 Validate Immediately
-
-| Assumption | Why It Matters | What Could Go Wrong | Validation |
-| --- | --- | --- | --- |
-| Customers prefer task-based AI execution over chat-only for real work | Core product wedge | Users see tasks as overhead | Run 3 workflow pilots and compare chat-only vs task loop |
-| Evidence timeline increases trust | Governance story depends on it | Evidence is too technical or noisy | Reviewer usability test with task timelines |
-| Private multi-instance deployment is acceptable | Adoption depends on ops fit | Setup too fragile or expensive | Deploy pilot on fresh Linux host and measure time/errors |
-| Accepted tasks can generate reusable skills that users value | Learning loop depends on this | Skills are low quality or unused | Track reuse of skills from accepted pilot tasks |
-
-### P1 Important
-
-| Assumption | Why It Matters | Validation |
-| --- | --- | --- |
-| Memory control center is required before broad rollout | Trust and differentiation | Interview pilot admins and users |
-| Connectors drive retention | External systems make workflows real | Compare pilot workflows with and without Outlook/IM connectors |
-| Scheduled work creates high-value usage | Moves Beaver from reactive to proactive | Test weekly report and reminder workflows |
-| Marketplace/skill distribution is a buyer requirement | Scaling reuse across teams | Ask platform owners during procurement |
-
-### P2 Later
-
-| Assumption | Why It Matters | Validation |
-| --- | --- | --- |
-| Multi-user team workspace is required for first paid pilots | Could reshape architecture | Validate with buyer interviews |
-| Fine-grained per-tool policies are needed in UI | Admin complexity | Observe support requests |
-| Cross-instance organization analytics is needed early | Enterprise reporting | Validate after 2-3 pilots |
-
-## Opportunity Solution Tree
-
-Desired outcome:
-
-> Within 90 days, prove that a pilot team can complete repeatable AI-assisted work with acceptance, evidence, and reuse: at least 30 accepted tasks, 5 reusable skills, 2 recurring workflows, and 0 critical deployment/security incidents.
-
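The 90-day outcome above gives four concrete targets, so pilot progress can be checked mechanically. A minimal Python sketch follows. The thresholds come from the outcome statement; `PilotResults` and `unmet_targets` are hypothetical names for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PilotResults:
    """Hypothetical pilot counters; the field names are illustrative."""
    accepted_tasks: int
    reusable_skills: int
    recurring_workflows: int
    critical_incidents: int


def unmet_targets(r: PilotResults) -> list[str]:
    """Return the 90-day targets not yet met; an empty list means the outcome is reached."""
    unmet = []
    if r.accepted_tasks < 30:
        unmet.append(f"accepted tasks {r.accepted_tasks}/30")
    if r.reusable_skills < 5:
        unmet.append(f"reusable skills {r.reusable_skills}/5")
    if r.recurring_workflows < 2:
        unmet.append(f"recurring workflows {r.recurring_workflows}/2")
    if r.critical_incidents > 0:
        unmet.append(f"critical incidents {r.critical_incidents} (target 0)")
    return unmet
```
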
-```text
-Outcome: Trusted repeatable Agent work in pilot teams
-
-Opportunity 1: I need AI outputs to become reviewable deliverables.
-  Solution 1.1: Task lifecycle with acceptance and revision.
-    Experiment: Run a project brief workflow and measure accepted output rate.
-  Solution 1.2: Task details page with evidence narrative.
-    Experiment: Ask reviewers to reconstruct what happened from timeline.
-  Solution 1.3: Work Inbox for pending reviews and scheduled outputs.
-    Experiment: Fake-door navigation item; measure clicks and user requests.
-
-Opportunity 2: I need confidence that Agent tool use is controlled.
-  Solution 2.1: Tool traces and artifact timeline.
-    Experiment: Security review of 5 real tasks.
-  Solution 2.2: Admin health and policy console.
-    Experiment: Operator performs setup/debug checklist on fresh instance.
-  Solution 2.3: Connector sandbox and side-effect journals.
-    Experiment: Test external send/reply flows in sandbox mode.
-
-Opportunity 3: I need successful work to become reusable.
-  Solution 3.1: Skill candidate -> draft -> review -> publish.
-    Experiment: Convert 5 accepted tasks into skills and track reuse.
-  Solution 3.2: Memory Control Center.
-    Experiment: Prototype memory review UI and test trust/comprehension.
-  Solution 3.3: Pilot workflow templates.
-    Experiment: Package 3 templates and measure first-task success rate.
-```
-
-## Validation Experiments
-
-| Assumption | Hypothesis | Experiment | Duration | Success Criteria |
-| --- | --- | --- | --- | --- |
-| Task loop beats chat-only | Users complete more usable work with task acceptance than plain chat | Same workflow performed in chat-only and Beaver task loop | 1 week | Beaver output accepted in fewer revision rounds |
-| Evidence creates trust | Reviewers can understand and audit what happened | Give 5 timelines to reviewers | 2 days | >=80% identify tools, artifacts, result, and risk |
-| Deployment is pilot-ready | Fresh host setup is repeatable | Deploy on clean Linux/WSL2 machine using docs | 1 day | Setup under 2 hours with no undocumented step |
-| Skills create reuse | Accepted tasks can become useful skills | Convert 5 pilot tasks into skills | 2 weeks | 3 skills reused at least twice |
-| Memory needs control UI | Users trust memory more with inspect/edit/delete | Clickable prototype or simple page | 3 days | >=80% say they would enable memory with controls |
-| Scheduled work matters | Recurring workflows create repeat usage | Weekly report or reminder pilot | 2-4 weeks | At least 2 recurring jobs run and get accepted outputs |
-
-## Feature Prioritization
-
-### Must Have
-
-| Feature | Impact | Effort | Risk | Reason |
-| --- | --- | --- | --- | --- |
-| Auth portal and instance onboarding | High | High | Medium | Required for any user to start |
-| Provider configuration flow | High | Medium | Medium | Model access is prerequisite |
-| Chat workbench | High | High | Medium | Primary user surface |
-| Task lifecycle and acceptance | High | High | Medium | Core differentiation |
-| Task timeline/evidence | High | High | Medium | Governance and review |
-| Files workspace | High | Medium | Medium | Most real workflows need files |
-| Tool management | High | Medium | High | Agents need controlled action surface |
-| Skills review/publish | High | High | High | Reuse loop |
-| Settings/status/logs | High | Medium | Medium | Operational support |
-| Basic deployment guide/runbook | High | Medium | Medium | Pilot readiness |
-
-### Should Have
-
-| Feature | Impact | Effort | Risk | Reason |
-| --- | --- | --- | --- | --- |
-| Pilot workflow templates | High | Medium | Low | Creates adoption path |
-| Evidence narrative layer | High | Medium | Medium | Makes audit readable |
-| Memory Control Center | High | High | Medium | Unlocks long-term trust |
-| Skill replay/eval hardening | High | High | High | Makes learning safer |
-| Scheduled workflow polish | Medium | Medium | Medium | Supports proactive use cases |
-| Connector onboarding polish | Medium | High | High | Needed for real systems |
-| Admin health console | Medium | Medium | Medium | Reduces support load |
-
-### Could Have
-
-| Feature | Reason |
-| --- | --- |
-| Multi-user organization workspace | Valuable, but changes scope and permissions |
-| Cross-instance analytics | Useful after multiple deployments |
-| Fine-grained policy UI | Need policy demand before UI complexity |
-| Audit export | Strong sales support, not first pilot blocker |
-| Cost/quality model router | Useful after usage volume grows |
-
-### Not Yet
-
-| Feature | Reason |
-| --- | --- |
-| Broad public SaaS launch | Product and ops need pilot hardening first |
-| Fully autonomous publish of skills | Human review should remain mandatory |
-| Production writes through connectors without review | Trust risk |
-| Complex enterprise RBAC before pilot validation | May overbuild before segment clarity |
-
-## Metrics Dashboard
-
-### North Star Metric
-
-Accepted Agent Workflows:
-
-> Number of AI-assisted tasks or scheduled workflows accepted by users per active pilot team per week.
-
-Why this metric: it captures real delivered value better than messages sent, tokens used, or model calls.
-
-### Input Metrics
-
-| Metric | Definition | Target For Pilot |
-| --- | --- | --- |
-| Task Creation Rate | Tasks created / active users / week | Increasing weekly |
-| Acceptance Rate | Accepted task runs / completed task runs | >=60% in pilot |
-| Revision Rate | Runs needing revision / completed runs | Trending down over time |
-| Evidence Coverage | Task runs with timeline/tool/artifact evidence / task runs | >=90% |
-| Skill Candidate Rate | Accepted tasks producing candidates / accepted tasks | >=20% after week 2 |
-| Skill Reuse Rate | Runs activating published pilot skills / task runs | >=15% after skills exist |
-| Scheduled Success Rate | Accepted scheduled outputs / scheduled runs | >=50% for selected workflows |
-| Deployment Success Time | Fresh deployment time to first working user | <2 hours for pilot |
-
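The ratio definitions in the table above can be sketched as a small calculation. This is a minimal illustration, not Beaver's instrumentation: the `TaskRunRecord` shape and its field names are assumptions, and the input is assumed to contain completed runs only, so it serves as the denominator for all three ratios.

```typescript
// Hypothetical record for one completed task run; field names are assumptions.
interface TaskRunRecord {
  accepted: boolean;
  neededRevision: boolean;
  hasEvidence: boolean; // timeline, tool, or artifact evidence was recorded
}

interface PilotInputMetrics {
  acceptanceRate: number; // accepted runs / completed runs
  revisionRate: number; // runs needing revision / completed runs
  evidenceCoverage: number; // runs with evidence / runs
}

// Avoid NaN when there are no runs yet.
function ratio(numerator: number, denominator: number): number {
  return denominator === 0 ? 0 : numerator / denominator;
}

function computeInputMetrics(runs: TaskRunRecord[]): PilotInputMetrics {
  const total = runs.length;
  return {
    acceptanceRate: ratio(runs.filter((r) => r.accepted).length, total),
    revisionRate: ratio(runs.filter((r) => r.neededRevision).length, total),
    evidenceCoverage: ratio(runs.filter((r) => r.hasEvidence).length, total),
  };
}

// Pilot targets taken from the table: acceptance >=60%, evidence coverage >=90%.
function meetsPilotTargets(metrics: PilotInputMetrics): boolean {
  return metrics.acceptanceRate >= 0.6 && metrics.evidenceCoverage >= 0.9;
}
```

Returning 0 for an empty run list is a design choice for dashboards; a real implementation might prefer to report "no data" instead.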
-### Guardrail Metrics
-
-| Metric | Alert |
-| --- | --- |
-| Critical tool/security incident | Any occurrence |
-| Instance creation failure rate | >10% in pilot |
-| Provider configuration failure rate | >20% |
-| Task run failure rate | >20% for 2 consecutive days |
-| Connector side-effect incident | Any unintended external write |
-| User file permission/storage incident | Any cross-user or cross-instance leak |
-| p95 task completion latency | Exceeds pilot workflow tolerance |
-
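The "for 2 consecutive days" guardrail above is a streak check rather than an average. A minimal sketch, assuming one failure rate per day expressed as a fraction between 0 and 1 (the function name is illustrative):

```typescript
// Returns true when `dailyRates` contains at least `days` consecutive
// entries strictly above `threshold`.
function exceedsForConsecutiveDays(
  dailyRates: number[],
  threshold: number,
  days: number,
): boolean {
  let streak = 0;
  for (const rate of dailyRates) {
    // Any day at or below the threshold resets the streak.
    streak = rate > threshold ? streak + 1 : 0;
    if (streak >= days) return true;
  }
  return false;
}

// Task run failure rate guardrail: >20% for 2 consecutive days.
function taskFailureAlert(dailyFailureRates: number[]): boolean {
  return exceedsForConsecutiveDays(dailyFailureRates, 0.2, 2);
}
```

A single bad day between two good days does not trigger the alert, which matches the wording of the guardrail.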
-### Business Metrics
-
-- Pilot activation: teams reaching first accepted task.
-- Time to first accepted task.
-- Weekly active task users.
-- Repeated workflow count.
-- Skill reuse per team.
-- Customer-reported time saved.
-- Pilot conversion intent.
-
-## Customer Research Plan
-
-No direct interview transcripts were provided. Research should start immediately, before the roadmap is locked.
-
-### Participants
-
-- 5 knowledge workers with recurring document/report/research workflows.
-- 3 workflow owners or team leads.
-- 3 enterprise AI platform/admin stakeholders.
-- 2 security or IT reviewers.
-- 2 engineers/operators who would deploy and maintain Beaver.
-
-### Questions
-
-- What recurring work is painful enough to delegate to an Agent?
-- What would make an AI output "acceptable" instead of just "interesting"?
-- What evidence do you need to trust Agent work?
-- What systems must the Agent connect to for the workflow to matter?
-- What would make you stop a pilot?
-- What memory or reuse behavior feels helpful vs risky?
-- What does a successful 30-day pilot need to prove?
-
-## Interview Guide
-
-### Opening
-
-We are studying how teams move AI from chat into real work. We are not asking whether you like an idea. We want examples of work you recently did.
-
-### Current Behavior
-
-- Walk me through the last time you used AI for a real work deliverable.
-- What happened after the AI gave an answer?
-- What did you copy, edit, verify, or redo manually?
-- Who reviewed the result?
-
-### Pain
-
-- What was the slowest or most annoying part?
-- What made the output hard to trust?
-- What tools or files were involved?
-- What evidence did you need but did not have?
-
-### Reuse
-
-- Have you repeated a similar workflow since then?
-- Did you reuse prompts, templates, scripts, or notes?
-- What would make that reuse safe for a team?
-
-### Governance
-
-- What AI actions would need approval?
-- What data or tools should be off limits?
-- Who needs to see the history of what happened?
-
-### Pilot
-
-- Which one workflow would you test first?
-- What result would make you expand usage?
-- What failure would make you stop?
-
-## Recommended Next 30 Days
-
-1. Pick 2-3 pilot workflows: project brief, weekly report, document review, support triage, or file processing.
-2. Run fresh deployment rehearsal from README/deployment guide and record gaps.
-3. Define pilot metrics and instrument accepted tasks, revisions, skill candidates, skill reuse, and run failures.
-4. Create a task evidence narrative prototype on top of existing timeline data.
-5. Package pilot workflow templates as skills or documented demos.
-6. Validate provider onboarding with 3 non-engineer users.
-7. Run security review for file boundaries, tool execution, connectors, and deploy-control exposure.
-8. Decide which connector(s) are pilot-safe.
-
-## Recommended Next 90 Days
-
-1. Complete Memory Control Center MVP.
-2. Harden skill learning with replay/eval and publish gates.
-3. Add Admin Health Console for provider, instance, connector, task queue, and runtime status.
-4. Improve instance lifecycle: suspend, resume, backup, restore, rotate secrets.
-5. Add customer-facing pilot scorecard.
-6. Formalize tool/connector policy profiles.
-7. Expand pilot from one workflow to one department.
-8. Build audit export after evidence narrative stabilizes.
-
-## Biggest Risks
-
-| Risk | Severity | Mitigation |
-| --- | --- | --- |
-| Product appears too broad and hard to adopt | High | Lead with pilot workflows and task loop |
-| Deployment complexity blocks pilots | High | Rehearsed runbook, health checks, support checklist |
-| Agent actions cause unintended side effects | Critical | Conservative tool policy, explicit connector sandboxing, evidence logs |
-| Task evidence is too technical | High | Evidence narrative and reviewer testing |
-| Skill learning publishes poor methods | High | Human review, safety/eval, replay validation |
-| Memory feels creepy or uncontrollable | High | Memory control UI before broad enablement |
-| Team spreads effort across too many modules | High | Prioritize task loop, evidence, skills, deployment reliability |
-
-## Recommended Immediate Actions
-
-1. Reframe all main product docs around Beaver as an Agent execution and governance platform.
-2. Treat Skill Replay Eval as a subfeature under the skill governance loop.
-3. Build the next roadmap around pilot workflows, not isolated modules.
-4. Make accepted tasks the main success metric.
-5. Productize memory and evidence before adding many new connectors.
-6. Prove deployment repeatability before selling broad private deployments.
--- a/docs/product-discovery/beaver/product-prd.html
+++ b/docs/product-discovery/beaver/product-prd.html
--- a/docs/product-discovery/beaver/validation-plan.md
+++ b/docs/product-discovery/beaver/validation-plan.md
@@ -1,378 +0,0 @@
-# Beaver Validation Plan
-
-Date: 2026-06-09
-
-Purpose: validate Beaver as a whole product before broader rollout.
-
-## 1. Validation Strategy
-
-Beaver should be validated through real workflows, not through opinions about AI.
-
-The validation sequence:
-
-```text
-customer problem
-  -> workflow fit
-  -> first-run onboarding
-  -> task execution
-  -> evidence comprehension
-  -> acceptance/revision
-  -> skill reuse
-  -> deployment and operations
-  -> security/governance
-```
-
-## 2. Validation Questions
-
-### Product Value
-
-- Does Beaver solve a painful enough workflow problem?
-- Does task acceptance make AI work feel more reliable?
-- Do users complete more usable work than with chat-only AI?
-- Does skill reuse save time after repeated workflows?
-
-### Usability
-
-- Can users understand when chat becomes a task?
-- Can users find task evidence and artifacts?
-- Can users accept, revise, or abandon without confusion?
-- Can admins configure providers and connectors without engineering help?
-
-### Technical Feasibility
-
-- Can fresh deployments be created repeatably?
-- Can app instances stay isolated?
-- Can Agent tasks run reliably with files, tools, skills, and scheduled jobs?
-- Can failures be diagnosed from status/logs/events?
-
-### Governance And Security
-
-- Are control-plane services private?
-- Are file and workspace boundaries enforced?
-- Are tool calls recorded and reviewable?
-- Are external connector writes controlled?
-- Is memory inspectable and controllable before broad use?
-
-### Business Viability
-
-- Does a pilot team have enough recurring workflows?
-- Can the product produce measurable weekly value?
-- Can an admin operate it with acceptable support load?
-- Can the buyer justify expansion?
-
-## 3. Pilot Workflow Candidates
-
-| Workflow | Why It Fits | Required Capabilities | Success Signal |
-| --- | --- | --- | --- |
-| Weekly project report | Recurring, evidence-sensitive, review-heavy | scheduled work, files, task acceptance, artifacts | Report accepted weekly |
-| Project brief / proposal | Multi-step, document-heavy, revision-heavy | chat, files, tools, task timeline, revisions | Brief accepted after fewer rounds |
-| Document review | Clear deliverable and evidence need | files, task timeline, artifacts, acceptance | Review output accepted |
-| Support triage | Tool/context-heavy and repeatable | tasks, tools, memory, maybe connector | Triage summary accepted |
-| Research synthesis | Agent team fit, artifact-heavy | multi-agent, web/search, files, evidence | Synthesis accepted and reused |
-
-Recommended first pilot:
-
-1. Project brief or document review for manual task loop.
-2. Weekly project report for scheduled workflow.
-3. Skill reuse from the accepted outputs.
-
-## 4. Customer Discovery Validation
-
-### Participants
-
-- 5 end users.
-- 3 workflow owners.
-- 3 admins/platform owners.
-- 2 security reviewers.
-- 2 operators/engineers.
-
-### Method
-
-- 45-minute interviews using past-behavior questions.
-- 60-minute workflow walkthrough with Beaver.
-- Follow-up after one week of usage.
-
-### Evidence To Collect
-
-- Current workflow steps.
-- Time spent today.
-- Existing tools/files/systems involved.
-- Review/approval requirements.
-- Trust blockers.
-- Repeat frequency.
-- What would count as a successful pilot.
-
-### Pass Criteria
-
-- At least 3 workflows are repeated weekly or more.
-- At least 2 workflows involve files or external tools.
-- At least 2 stakeholders require evidence/auditability.
-- At least 1 team lead agrees to a real pilot workflow.
-
-## 5. Product Workflow Validation
-
-### Test 1: First Accepted Task
-
-Goal: user reaches first accepted task.
-
-Steps:
-
-1. Register or log in.
-2. Configure provider.
-3. Start from a suggested workflow or freeform chat.
-4. Upload or reference a file if needed.
-5. Let Beaver create/continue a task.
-6. Inspect output and evidence.
-7. Accept or request revision.
-
-Pass criteria:
-
-- User completes without developer assistance.
-- First accepted task occurs in one session.
-- User can explain what Beaver did.
-
-### Test 2: Revision Loop
-
-Goal: prove Beaver handles "not good enough yet."
-
-Steps:
-
-1. Run a task.
-2. Ask for a specific revision.
-3. Confirm the same task context continues.
-4. Accept revised output.
-
-Pass criteria:
-
-- Revision feedback is preserved.
-- Task timeline shows revision.
-- User does not need to restate full context.
-
-### Test 3: Evidence Review
-
-Goal: verify trust and auditability.
-
-Steps:
-
-1. Give reviewer a completed task detail page.
-2. Ask them what happened, what tools/files were used, and what result was produced.
-3. Ask whether they would approve the output.
-
-Pass criteria:
-
-- >=80% of reviewers identify the key actions and artifacts.
-- Reviewers can state at least one risk or confidence reason.
-
-### Test 4: Skill Reuse
-
-Goal: prove accepted work can compound.
-
-Steps:
-
-1. Accept a task.
-2. Generate skill candidate/draft.
-3. Review and publish skill.
-4. Run a similar task.
-5. Check whether skill activates and improves work.
-
-Pass criteria:
-
-- At least 3 pilot skills are reused twice.
-- Users report lower effort on repeated task.
-
-### Test 5: Scheduled Workflow
-
-Goal: validate proactive work.
-
-Steps:
-
-1. Create scheduled job.
-2. Trigger or wait for scheduled run.
-3. Review notification/output.
-4. Accept or revise.
-
-Pass criteria:
-
-- Scheduled run is visible.
-- Output enters review flow.
-- Failed run has clear recovery path.
-
-## 6. Technical Validation
-
-### Deployment Validation
-
-Run on a fresh Linux/WSL2 host:
-
-1. Build images.
-2. Create Docker network.
-3. Start router proxy.
-4. Start authz service.
-5. Start deploy control.
-6. Start auth portal.
-7. Register user.
-8. Configure provider.
-9. Open app instance.
-10. Complete first task.
-
-Pass criteria:
-
-- Under 2 hours with docs only.
-- No undocumented environment variables.
-- Public exposure limited to auth portal and router proxy.
-
-### Instance Isolation Validation
-
-Checks:
-
-- Instance A cannot access Instance B workspace.
-- User file roots stay scoped.
-- Router sends host to correct container.
-- Provider config is instance-specific.
-- Deleting one instance does not affect another.
-
-Pass criteria:
-
-- No cross-instance reads/writes.
-- Registry state remains consistent.
-
-### Runtime Validation
-
-Checks:
-
-- Chat API.
-- WebSocket/runtime status.
-- Task creation and deletion.
-- Task detail events.
-- File upload/preview/download/delete.
-- Tool test.
-- Skill candidate/draft/review/publish.
-- Cron create/toggle/run/delete.
-- Settings provider save.
-- Runtime restart.
-
-Pass criteria:
-
-- Critical user flows pass on desktop and mobile viewport.
-- Failure states have visible recovery.
-
-## 7. Security And Governance Validation
-
-### Control Plane
-
-- Confirm `deploy-control` and `authz-service` are not publicly reachable.
-- Confirm tokens are required for control-plane calls.
-- Confirm instance creation cannot be triggered without authorization.
-
-### Files
-
-- Confirm only allowed user roots are visible.
-- Confirm absolute-style or cross-prefix paths are rejected.
-- Confirm delete operations require explicit user action.
-
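The file checks above (scoped roots, rejection of absolute-style and cross-prefix paths) can be illustrated with a conservative path resolver. This is a sketch under stated assumptions, not Beaver's file API: the function name, the POSIX-style root, and the decision to reject every `..` segment are all hypothetical, and symlink handling is out of scope.

```typescript
// Resolves `requested` under `root`, or returns null when the request
// should be rejected. Assumes POSIX-style separators for stored paths.
function resolveWithinRoot(root: string, requested: string): string | null {
  // Reject absolute-style paths, Windows drive prefixes, and backslashes.
  if (
    requested.startsWith('/') ||
    requested.includes('\\') ||
    /^[A-Za-z]:/.test(requested)
  ) {
    return null;
  }

  const segments: string[] = [];
  for (const part of requested.split('/')) {
    if (part === '' || part === '.') continue; // skip empty and no-op segments
    if (part === '..') return null; // conservative: no parent traversal at all
    segments.push(part);
  }

  // Joining under the root with a separator avoids prefix confusion such as
  // "/data/users/alice" versus "/data/users/alice-other".
  const normalizedRoot = root.replace(/\/+$/, '');
  return segments.length === 0
    ? normalizedRoot
    : `${normalizedRoot}/${segments.join('/')}`;
}
```

Rejecting `..` outright is stricter than normalizing it, but it keeps the validation rule easy to test and to explain to a security reviewer.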
-### Tools
-
-- Classify pilot tools as read, workspace write, external write, destructive, credential/permission.
-- Record tool calls in task evidence.
-- Block or require review for dangerous actions.
-
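The tool classification above maps naturally onto a small policy table. The sketch below is illustrative only: the class names follow the list, but the decision assigned to each class, the evidence record shape, and the tool names in the test are assumptions rather than a documented Beaver policy.

```typescript
type ToolRiskClass =
  | 'read'
  | 'workspace_write'
  | 'external_write'
  | 'destructive'
  | 'credential_permission';

type ToolDecision = 'allow' | 'require_review' | 'block';

// Assumed conservative pilot defaults; a real policy would be configurable.
const PILOT_TOOL_POLICY: Record<ToolRiskClass, ToolDecision> = {
  read: 'allow',
  workspace_write: 'allow',
  external_write: 'require_review',
  destructive: 'block',
  credential_permission: 'block',
};

// Hypothetical evidence entry appended for every tool call decision.
interface ToolCallEvidence {
  tool: string;
  riskClass: ToolRiskClass;
  decision: ToolDecision;
  at: string; // ISO 8601 timestamp
}

function decideToolCall(
  tool: string,
  riskClass: ToolRiskClass,
  evidenceLog: ToolCallEvidence[],
  now: Date = new Date(),
): ToolDecision {
  const decision = PILOT_TOOL_POLICY[riskClass];
  // Record the call whether or not it is allowed, so reviewers see blocks too.
  evidenceLog.push({ tool, riskClass, decision, at: now.toISOString() });
  return decision;
}
```

Logging blocked and review-gated calls alongside allowed ones keeps the task evidence complete for reviewers.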
-### Connectors
-
-- Use sandbox/test accounts for pilot when possible.
-- Confirm callback base URL is per-instance.
-- Confirm disconnect/reconnect path.
-
-### Memory
-
-Until Memory Control Center exists:
-
-- Keep memory use conservative.
-- Document what is stored.
-- Avoid enabling opaque long-term memory for sensitive pilots.
-
-## 8. Usability Validation
-
-Viewports:
-
-- 320px.
-- 375px.
-- 390px.
-- 768px.
-- 1024px.
-- 1365px.
-- One mobile landscape viewport.
-
-Screens:
-
-- Auth portal login/register/provider onboarding.
-- Chat workbench.
-- Task list/detail.
-- Files.
-- Skills.
-- Marketplace.
-- Tools.
-- Notifications/cron.
-- Outlook/connectors if in pilot.
-- Settings/status/logs.
-
-Pass criteria:
-
-- No horizontal overflow.
-- No inaccessible critical controls.
-- Touch targets are usable.
-- Loading, empty, error, success, and disabled states are visible.
-
-## 9. Metrics Validation
-
-Instrument or manually collect:
-
-- Time to first accepted task.
-- Accepted tasks per user/team/week.
-- Acceptance rate.
-- Revision rate.
-- Task run failure rate.
-- Evidence coverage.
-- Skill candidates.
-- Skill drafts.
-- Published skills.
-- Skill reuse.
-- Scheduled run success.
-- Provider setup failure.
-- Instance creation failure.
-- Connector setup failure.
-
-Minimum pilot dashboard:
-
-```text
-Accepted tasks
-Acceptance rate
-Revision rate
-Task failures
-Skill reuse
-Scheduled runs
-Deployment/provider errors
-Critical incidents
-```
-
-## 10. Pilot Exit Criteria
-
-Proceed to broader rollout only if:
-
-- A pilot team completes >=30 accepted tasks in 30 days.
-- At least 2 recurring workflows are active.
-- At least 5 skills are created and 3 reused twice.
-- Task acceptance rate is >=60%.
-- No critical security or deployment incidents occur.
-- Fresh deployment can be repeated from docs.
-- Admin can diagnose common failures from status/logs/runbook.
-- Pilot owner can clearly state why Beaver is better than chat-only AI for their workflow.
-
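The exit criteria above combine numeric thresholds with yes/no judgments, so a checklist function that reports which criteria are still unmet can be more useful than a single boolean. A minimal sketch, assuming a hypothetical `PilotSnapshot` shape filled in from the pilot dashboard and from reviewer sign-off:

```typescript
// Hypothetical summary of a pilot; field names are assumptions.
interface PilotSnapshot {
  acceptedTasksIn30Days: number;
  activeRecurringWorkflows: number;
  skillsCreated: number;
  skillsReusedTwice: number;
  acceptanceRate: number; // fraction between 0 and 1
  criticalIncidents: number;
  deploymentRepeatableFromDocs: boolean;
  adminCanDiagnoseFailures: boolean;
  ownerCanStateAdvantage: boolean;
}

// Returns the labels of all criteria that are not yet satisfied.
function unmetExitCriteria(s: PilotSnapshot): string[] {
  const checks: Array<[boolean, string]> = [
    [s.acceptedTasksIn30Days >= 30, '>=30 accepted tasks in 30 days'],
    [s.activeRecurringWorkflows >= 2, 'at least 2 recurring workflows active'],
    [s.skillsCreated >= 5 && s.skillsReusedTwice >= 3, 'at least 5 skills created and 3 reused twice'],
    [s.acceptanceRate >= 0.6, 'task acceptance rate >=60%'],
    [s.criticalIncidents === 0, 'no critical security or deployment incidents'],
    [s.deploymentRepeatableFromDocs, 'fresh deployment repeatable from docs'],
    [s.adminCanDiagnoseFailures, 'admin can diagnose common failures'],
    [s.ownerCanStateAdvantage, 'pilot owner can state advantage over chat-only AI'],
  ];
  return checks.filter(([passed]) => !passed).map(([, label]) => label);
}
```

An empty result means the pilot may proceed to broader rollout; otherwise the returned labels can feed directly into the decision matrix discussion.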
-## 11. Decision Matrix
-
-| Result | Decision |
-| --- | --- |
-| High task acceptance, low skill reuse | Improve skill learning and workflow templates |
-| High interest, deployment friction | Invest in deploy runbook and health console |
-| Good demos, low recurring use | Revisit target segment and workflow selection |
-| High usage, trust concerns | Prioritize evidence narrative, policy, memory controls |
-| Connector demand dominates | Narrow connector roadmap to one high-value system |
-| Memory concerns dominate | Build Memory Control Center before expansion |
--- a/docs/superpowers/plans/2026-05-22-task-evidence-validation.md
+++ b/docs/superpowers/plans/2026-05-22-task-evidence-validation.md
--- a/docs/superpowers/plans/2026-05-26-task-detail-live-execution.md
+++ b/docs/superpowers/plans/2026-05-26-task-detail-live-execution.md
--- a/docs/superpowers/plans/2026-06-01-channel-runtime-v1.md
+++ b/docs/superpowers/plans/2026-06-01-channel-runtime-v1.md
--- a/docs/superpowers/plans/2026-06-01-terminal-websocket-channel.md
+++ b/docs/superpowers/plans/2026-06-01-terminal-websocket-channel.md
--- a/docs/superpowers/plans/2026-06-02-channel-connectors-foundation.md
+++ b/docs/superpowers/plans/2026-06-02-channel-connectors-foundation.md
--- a/docs/superpowers/plans/2026-06-02-chat-platform-channel-adapters.md
+++ b/docs/superpowers/plans/2026-06-02-chat-platform-channel-adapters.md
--- a/docs/superpowers/plans/2026-06-03-external-connector-backend-runtime.md
+++ b/docs/superpowers/plans/2026-06-03-external-connector-backend-runtime.md
--- a/docs/superpowers/plans/2026-06-03-external-connector-frontend-deploy.md
+++ b/docs/superpowers/plans/2026-06-03-external-connector-frontend-deploy.md
@@ -1,792 +0,0 @@
-# External Connector Frontend And Deploy Implementation Plan
-
-> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
-
-**Goal:** Add a connector-driven onboarding UI for Weixin and Feishu/Lark, wire frontend API helpers to backend connector-session APIs, and verify the docker-compose sidecar deployment path.
-
-**Architecture:** The Status page keeps the existing advanced channel config editor, but adds a connector onboarding section backed by `/api/channel-connectors`, `/api/channel-connections`, and `/api/channel-connector-sessions`. Weixin shows QR status; Feishu/Lark shows provider instructions/status. Successful sessions become active without restart through backend dynamic runtime activation.
-
-**Tech Stack:** Next.js 13, React, TypeScript, existing shadcn/Radix UI components, lucide-react, Vitest, Docker Compose.
-
---
-
-## Dependencies
-
-Execute after:
-
-- `docs/superpowers/plans/2026-06-03-external-connector-backend-runtime.md`
-- `docs/superpowers/plans/2026-06-03-external-connector-sidecar.md`
-
-## Scope
-
-Included:
-
-- Frontend TypeScript API helpers and types for connectors, connections, and connector sessions.
-- Status page connector onboarding UI.
-- QR/instruction modal and polling.
-- Logout/revoke action using existing connection revoke API.
-- Frontend tests for API mapping and UI state helpers.
-- Docker compose smoke verification instructions for local sidecar.
-
-Excluded:
-
-- Replacing the advanced `/api/channels` static config editor.
-- Live vendor account verification logic inside frontend.
-- New top-level navigation route.
-
-## File Structure
-
-- Modify `app-instance/frontend/types/index.ts`
-  - Add connector and connector-session types.
-- Modify `app-instance/frontend/lib/api.ts`
-  - Add connector API functions.
-- Create `app-instance/frontend/lib/channel-connectors.ts`
-  - Small UI state helpers for connector labels/status.
-- Create `app-instance/frontend/components/channel-connector-wizard.tsx`
-  - Connector cards, session modal, QR/instruction rendering, poll controls.
-- Modify `app-instance/frontend/app/(app)/status/page.tsx`
-  - Fetch connector data and render wizard above advanced Channels list.
-- Create `app-instance/frontend/lib/channel-connectors.test.ts`
-  - Helper tests.
-- Create `app-instance/frontend/components/channel-connector-wizard.test.tsx`
-  - Component tests if the existing Vitest setup supports React Testing Library; otherwise keep helper tests and verify with typecheck/build.
-- Review `docker-compose.external-connectors.yml`
-  - Confirm sidecar env names match backend and frontend assumptions.
-
---
-
-### Task 1: Frontend Types And API Client
-
-**Files:**
-- Modify: `app-instance/frontend/types/index.ts`
-- Modify: `app-instance/frontend/lib/api.ts`
-- Test: `app-instance/frontend/lib/channel-connectors.test.ts`
-
-- [ ] **Step 1: Add frontend connector types**
-
-Append to `app-instance/frontend/types/index.ts`:
-
-```ts
-export interface ChannelConnectorDescriptor {
-  kind: string;
-  displayName?: string;
-  display_name?: string;
-  authType?: string;
-  auth_type?: string;
-  providerId?: string;
-  provider_id?: string;
-  capabilities?: string[];
-  available?: boolean;
-  unavailableReason?: string | null;
-}
-
-export interface ChannelConnectionView {
-  connection_id: string;
-  owner_user_id?: string | null;
-  channel_id: string;
-  kind: string;
-  mode: string;
-  display_name: string;
-  account_id: string;
-  status: string;
-  auth_type: string;
-  runtime_config: Record<string, unknown>;
-  capabilities: string[];
-  created_at: string;
-  updated_at: string;
-  last_seen_at?: string | null;
-  last_error?: string | null;
-}
-
-export interface ChannelConnectionResponse {
-  connection: ChannelConnectionView;
-  credentials?: Record<string, string>;
-}
-
-export interface ConnectorSessionView {
-  sessionId: string;
-  kind: string;
-  status: string;
-  qrCode?: string | null;
-  qrImage?: string | null;
-  instructions?: string[];
-  accountId?: string | null;
-  displayName?: string | null;
-  error?: string | null;
-  metadata?: Record<string, unknown>;
-}
-
-export interface ConnectorSessionResponse {
-  session: ConnectorSessionView;
-  connection?: ChannelConnectionView | null;
-}
-```
-
-- [ ] **Step 2: Add API imports**
-
-Modify the import list in `app-instance/frontend/lib/api.ts` to include:
-
-```ts
-  ChannelConnectionResponse,
-  ChannelConnectionView,
-  ChannelConnectorDescriptor,
-  ConnectorSessionResponse,
-```
-
-- [ ] **Step 3: Add connector API functions**
-
-Append to `app-instance/frontend/lib/api.ts` near the channel API functions:
-
-```ts
-export async function listChannelConnectors(): Promise<ChannelConnectorDescriptor[]> {
-  return fetchJSON('/api/channel-connectors');
-}
-
-export async function listChannelConnections(): Promise<ChannelConnectionView[]> {
-  return fetchJSON('/api/channel-connections');
-}
-
-export async function startConnectorSession(params: {
-  kind: string;
-  displayName?: string;
-  ownerUserId?: string;
-  options?: Record<string, unknown>;
-}): Promise<ConnectorSessionResponse> {
-  return fetchJSON('/api/channel-connector-sessions', {
-    method: 'POST',
-    timeoutMs: 45000,
-    body: JSON.stringify({
-      kind: params.kind,
-      displayName: params.displayName,
-      ownerUserId: params.ownerUserId,
-      options: params.options || {},
-    }),
-  });
-}
-
-export async function getConnectorSession(sessionId: string): Promise<ConnectorSessionResponse> {
-  return fetchJSON(`/api/channel-connector-sessions/${encodeURIComponent(sessionId)}`, {
-    timeoutMs: 45000,
-  });
-}
-
-export async function revokeChannelConnection(connectionId: string): Promise<ChannelConnectionResponse> {
-  return fetchJSON(`/api/channel-connections/${encodeURIComponent(connectionId)}/revoke`, {
-    method: 'POST',
-  });
-}
-```
-
-- [ ] **Step 4: Run frontend typecheck**
-
-Run:
-
-```bash
-cd app-instance/frontend
-npm run typecheck
-```
-
-Expected: typecheck passes. If it fails because these types are appended inside another interface, move them below the closing brace for `SystemStatus`.
-
-- [ ] **Step 5: Commit Task 1**
-
-```bash
-git add app-instance/frontend/types/index.ts app-instance/frontend/lib/api.ts
-git commit -m "feat: add connector frontend api client"
-```
-
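The `ChannelConnectorDescriptor` type in Task 1 accepts both camelCase and snake_case spellings of the same fields, which suggests the backend payload casing is not fixed. One way to keep that ambiguity out of UI code is to normalize once at the API boundary. The sketch below is an assumption-laden illustration: `RawConnectorDescriptor` mirrors a subset of the plan's type, while `NormalizedConnectorDescriptor` and `normalizeConnectorDescriptor` are hypothetical names that do not appear in the plan.

```typescript
// Mirrors the dual-cased optional fields of ChannelConnectorDescriptor.
interface RawConnectorDescriptor {
  kind: string;
  displayName?: string;
  display_name?: string;
  authType?: string;
  auth_type?: string;
  providerId?: string;
  provider_id?: string;
}

// Hypothetical normalized shape with a single spelling per field.
interface NormalizedConnectorDescriptor {
  kind: string;
  displayName: string;
  authType: string | null;
  providerId: string | null;
}

function normalizeConnectorDescriptor(
  raw: RawConnectorDescriptor,
): NormalizedConnectorDescriptor {
  return {
    kind: raw.kind,
    // Prefer camelCase, fall back to snake_case, then to the kind itself.
    displayName: raw.displayName ?? raw.display_name ?? raw.kind,
    authType: raw.authType ?? raw.auth_type ?? null,
    providerId: raw.providerId ?? raw.provider_id ?? null,
  };
}
```

Note that `??` only falls through on `null` or `undefined`, so an empty-string `displayName` would be kept here, whereas a truthiness check would skip it.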
---
-
-### Task 2: Connector UI Helpers
-
-**Files:**
-- Create: `app-instance/frontend/lib/channel-connectors.ts`
-- Create: `app-instance/frontend/lib/channel-connectors.test.ts`
-
-- [ ] **Step 1: Write helper tests**
-
-Create `app-instance/frontend/lib/channel-connectors.test.ts`:
-
-```ts
-import { describe, expect, it } from 'vitest';
-import {
-  connectorDisplayName,
-  connectorStatusLabel,
-  isTerminalConnectorSessionStatus,
-} from './channel-connectors';
-
-describe('channel connector helpers', () => {
-  it('returns friendly connector names', () => {
-    expect(connectorDisplayName({ kind: 'weixin' })).toBe('Weixin');
-    expect(connectorDisplayName({ kind: 'feishu' })).toBe('Feishu/Lark');
-    expect(connectorDisplayName({ kind: 'telegram', displayName: 'Telegram' })).toBe('Telegram');
-  });
-
-  it('maps connector session statuses', () => {
-    expect(connectorStatusLabel('qr_ready')).toBe('QR ready');
-    expect(connectorStatusLabel('waiting_for_user')).toBe('Waiting for user');
-    expect(connectorStatusLabel('connected')).toBe('Connected');
-  });
-
-  it('detects terminal statuses', () => {
-    expect(isTerminalConnectorSessionStatus('connected')).toBe(true);
-    expect(isTerminalConnectorSessionStatus('expired')).toBe(true);
-    expect(isTerminalConnectorSessionStatus('qr_ready')).toBe(false);
-  });
-});
-```
-
-- [ ] **Step 2: Run tests to verify failure**
-
-Run:
-
-```bash
-cd app-instance/frontend
-npm run test -- lib/channel-connectors.test.ts
-```
-
-Expected: fail with `Cannot find module './channel-connectors'`.
-
-- [ ] **Step 3: Implement helpers**
-
-Create `app-instance/frontend/lib/channel-connectors.ts`:
-
-```ts
-import type { ChannelConnectorDescriptor } from '@/types';
-
-export function connectorDisplayName(connector: Pick<ChannelConnectorDescriptor, 'kind' | 'displayName' | 'display_name'>): string {
-  if (connector.displayName) return connector.displayName;
-  if (connector.display_name) return connector.display_name;
-  if (connector.kind === 'weixin') return 'Weixin';
-  if (connector.kind === 'feishu') return 'Feishu/Lark';
-  if (connector.kind === 'telegram') return 'Telegram';
-  return connector.kind;
-}
-
-export function connectorStatusLabel(status: string): string {
-  const labels: Record<string, string> = {
-    pending: 'Pending',
-    qr_ready: 'QR ready',
-    scanned: 'Scanned',
-    confirmed: 'Confirmed',
-    installing: 'Installing',
-    waiting_for_user: 'Waiting for user',
-    connected: 'Connected',
-    expired: 'Expired',
-    error: 'Error',
-    cancelled: 'Cancelled',
-  };
-  return labels[status] || status;
-}
-
-export function isTerminalConnectorSessionStatus(status: string): boolean {
-  return ['connected', 'expired', 'error', 'cancelled'].includes(status);
-}
-```
-
-- [ ] **Step 4: Run helper tests**
-
-Run:
-
-```bash
-cd app-instance/frontend
-npm run test -- lib/channel-connectors.test.ts
-```
-
-Expected: helper tests pass.
-
-- [ ] **Step 5: Commit Task 2**
-
-```bash
-git add app-instance/frontend/lib/channel-connectors.ts app-instance/frontend/lib/channel-connectors.test.ts
-git commit -m "feat: add channel connector ui helpers"
-```
-
---
-
-### Task 3: Connector Wizard Component
-
-**Files:**
-- Create: `app-instance/frontend/components/channel-connector-wizard.tsx`
-- Modify: `app-instance/frontend/app/(app)/status/page.tsx`
-
-- [ ] **Step 1: Create wizard component**
-
-Create `app-instance/frontend/components/channel-connector-wizard.tsx`:
-
-```tsx
-'use client';
-
-import React, { useEffect, useMemo, useState } from 'react';
-import { CheckCircle2, Loader2, QrCode, RefreshCw, Unplug } from 'lucide-react';
-import type {
-  ChannelConnectionView,
-  ChannelConnectorDescriptor,
-  ConnectorSessionResponse,
-  ConnectorSessionView,
-} from '@/types';
-import {
-  getConnectorSession,
-  revokeChannelConnection,
-  startConnectorSession,
-} from '@/lib/api';
-import {
-  connectorDisplayName,
-  connectorStatusLabel,
-  isTerminalConnectorSessionStatus,
-} from '@/lib/channel-connectors';
-import { Button } from '@/components/ui/button';
-import { Card, CardContent, CardHeader, CardTitle } from '@/components/ui/card';
-import { Badge } from '@/components/ui/badge';
-import {
-  Dialog,
-  DialogContent,
-  DialogFooter,
-  DialogHeader,
-  DialogTitle,
-} from '@/components/ui/dialog';
-import { Input } from '@/components/ui/input';
-import { Label } from '@/components/ui/label';
-
-type Props = {
-  connectors: ChannelConnectorDescriptor[];
-  connections: ChannelConnectionView[];
-  onChanged: () => Promise<void> | void;
-};
-
-export function ChannelConnectorWizard({ connectors, connections, onChanged }: Props) {
-  const [activeKind, setActiveKind] = useState<string | null>(null);
-  const [session, setSession] = useState<ConnectorSessionView | null>(null);
-  const [connection, setConnection] = useState<ChannelConnectionView | null>(null);
-  const [busy, setBusy] = useState(false);
-  const [error, setError] = useState<string | null>(null);
-  const [feishuDomain, setFeishuDomain] = useState('feishu');
-
-  const visibleConnectors = useMemo(
-    () => connectors.filter((item) => ['telegram', 'weixin', 'feishu'].includes(item.kind)),
-    [connectors],
-  );
-
-  useEffect(() => {
-    if (!session || isTerminalConnectorSessionStatus(session.status)) return;
-    // Poll every 2 seconds until the session reaches a terminal status.
-    let cancelled = false;
-    const timer = window.setInterval(async () => {
-      try {
-        const next = await getConnectorSession(session.sessionId);
-        // Ignore responses that arrive after cleanup (unmount or session change).
-        if (cancelled) return;
-        setSession(next.session);
-        if (next.connection) setConnection(next.connection);
-        if (next.session.status === 'connected') await onChanged();
-      } catch (err: any) {
-        if (cancelled) return;
-        setError(err.message || 'Failed to refresh connector session');
-      }
-    }, 2000);
-    return () => {
-      cancelled = true;
-      window.clearInterval(timer);
-    };
-  }, [session?.sessionId, session?.status, onChanged]);
-
-  const start = async (kind: string) => {
-    setActiveKind(kind);
-    setSession(null);
-    setConnection(null);
-    setError(null);
-    setBusy(true);
-    try {
-      const options = kind === 'feishu' ? { domain: feishuDomain } : {};
-      const response: ConnectorSessionResponse = await startConnectorSession({
-        kind,
-        displayName: connectorDisplayName({ kind }),
-        options,
-      });
-      setSession(response.session);
-      setConnection(response.connection || null);
-    } catch (err: any) {
-      setError(err.message || 'Failed to start connector session');
-    } finally {
-      setBusy(false);
-    }
-  };
-
-  const revoke = async (item: ChannelConnectionView) => {
-    setBusy(true);
-    setError(null);
-    try {
-      await revokeChannelConnection(item.connection_id);
-      await onChanged();
-    } catch (err: any) {
-      setError(err.message || 'Failed to log out connector');
-    } finally {
-      setBusy(false);
-    }
-  };
-
-  return (
-    <section className="space-y-3">
-      <div className="grid gap-3 md:grid-cols-3">
-        {visibleConnectors.map((connector) => {
-          const existing = connections.find((item) => item.kind === connector.kind && item.status !== 'revoked');
-          return (
-            <Card key={connector.kind} className="rounded-md">
-              <CardHeader className="pb-2">
-                <CardTitle className="flex items-center justify-between text-base">
-                  <span>{connectorDisplayName(connector)}</span>
-                  {existing ? <Badge variant="secondary">{existing.status}</Badge> : null}
-                </CardTitle>
-              </CardHeader>
-              <CardContent className="space-y-3">
-                {connector.kind === 'feishu' ? (
-                  <div className="space-y-1">
-                    <Label htmlFor="feishu-domain">Domain</Label>
-                    <Input id="feishu-domain" value={feishuDomain} onChange={(event) => setFeishuDomain(event.target.value)} />
-                  </div>
-                ) : null}
-                {existing ? (
-                  <div className="flex items-center justify-between gap-2 text-sm">
-                    <span className="truncate">{existing.display_name || existing.account_id || existing.channel_id}</span>
-                    <Button size="sm" variant="outline" onClick={() => revoke(existing)} disabled={busy}>
-                      <Unplug className="mr-2 h-4 w-4" />
-                      Logout
-                    </Button>
-                  </div>
-                ) : (
-                  <Button size="sm" onClick={() => start(connector.kind)} disabled={busy || connector.kind === 'telegram'}>
-                    {busy && activeKind === connector.kind ? <Loader2 className="mr-2 h-4 w-4 animate-spin" /> : <QrCode className="mr-2 h-4 w-4" />}
-                    {connector.kind === 'telegram' ? 'Use token setup' : 'Connect'}
-                  </Button>
-                )}
-              </CardContent>
-            </Card>
-          );
-        })}
-      </div>
-
-      {error ? <p className="text-sm text-destructive">{error}</p> : null}
-
-      <Dialog open={Boolean(activeKind && session)} onOpenChange={(open) => !open && setActiveKind(null)}>
-        <DialogContent className="sm:max-w-[520px]">
-          <DialogHeader>
-            <DialogTitle>{activeKind ? connectorDisplayName({ kind: activeKind }) : 'Connector'}</DialogTitle>
-          </DialogHeader>
-          {session ? (
-            <div className="space-y-4">
-              <div className="flex items-center justify-between">
-                <Badge variant={session.status === 'connected' ? 'default' : 'secondary'}>
-                  {connectorStatusLabel(session.status)}
-                </Badge>
-                {session.status === 'connected' ? <CheckCircle2 className="h-5 w-5 text-emerald-600" /> : <RefreshCw className="h-5 w-5 text-muted-foreground" />}
-              </div>
-              {session.qrImage ? (
-                <img alt="Connector QR code" src={session.qrImage} className="mx-auto aspect-square w-64 rounded-md border object-contain" />
-              ) : null}
-              {session.instructions && session.instructions.length > 0 ? (
-                <div className="space-y-2 rounded-md border p-3 text-sm">
-                  {session.instructions.map((item) => <p key={item}>{item}</p>)}
-                </div>
-              ) : null}
-              {connection ? <p className="text-sm text-muted-foreground">{connection.display_name || connection.account_id}</p> : null}
-              {session.error ? <p className="text-sm text-destructive">{session.error}</p> : null}
-            </div>
-          ) : null}
-          <DialogFooter>
-            <Button variant="outline" onClick={() => setActiveKind(null)}>Close</Button>
-          </DialogFooter>
-        </DialogContent>
-      </Dialog>
-    </section>
-  );
-}
-```
-
- [ ] **Step 2: Wire Status page imports**
-
-Modify imports in `app-instance/frontend/app/(app)/status/page.tsx`:
-
-```tsx
-import { ChannelConnectorWizard } from '@/components/channel-connector-wizard';
-import { getChannelConfig, getStatus, listChannelConnections, listChannelConnectors, listChannelEvents, restartRuntime, updateAgentConfig, updateChannelConfig, updateProviderConfig } from '@/lib/api';
-import type { ChannelConfigDetail, ChannelConnectionView, ChannelConnectorDescriptor, ChannelEventRecord, ChannelStatus, ProviderStatus, SystemStatus } from '@/types';
-```
-
- [ ] **Step 3: Add connector state to Status page**
-
-Inside `StatusPage()` state declarations:
-
-```tsx
-  const [channelConnectors, setChannelConnectors] = useState<ChannelConnectorDescriptor[]>([]);
-  const [channelConnections, setChannelConnections] = useState<ChannelConnectionView[]>([]);
-```
-
-Add loader:
-
-```tsx
-  const loadChannelConnectors = async () => {
-    const [connectors, connections] = await Promise.all([
-      listChannelConnectors(),
-      listChannelConnections(),
-    ]);
-    setChannelConnectors(connectors);
-    setChannelConnections(connections);
-  };
-```
-
-Call it after status load:
-
-```tsx
-  useEffect(() => {
-    loadStatus();
-    loadChannelConnectors().catch(() => undefined);
-  }, []);
-```
-
-In `handleSaveChannel()` after `await loadStatus();`, add:
-
-```tsx
-      await loadChannelConnectors();
-```
-
- [ ] **Step 4: Render wizard above advanced Channels list**
-
-In `app-instance/frontend/app/(app)/status/page.tsx`, render before the existing `{/* Channels */}` section:
-
-```tsx
-      <section className="space-y-3">
-        <div>
-          <h2 className="text-lg font-semibold">{pickAppText(locale, '连接器', 'Connectors')}</h2>
-          <p className="text-sm text-muted-foreground">
-            {pickAppText(locale, '连接微信或飞书后会立即进入运行时。', 'Connected Weixin or Feishu channels activate immediately.')}
-          </p>
-        </div>
-        <ChannelConnectorWizard
-          connectors={channelConnectors}
-          connections={channelConnections}
-          onChanged={async () => {
-            await loadChannelConnectors();
-            await loadStatus();
-          }}
-        />
-      </section>
-```
-
- [ ] **Step 5: Run frontend checks**
-
-Run:
-
-```bash
-cd app-instance/frontend
-npm run typecheck
-npm run test -- lib/channel-connectors.test.ts
-```
-
-Expected: typecheck and helper tests pass.
-
- [ ] **Step 6: Commit Task 3**
-
-```bash
-git add app-instance/frontend/components/channel-connector-wizard.tsx app-instance/frontend/app/'(app)'/status/page.tsx
-git commit -m "feat: add channel connector wizard"
-```
-
---
-
-### Task 4: Frontend Build And Browser Smoke
-
-**Files:**
- Review: `app-instance/frontend/app/(app)/status/page.tsx`
- Review: `app-instance/frontend/components/channel-connector-wizard.tsx`
-
- [ ] **Step 1: Run frontend build**
-
-Run:
-
-```bash
-cd app-instance/frontend
-npm run build
-```
-
-Expected: Next build succeeds.
-
- [ ] **Step 2: Start frontend dev server if visual smoke is needed**
-
-Run:
-
-```bash
-cd app-instance/frontend
-npm run dev
-```
-
-Expected: dev server listens on `http://127.0.0.1:3080`.
-
- [ ] **Step 3: Browser smoke check**
-
-Open the Status page in the running app instance and verify:
-
- The Connectors section appears above Channels.
- Telegram shows a disabled `Use token setup` button in the connector wizard.
- Weixin has a Connect button.
- Feishu/Lark has a Domain input and Connect button.
- Starting a fake Weixin session opens a modal with a QR image.
-
- [ ] **Step 4: Stop frontend dev server**
-
-If Step 2 started a dev server, stop it with `Ctrl-C`.
-
- [ ] **Step 5: Commit fixes if needed**
-
-If build or smoke required fixes:
-
-```bash
-git add app-instance/frontend
-git commit -m "fix: stabilize channel connector wizard"
-```
-
-If no files changed, do not create an empty commit.
-
---
-
-### Task 5: Compose Integration Verification
-
-**Files:**
- Review: `docker-compose.external-connectors.yml`
- Review: `.env.example`
-
- [ ] **Step 1: Build backend and sidecar images**
-
-Run:
-
-```bash
-docker build -t beaver/app-instance:latest app-instance
-docker compose -f docker-compose.external-connectors.yml build external-connector
-```
-
-Expected: both builds succeed.
-
- [ ] **Step 2: Start sidecar with fake provider**
-
-Run:
-
-```bash
-CONNECTOR_PROVIDER=fake \
-EXTERNAL_CONNECTOR_TOKEN=dev-token \
-BEAVER_BRIDGE_TOKEN=dev-token \
-docker compose -f docker-compose.external-connectors.yml up -d external-connector
-```
-
-Expected: `external-connector` starts and stays running.
-
- [ ] **Step 3: Verify sidecar connector API**
-
-Run:
-
-```bash
-curl -sS -H 'Authorization: Bearer dev-token' http://127.0.0.1:8787/connectors
-```
-
-Expected: JSON contains `weixin` and `feishu`.
-
- [ ] **Step 4: Attach sidecar to Beaver instance network**
-
-For a local `create-instance.sh` deployment using `beaver-instance-edge`, run:
-
-```bash
-docker network connect beaver-instance-edge external-connector 2>/dev/null || true
-```
-
-Expected: the command exits with status 0. Because of `2>/dev/null || true`, an "endpoint already exists" error is suppressed rather than printed.
-
- [ ] **Step 5: Restart target app instance with connector env**
-
-For `terminaltest`, ensure the app container has:
-
-```dotenv
-EXTERNAL_CONNECTOR_BASE_URL=http://external-connector:8787
-EXTERNAL_CONNECTOR_TOKEN=dev-token
-BEAVER_BRIDGE_TOKEN=dev-token
-EXTERNAL_CONNECTOR_CALLBACK_BASE_URL=http://app-instance-terminaltest:8080
-```
-
-Then recreate the instance with the deployment script used by this repo. Do not mount `/var/run/docker.sock` into Beaver.
-For multi-instance deployments, this callback URL must point at the specific app-instance container that owns the connection; the shared sidecar stores it per connector session and uses it for inbound events.
-
- [ ] **Step 6: Manual fake-provider onboarding**
-
-In `terminaltest`:
-
- Open Status.
- Click Weixin Connect.
- Confirm QR modal appears.
- Wait through a few 2-second polling cycles and confirm the fake session status stays visible.
- Confirm backend `/api/channel-connectors` returns `telegram`, `weixin`, and `feishu`.
-
- [ ] **Step 7: Stop fake sidecar if no longer needed**
-
-Run:
-
-```bash
-docker compose -f docker-compose.external-connectors.yml down
-```
-
-Expected: sidecar stops; named volume remains.
-
---
-
-### Task 6: Final Frontend And Deploy Verification
-
-**Files:**
- Review: `docs/superpowers/specs/2026-06-02-external-sidecar-connectors-design.md`
-
- [ ] **Step 1: Run frontend verification**
-
-Run:
-
-```bash
-cd app-instance/frontend
-npm run typecheck
-npm run build
-npm run test -- lib/channel-connectors.test.ts
-```
-
-Expected: all commands pass.
-
- [ ] **Step 2: Run backend connector smoke tests**
-
-Run:
-
-```bash
-cd app-instance/backend
-uv run pytest \
-  tests/unit/test_external_sidecar_connectors.py \
-  tests/unit/test_external_connector_bridge_api.py \
-  tests/unit/test_channel_runtime_dynamic_channels.py \
-  -q
-```
-
-Expected: all listed tests pass.
-
- [ ] **Step 3: Run sidecar verification**
-
-Run:
-
-```bash
-cd external-connector
-uv run pytest -q
-```
-
-Expected: all sidecar tests pass.
-
- [ ] **Step 4: Scan for provider-runtime naming in new files**
-
-Run:
-
-```bash
-rg -n "[Oo]pen[Cc]law" docs/superpowers app-instance/frontend external-connector docker-compose.external-connectors.yml || true
-```
-
-Expected: no matches.
-
- [ ] **Step 5: Commit verification fixes if needed**
-
-If any verification step required fixes:
-
-```bash
-git add app-instance/frontend external-connector docker-compose.external-connectors.yml docs/superpowers
-git commit -m "fix: stabilize external connector onboarding"
-```
-
-If no files changed, do not create an empty commit.
--- a/docs/superpowers/plans/2026-06-03-external-connector-sidecar.md
+++ b/docs/superpowers/plans/2026-06-03-external-connector-sidecar.md
--- a/docs/superpowers/plans/2026-06-04-auto-accept-on-new-topic.md
+++ b/docs/superpowers/plans/2026-06-04-auto-accept-on-new-topic.md
@ -1,73 +0,0 @@
-# Auto-Accept on New Topic Implementation Plan
-
-> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
-
-**Goal:** Silently accept an awaiting Task before processing an unrelated new topic.
-
-**Architecture:** Keep the existing Intent Agent actions. Treat `simple_chat` and `new_task` decisions made while a Task is active as new-topic boundaries, reuse `submit_acceptance()` for the old Task's latest run, and then continue the original routing decision.
-
-**Tech Stack:** Python, pytest, Beaver TaskService and AgentService
-
---
-
-### Task 1: Lock the State Transition with Tests
-
-**Files:**
- Modify: `app-instance/backend/tests/unit/test_task_mode_feedback.py`
-
- [ ] Add a failing test proving an unrelated `simple_chat` message formally accepts the previous Task and does not append another run to it.
- [ ] Add a failing test proving `new_task` formally accepts the previous Task before creating a separate Task.
- [ ] Add tests proving `continue_task` and `revise_task` retain the existing active Task behavior.
- [ ] Run:
-
-```bash
-uv run pytest -q tests/unit/test_task_mode_feedback.py
-```
-
-Expected before implementation: the new-topic tests fail because the previous Task remains `awaiting_acceptance`.
-
-### Task 2: Implement New-Topic Auto-Accept
-
-**Files:**
- Modify: `app-instance/backend/beaver/services/agent_service.py`
-
- [ ] Add a focused async helper that accepts only an `awaiting_acceptance` Task with a latest run.
- [ ] Call the helper after routing when the decision is `simple_chat` or starts a new Task.
- [ ] Reuse `submit_acceptance()` so acceptance history, final accepted run, run memory, and learning behavior remain consistent.
- [ ] Run:
-
-```bash
-uv run pytest -q tests/unit/test_task_mode_feedback.py
-```
-
-Expected: all task-mode feedback tests pass.
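
The helper described in Task 2 could look roughly like the sketch below. It is an illustration only: the dictionary keys `task_id`, `status`, and `latest_run_id` and the injected `accept` callable are assumptions standing in for the real Task model and `submit_acceptance()`, whose signatures the plan does not show.

```python
from typing import Any, Awaitable, Callable, Optional

# Routing actions that the plan treats as new-topic boundaries.
NEW_TOPIC_ACTIONS = frozenset({"simple_chat", "new_task"})

# Hypothetical stand-in for submit_acceptance(task_id, run_id).
AcceptFn = Callable[[str, str], Awaitable[None]]


async def auto_accept_on_new_topic(
    action: str,
    active_task: Optional[dict[str, Any]],
    accept: AcceptFn,
) -> bool:
    """Accept the active Task's latest run when the message starts a new topic.

    Returns True only when an acceptance was actually submitted.
    """
    if action not in NEW_TOPIC_ACTIONS or active_task is None:
        return False
    # Only a Task that is waiting for acceptance qualifies.
    if active_task.get("status") != "awaiting_acceptance":
        return False
    latest_run_id = active_task.get("latest_run_id")
    if not latest_run_id:
        return False
    await accept(active_task["task_id"], latest_run_id)
    return True
```

Keeping the guard conditions in one helper means `continue_task` and `revise_task` fall through untouched, which matches the tests required in Task 1.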
-
-### Task 3: Clarify Intent Routing Guidance
-
-**Files:**
- Modify: `app-instance/backend/beaver/tasks/router.py`
- Modify: `app-instance/backend/beaver/skills/builtin/intent-agent-router/SKILL.md`
- Modify: `app-instance/backend/tests/unit/test_main_agent_router.py`
-
- [ ] Assert the generated routing prompt explicitly says unrelated lightweight conversation is `simple_chat`, not `revise_task`.
- [ ] Update both routing guidance sources with the same rule and examples.
- [ ] Run:
-
-```bash
-uv run pytest -q tests/unit/test_main_agent_router.py
-```
-
-Expected: all router tests pass.
-
-### Task 4: Regression Verification
-
-**Files:**
- Verify only
-
- [ ] Run:
-
-```bash
-uv run pytest -q tests/unit/test_main_agent_router.py tests/unit/test_task_mode_feedback.py tests/unit/test_active_task_api.py tests/unit/test_process_projection.py
-```
-
- [ ] Inspect the final diff to confirm no frontend confirmation or unrelated state changes were introduced.
--- a/docs/superpowers/plans/2026-06-04-chat-task-timeline-consistency.md
+++ b/docs/superpowers/plans/2026-06-04-chat-task-timeline-consistency.md
@ -1,75 +0,0 @@
-# Chat Task Timeline Consistency Implementation Plan
-
-> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
-
-**Goal:** Render the active Task's canonical timeline in the chat progress sidebar and hide it when no active Task exists.
-
-**Architecture:** Extract task-scoped process filtering into a shared frontend helper, use it in both Task detail and chat, and make the chat sidebar a responsive wrapper around the existing `TaskTimeline` component.
-
-**Tech Stack:** React, Next.js, TypeScript, Vitest
-
---
-
-### Task 1: Extract Shared Task Process Selection
-
-**Files:**
- Create: `app-instance/frontend/lib/task-process.ts`
- Create: `app-instance/frontend/lib/task-process.test.ts`
- Modify: `app-instance/frontend/app/(app)/tasks/[taskId]/page.tsx`
-
- [ ] Write failing tests for merging persisted task process data with matching live process data.
- [ ] Implement `selectTaskProcess()` returning task-scoped runs, events, and artifacts.
- [ ] Replace the Task detail page's local filtering with the shared helper.
- [ ] Run:
-
-```bash
-npm test -- --run lib/task-process.test.ts lib/task-timeline.test.ts
-```
-
-### Task 2: Replace Chat Progress View with Task Timeline
-
-**Files:**
- Modify: `app-instance/frontend/components/chat-workbench/CurrentSessionProgressSidebar.tsx`
- Modify: `app-instance/frontend/app/(app)/page.tsx`
-
- [ ] Load full `BackendTask` detail whenever `activeTask` exists.
- [ ] Clear full Task detail whenever active Task becomes `null` or the session changes.
- [ ] Build chat timeline cards using `selectTaskProcess()` and `buildTaskTimelineCards()`.
- [ ] Change `CurrentSessionProgressSidebar` to accept timeline cards and render `TaskTimeline` without acceptance controls.
- [ ] Remove the chat page's use of `buildSessionProgressView()`.
-
-### Task 3: Add Visibility and Consistency Tests
-
-**Files:**
- Modify: `app-instance/frontend/lib/task-process.test.ts`
- Modify: `app-instance/frontend/lib/task-timeline.test.ts`
- Delete if unused: `app-instance/frontend/lib/session-progress.test.ts`
- Delete if unused: `app-instance/frontend/lib/session-progress.ts`
-
- [ ] Cover empty/no-active input behavior in the shared helper.
- [ ] Confirm the same Task/process input creates the same timeline cards on both surfaces.
- [ ] Remove the obsolete session-progress builder and tests if no imports remain.
- [ ] Run:
-
-```bash
-npm test
-```
-
-### Task 4: Frontend Verification
-
-**Files:**
- Verify only
-
- [ ] Run:
-
-```bash
-npm run typecheck
-npm run build
-```
-
- [ ] Validate the rendered chat flow with Playwright because the Browser plugin is not available:
-
-```text
-chat page with active Task -> open current-session progress -> same timeline cards as Task detail
-Task closes -> current-session progress disappears
-```
--- a/docs/superpowers/plans/2026-06-04-initial-multi-search-engine.md
+++ b/docs/superpowers/plans/2026-06-04-initial-multi-search-engine.md
@ -1,104 +0,0 @@
-# Initial Multi Search Engine Implementation Plan
-
-> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
-
-**Goal:** Replace the initial `web-operation` skill with SkillHub `multi-search-engine` while keeping `web_fetch` reliably available when the skill is selected.
-
-**Architecture:** Initial skills are copied from the repository `skills/` directory into each instance workspace by `create-instance.sh` and `entrypoint.sh`. This change updates the seed catalog, not existing user workspace state.
-
-**Tech Stack:** Python skill catalog storage, JSON seed metadata, Markdown `SKILL.md`, pytest.
-
---
-
-### Task 1: Update Initial Skill Contract
-
-**Files:**
- Modify: `app-instance/backend/tests/unit/test_initial_skill_tool_hints.py`
-
- [ ] **Step 1: Write the failing test**
-
-Change `EXPECTED_INITIAL_SKILL_TOOLS` so it expects:
-
-```python
-"multi-search-engine": ["web_fetch"],
-```
-
-and no longer expects:
-
-```python
-"web-operation": ["web_fetch", "web_search"],
-```
-
- [ ] **Step 2: Run test to verify it fails**
-
-Run:
-
-```bash
-cd app-instance/backend
-pytest tests/unit/test_initial_skill_tool_hints.py -q
-```
-
-Expected: FAIL because `skills/multi-search-engine/versions/v0001/SKILL.md` does not exist yet.
-
-### Task 2: Replace Seed Skill
-
-**Files:**
- Create: `skills/multi-search-engine/current.json`
- Create: `skills/multi-search-engine/skill.json`
- Create: `skills/multi-search-engine/versions/v0001/SKILL.md`
- Create: `skills/multi-search-engine/versions/v0001/version.json`
- Create: `skills/multi-search-engine/versions/v0001/CHANGELOG.md`
- Create: `skills/multi-search-engine/versions/v0001/CHANNELLOG.md`
- Create: `skills/multi-search-engine/versions/v0001/config.json`
- Create: `skills/multi-search-engine/versions/v0001/metadata.json`
- Create: `skills/multi-search-engine/versions/v0001/references/advanced-search.md`
- Create: `skills/multi-search-engine/versions/v0001/references/international-search.md`
- Modify: `skills/_index/published.json`
-
- [ ] **Step 1: Add SkillHub content**
-
-Fetch `global/multi-search-engine@20260413.065325` from SkillHub and store it as seed version `v0001`.
-
- [ ] **Step 2: Add tool hint**
-
-Ensure `SKILL.md` frontmatter contains:
-
-```yaml
-tools:
-  - web_fetch
-```
-
- [ ] **Step 3: Update published index**
-
-Remove `web-operation` and add `multi-search-engine`.
-
-### Task 3: Verify
-
-**Files:**
- Test: `app-instance/backend/tests/unit/test_initial_skill_tool_hints.py`
-
- [ ] **Step 1: Run targeted tests**
-
-Run:
-
-```bash
-cd app-instance/backend
-pytest tests/unit/test_initial_skill_tool_hints.py tests/unit/test_marketplace_and_mcp.py -q
-```
-
-Expected: PASS.
-
- [ ] **Step 2: Inspect seed metadata**
-
-Run:
-
-```bash
-python - <<'PY'
-import json
-from pathlib import Path
-print(json.loads(Path('skills/_index/published.json').read_text())['items'])
-print(json.loads(Path('skills/multi-search-engine/versions/v0001/version.json').read_text())['tool_hints'])
-PY
-```
-
-Expected: `multi-search-engine` is published, `web-operation` is absent, and tool hints are `["web_fetch"]`.
--- a/docs/superpowers/plans/2026-06-08-skill-replay-eval.md
+++ b/docs/superpowers/plans/2026-06-08-skill-replay-eval.md
--- a/docs/superpowers/specs/2026-05-22-task-evidence-validation-design.md
+++ b/docs/superpowers/specs/2026-05-22-task-evidence-validation-design.md
@ -1,265 +0,0 @@
-# Task Evidence and Validation Redesign
-
-Date: 2026-05-22
-
-## Context
-
-Two recent task runs exposed the same underlying weakness from different angles:
-
- The agent can use complete tool results, but validation only receives truncated excerpts. A key fact can be present in the run log yet absent from validation context, causing a false rejection.
- Team execution can gather useful evidence in sub-agent runs, but that evidence is not reliably carried into final synthesis or validation. Failed team nodes are especially lossy.
- Team graphs marked `parallel` are currently scheduled through a shared single-consumer `AgentLoop`, so production execution can be effectively serial.
- Final synthesis after a team run still has full tools available, so it can repeat searches instead of synthesizing from team evidence.
- `max_tool_iterations` stops the tool loop with a placeholder message instead of forcing a final answer from already gathered evidence.
- Validation failures enter an open-looking state, which makes the UI feel like the task never completed.
-
-The selected approach is a medium refactor: keep the existing `AgentService`, `TeamService`, and `AgentLoop` structure, but add a structured evidence pipeline, clearer validation semantics, finite team concurrency, no-tools synthesis after team runs, and explicit task states.
-
-## Goals
-
- Preserve complete run evidence for synthesis and validation.
- Stop using fixed truncation for validation inputs.
- Distinguish "answer is contradicted" from "validator lacks enough evidence".
- Let user feedback be the final business judgment after an answer is shown.
- Make `parallel` team execution actually concurrent within a bounded limit.
- Prevent final synthesis from repeating team tool work by default.
- Produce a useful final answer when tool iteration limits are reached.
- Add enough debug metadata to diagnose validation decisions without reconstructing SQLite logs by hand.
-
-## Non-Goals
-
- Rewriting the whole execution runtime.
- Introducing a distributed worker pool.
- Building a generic evidence bus for every future subsystem.
- Solving all provider rate-limit and storage concurrency concerns beyond the bounded local concurrency needed for team parallel nodes.
-
-## Validation Semantics and Task States
-
-Automatic validation becomes advisory evidence assessment, not the final user satisfaction signal.
-
-Validation results should include:
-
-```python
-status: Literal["accepted", "rejected", "insufficient_evidence", "validator_error"]
-passed: bool
-score: float
-issues: list[str]
-missing_requirements: list[str]
-evidence_gaps: list[str]
-recommended_revision_prompt: str
-```
-
-`status` is the business decision field. `passed` is a compatibility boolean derived from `status`, not an independent source of truth. The mapping is:
-
- `status == "accepted"` -> `passed=True`
- `status in {"rejected", "insufficient_evidence", "validator_error"}` -> `passed=False`
-
-Task mode, retry, and status transition logic must branch on `status`. New code treats `status == "accepted"` as the acceptance condition. Existing compatibility paths may continue to interpret acceptance as `passed and score >= 0.75` until they are migrated, but new logic should not derive status from `passed` or infer failure from `passed=False` alone.
-
-Rules:
-
- `accepted`: the final answer is supported by available evidence and satisfies the task. The task enters `awaiting_feedback`.
- `insufficient_evidence`: the validator cannot confirm the answer from available evidence. It must not claim fabrication or contradiction. The task enters `needs_review`.
- `validator_error`: the validator failed to produce a reliable decision. The task enters `needs_review`.
- `rejected`: the evidence clearly contradicts the answer, or the answer clearly misses the task. The first attempt can trigger retry. The last attempt enters `failed` only when there is no usable answer; otherwise it enters `needs_review`.
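
The rules above can be read as a small transition function. This is a sketch under assumptions: the function name and the `is_last_attempt` and `has_usable_answer` flags are illustrative, and a retriable rejection is mapped to `needs_revision` as the task status list describes.

```python
def next_task_state(
    status: str,
    *,
    is_last_attempt: bool,
    has_usable_answer: bool,
) -> str:
    """Map a validation status to the task state the task should enter."""
    if status == "accepted":
        return "awaiting_feedback"
    if status in ("insufficient_evidence", "validator_error"):
        # The validator could not decide; a human reviews the answer.
        return "needs_review"
    if status == "rejected":
        if not is_last_attempt:
            return "needs_revision"  # the attempt can still be retried
        return "needs_review" if has_usable_answer else "failed"
    raise ValueError(f"unknown validation status: {status!r}")
```

Raising on unknown values keeps a misspelled status from silently falling into a default branch.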
-
-Task statuses:
-
- `open`: task exists but has not started.
- `running`: execution is active.
- `validating`: final answer exists and automatic validation is running.
- `awaiting_feedback`: answer is available and automatic validation accepted it.
- `needs_review`: answer is available, but automatic validation could not confirm it or hit a validator error.
- `needs_revision`: user requested revision, or automatic validation rejected an attempt that can still be retried.
- `failed`: execution ended without a usable answer.
- `closed`: user marked the answer satisfied.
- `abandoned`: user abandoned the task.
-
-`needs_review` remains an open status for the active task API, but the UI should distinguish it from `running`. `failed`, `closed`, and `abandoned` are terminal.
-
-Open status does not mean auto-runnable. The backend should split status semantics:
-
- `is_open`: the task can still receive user feedback or revision.
- `is_execution_active`: the backend is currently running or validating work.
- `requires_user_action`: the task has stopped automatic execution and needs user input.
-
-`needs_review` should have `is_open=True`, `is_execution_active=False`, and `requires_user_action=True`. Schedulers, automatic retry loops, and active-task polling must not treat `needs_review` as a reason to continue execution. It should appear in the active task API only so the user can review, mark satisfied, revise, or abandon.
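
The split semantics might be expressed as a lookup table, sketched below. Only the `needs_review` row and the terminal statuses come directly from this spec; the `running` and `validating` rows are assumptions inferred from the flag definitions, and the remaining statuses are left out because the spec does not state their flags. `StatusFlags` and `may_continue_automatically` are illustrative names.

```python
from typing import NamedTuple


class StatusFlags(NamedTuple):
    is_open: bool
    is_execution_active: bool
    requires_user_action: bool


STATUS_FLAGS: dict[str, StatusFlags] = {
    # Assumed: active execution counts as open and needs no user action.
    "running": StatusFlags(True, True, False),
    "validating": StatusFlags(True, True, False),
    # Stated in the spec: open, not executing, waiting on the user.
    "needs_review": StatusFlags(True, False, True),
    # Terminal statuses are closed in every sense.
    "failed": StatusFlags(False, False, False),
    "closed": StatusFlags(False, False, False),
    "abandoned": StatusFlags(False, False, False),
}


def may_continue_automatically(status: str) -> bool:
    """Schedulers and retry loops should gate on this, not on is_open."""
    flags = STATUS_FLAGS.get(status)
    # Unknown statuses are treated conservatively as not runnable.
    return flags is not None and flags.is_execution_active
```

Gating automation on `is_execution_active` rather than `is_open` is what keeps a `needs_review` task visible to the user without restarting work on it.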
-
-User feedback is authoritative:
-
- `satisfied` closes the task.
- `revise` moves the task to `needs_revision`.
- `abandon` moves the task to `abandoned`.
-
-## Evidence Models
-
-Add structured evidence models in the task or coordinator layer.
-
-```python
-@dataclass(slots=True)
-class ToolEvidence:
-    tool_name: str
-    tool_call_id: str | None
-    content: str
-    event_payload: dict[str, Any]
-    url: str | None = None
-    title: str | None = None
-    created_at: str | None = None
-
-
-@dataclass(slots=True)
-class RunEvidence:
-    run_id: str
-    session_id: str
-    output_text: str
-    finish_reason: str
-    transcript: list[dict[str, Any]]
-    tool_results: list[ToolEvidence]
-    warnings: list[str]
-
-
-@dataclass(slots=True)
-class TaskEvidencePacket:
-    task_id: str
-    attempt_index: int
-    main_run: RunEvidence | None
-    team_runs: list[RunEvidence]
-    team_node_results: list[NodeRunResult]
-    final_output: str
-```
-
-`llm_request_snapshotted` events are debug material, not task evidence. They may be referenced in validation debug metadata, but validation should primarily consume transcript, tool results, team node outputs, and final output.
-
-## Evidence Data Flow
-
-1. `AgentLoop` continues to write session events as it does now.
-2. After a run completes, an evidence builder reads `session_manager.get_run_event_records(session_id, run_id)` and creates `RunEvidence`.
-3. `LocalAgentRunner.run()` attaches `RunEvidence` to `NodeRunResult`.
-4. `NodeRunResult` gains `evidence: RunEvidence | None`.
-5. `TeamRunResult` carries node evidence through `node_results`; it may also expose a convenience `run_evidence` list.
-6. `AgentService._run_task_mode()` builds a `TaskEvidencePacket` after team execution and final synthesis.
-7. Final synthesis receives a rendered evidence context built from the same packet.
-8. `ValidationService.validate_task_result()` receives the same packet and renders it into the validation prompt without fixed truncation.
-
-Failed or partial nodes must still preserve evidence. A node with `finish_reason="max_tool_iterations"` can be unsuccessful while still carrying useful tool results.
-
-## Final Synthesis Behavior
-
-For team-backed task plans, final synthesis defaults to no tools:
-
-```python
-include_tools = False
-max_tool_iterations = 0
-```
-
-The synthesis prompt should instruct the main agent to:
-
- use team evidence as the source of truth;
- avoid repeating failed or completed tool calls;
- answer with available evidence;
- clearly state missing or uncertain information.
-
-The planner may explicitly allow a small synthesis tool budget, but the default is no-tools synthesis. If allowed, the budget should be small, such as `max_tool_iterations=1`.
-
-## Tool Iteration Finalization
-
-When a run reaches `max_tool_iterations` and the model still requests tools, the loop should not return `Tool loop stopped...` as the final user-visible answer.
-
-Instead, the loop performs one no-tools finalization call:
-
- use the accumulated messages and tool results;
- call the provider with `tools=None`;
- add an instruction that the tool budget is exhausted and the model must answer from existing evidence;
- mark the finish reason as `max_tool_iterations_finalized` or another explicit value that is distinct from `stop`;
- return the finalization text as `output_text`.
-
-If finalization itself fails or returns empty content, only then use a clear fallback message explaining that the run could not produce a usable answer.
-
-## Limited Parallel Team Execution
-
-`parallel` team nodes should run concurrently without rewriting the runtime.
-
-Design:
-
- Keep sequence and DAG behavior on the shared loop where appropriate.
- For `parallel` graph batches, run nodes through isolated `AgentLoop` instances.
- Each isolated loop uses the same workspace and service configuration so session and run records remain queryable from the same stores.
- Add `max_parallel_team_nodes`, default `3`.
- Use an `asyncio.Semaphore` in the scheduler to bound concurrent nodes.
- Return `TeamRunResult.node_results` in graph node order, not completion order.
-
-The implementation should check shared store concurrency. If the current store is not safe for concurrent async writes, add a narrow lock around session/task/run store writes used by these parallel runs.
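
A minimal sketch of the bounded scheduler, assuming a generic `run_node` coroutine in place of the isolated `AgentLoop` runs; `run_parallel_batch` is an illustrative name, not an existing function.

```python
import asyncio
from typing import Awaitable, Callable, Sequence, TypeVar

N = TypeVar("N")
R = TypeVar("R")


async def run_parallel_batch(
    nodes: Sequence[N],
    run_node: Callable[[N], Awaitable[R]],
    max_parallel_team_nodes: int = 3,
) -> list[R]:
    """Run one parallel batch with at most `max_parallel_team_nodes` in flight."""
    semaphore = asyncio.Semaphore(max_parallel_team_nodes)

    async def bounded(node: N) -> R:
        async with semaphore:
            return await run_node(node)

    # asyncio.gather returns results in argument order, so the output follows
    # graph node order even when later nodes finish first.
    return list(await asyncio.gather(*(bounded(node) for node in nodes)))
```

In this sketch a raising `run_node` would propagate out of `gather`; since failed nodes must still carry evidence, the real runner would more likely return an unsuccessful result object than raise.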
-
-## Validation Prompt
-
-The validation prompt should consume the full rendered evidence packet, with no fixed slice caps such as `[:2500]`, `[:500]`, or `[:12]`.
-
-Required validator instructions:
-
- Return only JSON with the validation fields.
- If evidence is incomplete, return `insufficient_evidence`.
- Only return `rejected` for clear contradiction or clear task failure.
- Do not infer fabrication from missing evidence.
- Do not claim a source lacks a fact unless the rendered evidence proves that absence.
- Treat user feedback as the final business judgment outside automatic validation.
-
-The validator should still be strict about answer quality when evidence is sufficient.
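
A hedged sketch of how a caller might interpret the validator's JSON reply. The `result` field name and the `parse_validation_result` helper are hypothetical, and mapping `rejected` to `failed` is an assumption of this example; the other status pairs are the transitions named in this spec's testing plan.

```python
import json

# Task status per validation result. `rejected -> failed` is assumed here.
TASK_STATUS_FOR_RESULT = {
    "accepted": "awaiting_feedback",
    "insufficient_evidence": "needs_review",
    "validator_error": "needs_review",
    "rejected": "failed",
}


def parse_validation_result(raw_response):
    """Return the validation result, or `validator_error` for unusable output."""
    try:
        data = json.loads(raw_response)
    except (TypeError, ValueError):
        return "validator_error"
    if not isinstance(data, dict):
        return "validator_error"
    result = data.get("result")  # hypothetical field name
    if result in ("accepted", "rejected", "insufficient_evidence"):
        return result
    return "validator_error"
```

Treating malformed output as `validator_error` keeps a validator failure from being recorded as a task failure.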
-
-## Validation Debug Metadata
-
-Each `task_validation_snapshotted` event should record:
-
- validation result;
- validation status;
- attempt index;
- evidence run ids;
- evidence session ids;
- tool result count;
- evidence character length;
- validator raw response;
- rendered validation input or prompt, unless a debug setting disables full prompt storage.
-
-This makes future investigations direct: inspect the exact input the validator saw before interpreting its decision.
-
-## Log Snapshot Size
-
-`llm_request_snapshotted` currently stores complete messages and complete tool schemas in both payload and content. That makes logs large and slows inspection.
-
-Default behavior should change to store a compact payload:
-
- iteration;
- provider name and model;
- message count;
- tool names;
- message character length;
- tool schema character length;
- max tokens, temperature, thinking flag.
-
-Full request snapshots should be controlled by a debug config flag. This does not reduce validation evidence because evidence comes from transcript and tool result events.
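
As an illustration, a compact payload could be built like this. The function name, the key names, and the assumption that each tool schema is a dict with a top-level `name` are all hypothetical.

```python
import json


def compact_request_snapshot(*, iteration, provider, model, messages, tools,
                             max_tokens, temperature, thinking):
    """Summarize an LLM request without storing messages or tool schemas."""
    return {
        "iteration": iteration,
        "provider": provider,
        "model": model,
        "message_count": len(messages),
        # Assumes each tool schema exposes its name at the top level.
        "tool_names": [tool.get("name", "") for tool in tools],
        "message_chars": sum(len(str(m.get("content", ""))) for m in messages),
        "tool_schema_chars": len(json.dumps(tools, ensure_ascii=False)),
        "max_tokens": max_tokens,
        "temperature": temperature,
        "thinking": thinking,
    }
```

The character counts keep payload size visible in logs even though the content itself is no longer stored by default.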
-
-## Testing Plan
-
-Add or update focused unit tests:
-
-1. Validation evidence is not fixed-truncated. A fact after the first 500 characters of a tool result still appears in the validator input.
-2. Missing evidence returns `insufficient_evidence` and moves the task to `needs_review`, not `failed`.
-3. A team node that ends with `max_tool_iterations` preserves tool evidence in `NodeRunResult.evidence`.
-4. Team final synthesis defaults to `tools=None` and receives rendered team evidence.
-5. Parallel team nodes start concurrently under a bounded semaphore and results remain in graph order.
-6. Tool loop finalization produces a user-visible answer instead of the placeholder stop message.
-7. Status transitions cover `accepted -> awaiting_feedback`, `insufficient_evidence -> needs_review`, `validator_error -> needs_review`, and terminal `failed`.
-8. Validation debug events include evidence metadata and validator raw response.
-
-## Migration Notes
-
-To reduce risk, implement in layers:
-
-1. Add evidence models and builders without changing behavior.
-2. Attach evidence to team node results.
-3. Switch final synthesis for team plans to no-tools evidence synthesis.
-4. Switch validation to evidence packets and new statuses.
-5. Add no-tools finalization for tool iteration limits.
-6. Add limited isolated-loop parallel execution.
-7. Slim `llm_request_snapshotted` behind a debug flag.
-
-This order keeps each change testable and lets the old transcript-summary path remain as a temporary fallback while evidence packets are introduced.
--- a/docs/superpowers/specs/2026-05-26-task-detail-live-execution-design.md
+++ b/docs/superpowers/specs/2026-05-26-task-detail-live-execution-design.md
@ -1,440 +0,0 @@
-# Task Detail Live Execution Design
-
-## Purpose
-
-Task detail should be a live execution surface for ordinary users. It should answer "what is Beaver doing now?", "what has already happened?", "what changed because of a tool or agent result?", and "what can I inspect or accept?" without forcing the user to wait for a final answer.
-
-This page is not primarily a developer audit view. It should expose enough execution detail to create confidence, while keeping raw payloads, long tool output, and debug metadata behind progressive disclosure.
-
-## User Experience Principles
-
- Show progress as a chronological card feed that grows while the task runs.
- Prefer user-facing explanations over raw internal event names.
- Show skill selection, tool usage, tool result, agent team activity, artifacts, and final result as first-class cards.
- Do not expose hidden chain-of-thought. Use brief action summaries such as "Beaver found the relevant files and will now inspect the API response shape."
- Keep the user oriented with a persistent task header and clear current status.
- Stop live updates once the task reaches a terminal state, while still allowing manual refresh.
-
-## Page Layout
-
-### Persistent Header
-
-The top header remains visible while scrolling and contains:
-
- task title
- task status: open, running, awaiting acceptance, needs revision, closed, abandoned, error, or cancelled
- current stage label
- elapsed time
- compact progress summary
- link back to task list
- link to source conversation
- acceptance entry point when a run is ready for review
-
-### Main Timeline
-
-The main column is a chronological card feed. Cards append as execution events arrive.
-
-Expected card sequence:
-
-1. task created
-2. planning started or completed
-3. skill selected
-4. tool call started
-5. tool call finished
-6. model next step
-7. agent team created
-8. sub-agent started
-9. sub-agent progress
-10. agent handoff
-11. sub-agent finished
-12. artifact created
-13. result ready
-14. acceptance recorded
-
-Cards should visually appear in order and keep enough prior context visible so the page feels like a live work log rather than a static report.
-
-### Side Rail
-
-The side rail contains compact, always-accessible context:
-
- agent team map
- currently active agent or tool
- artifacts list
- latest warning or blocked state
- acceptance state
-
-On small screens, the side rail collapses below the header or into tabs.
-
-## Card Types
-
-### Task Created Card
-
-Shows that Beaver recognized the user message as a task.
-
-Fields:
-
- task goal
- source session
- created time
- initial status
-
-### Plan Card
-
-Shows the execution approach.
-
-Fields:
-
- mode: single agent or agent team
- planned stages
- attempt index
- strategy summary
-
-### Skill Card
-
-Shows which skill Beaver selected and why it matters.
-
-Fields:
-
- skill name
- skill version if available
- user-facing reason
- capabilities or method guidance summary
-
-If multiple skills are selected, render one grouped card with individual rows.
-
-### Tool Call Card
-
-Shows that Beaver is using a tool.
-
-Fields:
-
- tool name
- action summary
- actor name
- status: running, done, failed
- started time
- duration if completed
-
-Raw tool arguments are hidden by default.
-
-### Tool Result Card
-
-Shows what the tool found or produced.
-
-Fields:
-
- success or failure
- result summary
- error message if any
- links to artifact or output
- expandable raw result
-
-### Next Step Card
-
-Shows Beaver's next user-visible action after interpreting a result.
-
-Fields:
-
- short action explanation
- related prior card or run
- expected next event type when known
-
-This card must not contain private reasoning traces.
-
-### Agent Team Card
-
-Shows that Beaver created a multi-agent team.
-
-Fields:
-
- team strategy
- agent count
- dependency shape
- agent names and assigned tasks
-
-### Sub-Agent Card
-
-Shows progress from an individual agent.
-
-Fields:
-
- agent name
- assigned task
- current status
- progress text
- latest output summary
-
-### Agent Handoff Card
-
-Shows interaction between agents.
-
-Fields:
-
- source agent
- target agent
- handoff reason
- summary of transferred result
-
-### Artifact Card
-
-Shows an output created during execution.
-
-Fields:
-
- artifact title
- artifact type
- source agent or run
- created time
- open or download action
- summary or preview where safe
-
-### Error or Blocked Card
-
-Shows that execution hit a problem.
-
-Fields:
-
- problem summary
- affected stage or tool
- whether Beaver can continue automatically
- action required from user if any
-
-### Final Result Card
-
-Shows the result that the user can review.
-
-Fields:
-
- final answer or result summary
- important artifacts
- validation or evidence status when available
- accept, revise, and abandon actions
-
-## Realtime Behavior
-
-### Live Updates
-
-The page should subscribe to task-related process events while the task is active. The following updates should append or update cards in real time:
-
- skill selected
- tool call started
- tool call finished
- agent team created
- sub-agent started
- sub-agent progress
- sub-agent finished
- agent handoff
- artifact created
- task result ready
- task error or blocked state
- acceptance recorded
-
-### Initial Load
-
-On page load, call `GET /api/tasks/{task_id}` and hydrate:
-
- task metadata
- lifecycle events
- process runs
- process events
- process artifacts
- readable run messages
- existing feedback
-
-The frontend should build the initial card feed from these persisted records so a refreshed page reconstructs the same execution timeline.
-
-### Fallback Polling
-
-If WebSocket updates are unavailable, active tasks should poll `GET /api/tasks/{task_id}` every 3 to 5 seconds.
-
-Polling stops when the task reaches a terminal state:
-
- closed
- abandoned
- cancelled
- error
-
-Manual refresh remains available.
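
The polling rule can be sketched as a small loop. The real client is the frontend, so this Python version only illustrates the control flow; `fetch_task` is a hypothetical callable that wraps `GET /api/tasks/{task_id}` and returns a dict with a `status` key.

```python
import time

TERMINAL_STATUSES = {"closed", "abandoned", "cancelled", "error"}


def poll_until_terminal(fetch_task, interval_seconds=3.0, sleep=time.sleep):
    """Yield each fetched task snapshot and stop after a terminal status."""
    while True:
        task = fetch_task()
        yield task
        if task["status"] in TERMINAL_STATUSES:
            return  # no further automatic polling; manual refresh still works
        sleep(interval_seconds)
```

Injecting `sleep` keeps the loop testable without real delays.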
-
-### Large Content Loading
-
-The following content should not be loaded or expanded by default:
-
- raw tool arguments
- full tool output
- raw process event payloads
- full transcript
- memory retrieval trace
- debug metadata
-
-These belong behind "show details" controls or a later advanced view.
-
-## Backend Event Contract
-
-The existing task detail API already exposes useful primitives:
-
- `process_runs`
- `process_events`
- `process_artifacts`
- `runs`
- `events`
- `skill_names`
- task metadata and feedback
-
-For a reliable user-facing timeline, backend events should become more explicit. Recommended event kinds:
-
- `task_created`
- `task_planned`
- `skill_selected`
- `tool_call_started`
- `tool_call_finished`
- `agent_team_created`
- `agent_started`
- `agent_progress`
- `agent_handoff`
- `agent_finished`
- `artifact_created`
- `task_result_ready`
- `task_acceptance_recorded`
- `task_error`
-
-Each event should include:
-
- `event_id`
- `task_id`
- `run_id` when applicable
- `parent_run_id` when applicable
- `actor_type`
- `actor_name`
- `kind`
- `status`
- `text`
- `created_at`
- compact `metadata`
-
-Metadata should contain structured fields for rendering, not only raw provider or tool payloads.
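
One possible shape for such an event, written as a sketch rather than the actual backend model. The class name `TaskProcessEvent` and the validation in `__post_init__` are assumptions; the field names and event kinds are the ones listed above.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

EVENT_KINDS = frozenset({
    "task_created", "task_planned", "skill_selected",
    "tool_call_started", "tool_call_finished",
    "agent_team_created", "agent_started", "agent_progress",
    "agent_handoff", "agent_finished", "artifact_created",
    "task_result_ready", "task_acceptance_recorded", "task_error",
})


@dataclass(frozen=True)
class TaskProcessEvent:
    event_id: str
    task_id: str
    actor_type: str
    actor_name: str
    kind: str
    status: str
    text: str
    created_at: str
    run_id: Optional[str] = None         # set when applicable
    parent_run_id: Optional[str] = None  # set when applicable
    metadata: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject kinds outside the recommended list so the UI never
        # receives an event it cannot map to a card.
        if self.kind not in EVENT_KINDS:
            raise ValueError(f"Unknown event kind: {self.kind}")
```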
-
-## Frontend Rendering Model
-
-The frontend should normalize events into a `TaskTimelineCard` view model.
-
-Recommended fields:
-
-```ts
-type TaskTimelineCard = {
-  id: string;
-  taskId: string;
-  runId?: string | null;
-  parentRunId?: string | null;
-  type:
-    | 'task_created'
-    | 'plan'
-    | 'skill'
-    | 'tool_call'
-    | 'tool_result'
-    | 'next_step'
-    | 'agent_team'
-    | 'agent_progress'
-    | 'agent_handoff'
-    | 'artifact'
-    | 'error'
-    | 'result'
-    | 'acceptance';
-  title: string;
-  summary?: string;
-  actorName?: string;
-  status?: string;
-  createdAt: string;
-  relatedArtifactIds?: string[];
-  details?: Record<string, unknown>;
-};
-```
-
-This keeps rendering stable even if backend event payloads evolve.
-
-## Empty, Loading, and Error States
-
-### No Events Yet
-
-Show a task created card and a running placeholder:
-
-"Beaver is preparing the first step."
-
-### Waiting on Tool
-
-Show the active tool call card with a spinner and elapsed time.
-
-### Waiting on Agent
-
-Show the active agent card with its assigned task and current status.
-
-### Failed Tool
-
-Show an error card with a concise reason and whether Beaver is retrying or changing approach.
-
-### Lost Connection
-
-Keep existing cards visible and show a small reconnecting indicator. If reconnect fails, fall back to polling.
-
-## Acceptance Flow
-
-The final result card is the primary acceptance surface.
-
-Actions:
-
- Accept: closes the task and can trigger skill learning.
- Needs revision: requires a comment, appends a new revision card, and starts another attempt in the same timeline.
- Abandon: closes the task as abandoned and preserves the execution history.
-
-After any acceptance action, the page should immediately update local UI state and refetch the task detail.
-
-## V1 Scope
-
-V1 includes:
-
- persistent task header
- live chronological card feed
- skill cards
- tool call and result cards
- agent team card
- sub-agent progress cards
- artifact cards
- final result and acceptance card
- WebSocket-first updates with polling fallback
- collapsed raw details
-
-V1 excludes:
-
- full administrator audit mode
- memory retrieval graph visualization
- raw provider request/response viewer
- advanced event payload debugger
- editable task graph
-
-## Implementation Notes
-
-The existing `tasks/[taskId]/page.tsx` already has useful pieces, but the main hierarchy should shift from phase groups and selected node detail to a timeline-first experience.
-
-Likely frontend modules:
-
- `TaskLiveHeader`
- `TaskTimeline`
- `TaskTimelineCard`
- `TaskSideRail`
- `TaskAcceptanceCard`
- `buildTaskTimelineCards`
-
-Likely backend work:
-
- emit explicit process events for skill selection and tool calls
- include user-facing text summaries in event metadata
- ensure task detail reconstruction uses persisted events
- expose enough run and actor metadata for agent team rendering
-
-## Self-Review
-
- No placeholders remain.
- The design is scoped to ordinary-user task detail, not admin audit.
- Realtime requirements distinguish live updates from expandable heavy details.
- Backend event requirements are explicit enough for frontend implementation.
- V1 scope avoids memory graph and debug payload work.
--- a/docs/superpowers/specs/2026-06-01-terminal-websocket-channel-design.md
+++ b/docs/superpowers/specs/2026-06-01-terminal-websocket-channel-design.md
@ -1,279 +0,0 @@
-# Terminal WebSocket Channel Design
-
-Date: 2026-06-01
-
-## Goal
-
-Add a text-only WebSocket channel adapter so a small terminal device can connect to Beaver and exchange messages through the channel runtime.
-
-This is a first-stage acceptance path for proving Beaver can talk to the terminal device. The terminal must enter through `ChannelRuntime` and `MessageBus`; it must not use the existing Web UI `/ws/{session_id}` direct-chat path.
-
-## Non-Goals
-
- Do not implement audio, camera, screen, image, or multimodal payloads.
- Do not stream token deltas to the terminal in this phase.
- Do not add AuthZ or device registration in this phase.
- Do not implement the Hermes LiveKit LLM adapter in this phase.
- Do not route terminal messages directly to `AgentService`.
-
-## Recommended Architecture
-
-Add a channel-native WebSocket adapter named `TerminalWebSocketAdapter`.
-
-The Web backend exposes:
-
-```text
-/api/channels/{channel_id}/ws
-```
-
-The route resolves the configured channel adapter from `ChannelRuntime` and delegates the accepted WebSocket to the adapter. The adapter owns terminal connection state, normalizes incoming frames into `InboundMessage`, and receives `OutboundMessage` objects through `ChannelManager.dispatch_outbound()`.
-
-The path remains bus-first:
-
-```text
-terminal websocket
-> TerminalWebSocketAdapter
-> ChannelRuntime.accept_inbound()
-> MessageBus.inbound
-> ChannelRuntime bridge
-> AgentService.handle_inbound_message()
-> MessageBus.outbound
-> ChannelManager.dispatch_outbound()
-> TerminalWebSocketAdapter.send()
-> terminal websocket
-```
-
-## Channel Configuration
-
-The terminal channel uses the existing `BeaverConfig.channels` map.
-
-Example:
-
-```json
-{
-  "channels": {
-    "terminal-dev": {
-      "enabled": true,
-      "kind": "terminal",
-      "mode": "websocket",
-      "accountId": "local",
-      "displayName": "Terminal Dev",
-      "config": {
-        "heartbeatSeconds": 30,
-        "maxMessageChars": 20000
-      }
-    }
-  }
-}
-```
-
-`kind` is the platform family. `mode` is the transport mode. The adapter factory must instantiate `TerminalWebSocketAdapter` when `kind == "terminal"` and `mode == "websocket"`.
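
A minimal sketch of that factory rule. The registry, the `create_adapter` function, and the stub adapter class are hypothetical; only the `kind`/`mode` dispatch comes from the text above.

```python
class TerminalWebSocketAdapter:
    """Stub standing in for the real adapter; it only stores its config."""

    def __init__(self, channel_id, channel_config):
        self.channel_id = channel_id
        self.config = channel_config.get("config", {})


# Hypothetical registry keyed by (kind, mode).
ADAPTER_FACTORIES = {
    ("terminal", "websocket"): TerminalWebSocketAdapter,
}


def create_adapter(channel_id, channel_config):
    """Return an adapter for an enabled channel, or None when disabled."""
    if not channel_config.get("enabled", False):
        return None
    key = (channel_config.get("kind"), channel_config.get("mode"))
    factory = ADAPTER_FACTORIES.get(key)
    if factory is None:
        raise ValueError(f"No adapter for kind={key[0]!r}, mode={key[1]!r}")
    return factory(channel_id, channel_config)
```

Keying on the pair rather than on `kind` alone leaves room for a future terminal transport without changing the lookup.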
-
-## Protocol
-
-The protocol is JSON over WebSocket. All payloads are text-only.
-
-The terminal starts with a connect frame:
-
-```json
-{
-  "type": "connect",
-  "peer_id": "device-001",
-  "device_name": "desk-terminal",
-  "capabilities": ["text"]
-}
-```
-
-Beaver replies:
-
-```json
-{
-  "type": "connected",
-  "channel_id": "terminal-dev",
-  "session_id": "terminal-dev:local:device-001"
-}
-```
-
-The terminal sends user text:
-
-```json
-{
-  "type": "message",
-  "message_id": "m-001",
-  "text": "Hello"
-}
-```
-
-Beaver acknowledges accepted inbound:
-
-```json
-{
-  "type": "ack",
-  "message_id": "m-001",
-  "session_id": "terminal-dev:local:device-001",
-  "accepted": true
-}
-```
-
-Beaver sends the final assistant response:
-
-```json
-{
-  "type": "message",
-  "role": "assistant",
-  "message_id": "m-001",
-  "run_id": "run-id",
-  "text": "Hello, I'm here.",
-  "finish_reason": "stop"
-}
-```
-
-Ping/pong frames are supported:
-
-```json
-{"type": "ping"}
-{"type": "pong"}
-```
-
-Unsupported frame types return an error frame and keep the connection open:
-
-```json
-{"type": "error", "error": "Unsupported websocket frame type: example"}
-```
-
-## Identity And Session Mapping
-
-The adapter builds a `ChannelIdentity` from the connect and message frames:
-
- `channel_id`: path/config channel id, such as `terminal-dev`
- `kind`: `terminal`
- `account_id`: channel config account id, such as `local`
- `peer_id`: terminal `peer_id`
- `peer_type`: `terminal`
- `message_id`: message frame `message_id`
- `thread_id`: optional message or connect frame field
- `user_id`: optional message or connect frame field
-
-The session id stays aligned with channel runtime v1:
-
-```text
-<channel_id>:<account_id>:<peer_id>[:<thread_id>]
-```
-
-For the first terminal rollout, a terminal connection is treated as one active peer. A reconnect with the same `peer_id` reuses the same session id.
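
The session id format can be expressed as a small helper. `build_session_id` is a hypothetical name used only for this illustration.

```python
def build_session_id(channel_id, account_id, peer_id, thread_id=None):
    """Join identity parts as <channel_id>:<account_id>:<peer_id>[:<thread_id>]."""
    parts = [channel_id, account_id, peer_id]
    if thread_id:
        parts.append(thread_id)
    return ":".join(parts)
```

Because the id depends only on the channel, account, and peer, a reconnect with the same `peer_id` produces the same session id. The sketch does not escape `:` inside the parts, so ids that contain colons would need separate handling.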
-
-## Delivery Semantics
-
-Inbound messages are accepted through `ChannelRuntime.accept_inbound()`.
-
-If dedupe sees a duplicate message id:
-
- return an ack with `duplicate: true`
- include cached `reply` when the prior run is done
- include `pending: true` when the prior run is still processing
- do not publish a second inbound message
-
-Outbound delivery is connection-bound. `TerminalWebSocketAdapter.send()` looks up the active connection for the outbound session or peer. If found, it sends the final assistant message. If no connection is available, it marks the outbound message as unclaimed so runtime records `outbound_unclaimed`.
-
-No retry queue is required in this phase.
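
The duplicate handling above might look like this in outline. `build_ack` and the `prior_run` record shape (`done`, `reply`) are hypothetical; keeping `accepted` true on duplicates is also an assumption, since the spec does not state it.

```python
def build_ack(message_id, session_id, prior_run=None):
    """Build the ack frame for a message; `prior_run` is the dedupe record, if any."""
    ack = {
        "type": "ack",
        "message_id": message_id,
        "session_id": session_id,
        "accepted": True,
    }
    if prior_run is None:
        return ack  # first delivery: the caller publishes the inbound message
    ack["duplicate"] = True  # duplicates are never published a second time
    if prior_run.get("done"):
        ack["reply"] = prior_run.get("reply")  # cached reply from the prior run
    else:
        ack["pending"] = True  # prior run is still processing
    return ack
```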
-
-## Runtime Status And Events
-
-`/api/status` and `/api/channels` include terminal channels with:
-
- `channel_id`
- `kind`
- `mode`
- `display_name`
- `enabled`
- `state`
- `account_id`
- `last_event_at`
- `websocket_url`
- `capabilities`, including `receive_text`, `send_text`, and `persistent_connection`
- `connected_peers`
-
-Channel events should record:
-
- `adapter_started`
- `terminal_connected`
- `terminal_disconnected`
- `inbound_accepted`
- `inbound_duplicate`
- `direct_run_started`
- `direct_run_finished`
- `outbound_delivered`
- `outbound_unclaimed`
- `adapter_stopped`
-
-Do not store raw terminal payloads or full message text in the event log. Existing text preview behavior is enough.
-
-## Nginx And Deployment
-
-The existing `/api/channels/` nginx location must support WebSocket upgrade because terminal WebSockets live under that prefix.
-
-The location should include:
-
-```nginx
-proxy_http_version 1.1;
-proxy_set_header Upgrade $http_upgrade;
-proxy_set_header Connection $connection_upgrade;
-proxy_read_timeout 3600;
-proxy_send_timeout 3600;
-```
-
-The 1800-second timeout used by synchronous webhooks can stay, but WebSocket upgrade headers are required for terminal devices.
-
-## Error Handling
-
-Before connect:
-
- only `connect` and `ping` are accepted
- `message` returns an error requiring connect first
-
-On connect:
-
- missing `peer_id` closes or rejects with an error frame
- unsupported capabilities are ignored for now as long as text is available
-
-On message:
-
- missing `message_id` returns an error
- missing or blank `text` returns an error
- oversized text returns an error based on `max_message_chars` (the `maxMessageChars` value in the channel config)
-
-On disconnect:
-
- remove the active connection
- record `terminal_disconnected`
- do not cancel an already running Beaver direct run
-
-If the run completes after disconnect, outbound is recorded as `outbound_unclaimed`.
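
The frame rules in this section can be collected into one validation function. This is a sketch under stated assumptions: `frame_error` is a hypothetical helper, and every error string except the unsupported-type message is invented for the example.

```python
from typing import Optional

KNOWN_FRAME_TYPES = {"connect", "message", "ping", "pong"}


def frame_error(frame, connected, max_message_chars=20000) -> Optional[str]:
    """Return an error string for an invalid frame, or None when it is acceptable."""
    frame_type = frame.get("type")
    if frame_type not in KNOWN_FRAME_TYPES:
        return f"Unsupported websocket frame type: {frame_type}"
    if not connected and frame_type not in {"connect", "ping"}:
        return "Send a connect frame before sending messages"
    if frame_type == "connect" and not frame.get("peer_id"):
        return "connect frame requires peer_id"
    if frame_type == "message":
        if not frame.get("message_id"):
            return "message frame requires message_id"
        text = frame.get("text")
        if not isinstance(text, str) or not text.strip():
            return "message frame requires non-blank text"
        if len(text) > max_message_chars:
            return f"text exceeds {max_message_chars} characters"
    return None
```

Returning a string instead of raising lets the adapter send an error frame and keep the connection open, as the protocol section requires for unsupported frame types.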
-
-## Testing
-
-Add focused backend tests:
-
- WebSocket connect returns `connected` with stable session id.
- Message frame publishes through runtime and returns ack plus assistant message.
- Duplicate message id does not publish a second inbound and returns duplicate status.
- Disconnect before outbound records `outbound_unclaimed`.
- Unknown frame type returns an error and keeps the connection alive.
- Channel status exposes `websocket_url` and connected peer count.
- Config loader accepts `kind=terminal`, `mode=websocket` through existing channel config.
-
-Run the existing backend unit suite and frontend type/test checks after implementation.
-
-## Acceptance Criteria
-
-The first-stage acceptance is complete when a small terminal can:
-
-1. Connect to `/api/channels/terminal-dev/ws`.
-2. Send a `connect` frame with a stable `peer_id`.
-3. Send a text `message` frame.
-4. Receive an ack.
-5. Receive the final assistant text response from Beaver.
-6. Reconnect with the same `peer_id` and keep the same Beaver session id.
-7. Show connection and message events in Beaver channel status/events.
-
-This validates the Beaver-to-terminal path through the new channel runtime without introducing AuthZ, multimodal payloads, or Hermes LiveKit LLM work.
--- a/docs/superpowers/specs/2026-06-02-channel-connectors-and-pairing-design.md
+++ b/docs/superpowers/specs/2026-06-02-channel-connectors-and-pairing-design.md
@ -1,404 +0,0 @@
-# Channel Connectors And Pairing Design
-
-Date: 2026-06-02
-
-## Goal
-
-Add a first-class connection layer above Beaver's channel runtime so users can connect messaging platforms through plugin, QR, OAuth, token, or app-credential flows instead of editing static channel JSON by hand.
-
-This design reframes platform channels as two cooperating layers:
-
-```text
-ChannelConnector
-> install / auth / QR / OAuth / credential validation / login state
-> ChannelConnectionStore
-> ChannelRuntime
-> ChannelAdapter or ExternalConnectorChannel
-> MessageBus
-> AgentService
-```
-
-The existing `ChannelRuntime`, `MessageBus`, `ChannelManager`, and `ChannelAdapter` contracts remain the message routing core. The new connector layer owns user-visible setup and connection lifecycle. For platforms backed by predeclared sidecar services, Beaver should expose the sidecar to the runtime as an `ExternalConnectorChannel` rather than a Beaver-owned platform protocol adapter.
-
-## Why This Is Required
-
-The current channel design assumes a channel is already configured before the backend starts. That is enough for local development and simple webhook/token channels, but it does not match real platform onboarding:
-
- Feishu/Lark now has a Channel SDK pattern that packages bot channel setup, WebSocket or webhook transport, event handling, and replies around an installed app identity.
- Weixin personal-account setup uses a docker-compose predeclared sidecar connector plus QR login and persistent login state.
- Terminal devices need pairing or device registration; a raw `peer_id` connect frame is not enough for a real deployment.
- Even simple token platforms such as Telegram need a UI flow for token entry, validation, status, revoke, and restart.
-
-So Beaver needs a connection lifecycle layer. Adapters should not be responsible for prompting the user, installing packages, storing long-lived credentials, or deciding whether an unknown device is allowed to bind.
-
-## Non-Goals
-
- Do not replace `ChannelRuntime`, `MessageBus`, `ChannelManager`, or `AgentService`.
- Do not make every connector a Node sidecar. Node sidecars are allowed when the official or practical SDK path requires them.
- Do not implement every channel in this phase.
- Do not build a plugin marketplace in this phase.
- Do not store platform secrets in plain channel config when a credential store is available.
- Do not let external connector code call `AgentService` directly.
-
-## Core Terms
-
-`ChannelConnection` is the user-visible connection instance. Examples: "Weixin personal account", "Lark workspace bot", "Telegram main bot", "Desk terminal".
-
-`ChannelConnector` is the setup and lifecycle controller for one platform family. It starts pairing sessions, validates credentials, checks preconfigured connector endpoints when needed, handles reconnects, and emits runtime channel config.
-
-`ChannelAdapter` is the message transport adapter used by `ChannelRuntime`. It receives normalized inbound messages and sends outbound replies. It does not own onboarding.
-
-`ExternalConnectorChannel` is the runtime channel object used when a platform protocol lives outside the Python backend. It implements the same `start()`, `stop()`, and `send()` contract as an adapter, but its `send()` method calls an external connector HTTP API and inbound messages enter Beaver through a connector bridge endpoint.
-
-`ExternalConnectorProcess` is an optional preconfigured service for platforms whose SDK or login behavior is better isolated outside the Python backend. For Weixin, this process is a docker-compose predeclared sidecar service. Beaver must not dynamically create containers or require Docker socket access.
-
-## Data Model
-
-Add a durable connection store under the backend workspace:
-
-```python
-from dataclasses import dataclass
-from typing import Any
-
-@dataclass
-class ChannelConnection:
-    connection_id: str
-    owner_user_id: str | None
-    channel_id: str
-    kind: str
-    mode: str
-    display_name: str
-    account_id: str
-    status: str
-    auth_type: str
-    credentials_ref: str | None
-    connector_ref: str | None
-    pairing_session_id: str | None
-    runtime_config: dict[str, Any]
-    capabilities: list[str]
-    created_at: str
-    updated_at: str
-    last_seen_at: str | None
-    last_error: str | None
-```
-
-`status` values:
-
- `draft`: setup has started but no credentials are usable.
- `pairing`: waiting for QR scan, OAuth callback, device approval, or token validation.
- `connected`: credentials are valid and the runtime channel can start.
- `running`: the runtime adapter or external connector is active.
- `degraded`: partially working, for example inbound works but media upload failed.
- `error`: connection cannot start or authenticate.
- `revoked`: user or platform revoked the connection.
-
-Credential material should live behind `credentials_ref`, not inline in `ChannelConnection`. For the first local implementation, the reference may point to an encrypted file or a restricted JSON store. The interface should still look like a credential vault so AuthZ or a real secret backend can replace it later.
-
-## Connector Contract
-
-Every connector implements a setup contract:
-
-```python
-from __future__ import annotations
-
-from typing import Protocol
-
-
-class ChannelConnector(Protocol):
-    kind: str
-
-    async def start_pairing(self, request: StartPairingRequest) -> PairingSession: ...
-    async def complete_pairing(self, event: PairingEvent) -> ChannelConnection: ...
-    async def validate(self, connection_id: str) -> ValidationResult: ...
-    async def materialize_runtime(self, connection_id: str) -> ChannelRuntimeSpec: ...
-    async def revoke(self, connection_id: str) -> None: ...
-```
-
-`materialize_runtime()` returns the adapter-ready config:
-
-```python
-@dataclass
-class ChannelRuntimeSpec:
-    channel_id: str
-    kind: str
-    mode: str
-    account_id: str
-    display_name: str
-    config: dict[str, Any]
-    secrets_ref: str | None
-    external_endpoint: str | None
-```
-
-The runtime may still internally use `ChannelConfig`, but the source of truth becomes `ChannelConnectionStore`, not only static `BeaverConfig.channels`.
-
-## Control APIs
-
-Add backend APIs for the connection UI:
-
-```text
-GET    /api/channel-connectors
-GET    /api/channel-connections
-POST   /api/channel-connections
-GET    /api/channel-connections/{connection_id}
-POST   /api/channel-connections/{connection_id}/pairing/start
-POST   /api/channel-connections/{connection_id}/pairing/complete
-POST   /api/channel-connections/{connection_id}/validate
-POST   /api/channel-connections/{connection_id}/start
-POST   /api/channel-connections/{connection_id}/stop
-POST   /api/channel-connections/{connection_id}/revoke
-GET    /api/channel-connections/{connection_id}/events
-```
-
-The existing `/api/channels` status endpoint can keep reporting runtime adapter status, but the UI should prefer `/api/channel-connections` for setup state.
-
-## UI Flow
-
-The status page becomes a channel connection page:
-
-```text
-Add Channel
-> choose platform
-> connector-specific setup form
-> QR/OAuth/token/app credential validation
-> connection status
-> start runtime channel
-> test message or platform health check
-```
-
-The UI must distinguish:
-
- setup state: pairing, credential validation, revoked.
- runtime state: adapter running, disconnected, outbound failed.
- platform state: QR expired, app not installed, permission missing, token invalid.
-
-This avoids the current problem where all failures collapse into adapter startup errors.
-
-## External Connector Process
-
-Some channels should run through an external process:
-
-```text
-ExternalConnectorProcess
-> Beaver connector control API
-> local Unix/TCP/WebSocket bridge
-> ChannelRuntime ExternalConnectorChannel
-```
-
-The external process must not receive permanent backend admin credentials through QR codes or copied commands. It should receive a short-lived pairing token with a narrow scope:
-
-```text
-scope: channel:pair
-kind: weixin
-expires_in: 10 minutes
-one_time: true
-```
-
-After pairing, Beaver stores the resulting connection credentials and gives the connector a renewable connection token scoped to that connection only. For docker-compose sidecars, that token is passed through the connector HTTP API or service configuration agreed for that sidecar; Beaver does not create or restart the sidecar container.
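
A short-lived, one-time pairing token could be issued and redeemed roughly as follows. `PairingToken` and `PairingTokenStore` are hypothetical names, and the in-memory dict stands in for whatever durable store Beaver would actually use.

```python
import secrets
import time
from dataclasses import dataclass


@dataclass
class PairingToken:
    token: str
    kind: str
    expires_at: float
    scope: str = "channel:pair"
    used: bool = False


class PairingTokenStore:
    """In-memory sketch; a real store would persist tokens."""

    def __init__(self, ttl_seconds=600, clock=time.time):
        self._ttl = ttl_seconds  # 10 minutes, matching the scope example
        self._clock = clock
        self._tokens = {}

    def issue(self, kind):
        token = PairingToken(secrets.token_urlsafe(32), kind, self._clock() + self._ttl)
        self._tokens[token.token] = token
        return token

    def redeem(self, token_value, kind):
        """Return True once for a valid, unexpired token of the right kind."""
        token = self._tokens.get(token_value)
        if token is None or token.used or token.kind != kind:
            return False
        if self._clock() >= token.expires_at:
            return False
        token.used = True  # enforce one_time
        return True
```

Injecting `clock` makes expiry testable without waiting. The renewable connection token issued after pairing is out of scope for this sketch.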
-
-## Per-Channel Assessment
-
-### Feishu / Lark
-
-Feishu/Lark should be a first-class connector, not only a static adapter.
-
-Recommended first implementation:
-
- connector kind: `feishu`
- setup fields: domain, app id, app secret, connection mode.
- default mode: WebSocket long connection.
- optional mode: webhook.
- runtime adapter: may be Python if coverage is sufficient, or an external Node connector when using official Channel SDK behavior.
-
-Required setup checks:
-
- app credentials are present.
- bot/event permissions are configured.
- event subscription mode is valid.
- bot identity can be resolved.
- a test direct message or event subscription health check can run when available.
-
-The connector should expose both "manual app credential setup" and future "install from app template" paths. The manual path is enough for the first Beaver release.
-
-### Weixin
-
-Weixin should use a docker-compose predeclared sidecar connector.
-
-Recommended first implementation:
-
- connector kind: `weixin`
- setup mode: Beaver calls the sidecar HTTP API to start QR login and poll pairing state.
- external process: required, predeclared in docker-compose, and never dynamically created by Beaver.
- runtime channel: `ExternalConnectorChannel`.
-
-Required setup checks:
-
- sidecar base URL is configured.
- sidecar health endpoint responds.
- connector version is compatible with Beaver.
- QR session is pending, scanned, confirmed, expired, or failed.
- login state is stored behind `credentials_ref`.
- connector heartbeat is visible.
-
-The sidecar owns Weixin protocol handling, QR login, inbound receive, outbound send, and login-state persistence. Beaver owns connector setup state, bridge API validation, message normalization boundaries, runtime dedupe, and outbound HTTP calls to the sidecar `/send` API.
-
-The agreed runtime flow is:
-
-```text
-Weixin sidecar connector
-> Beaver connector bridge endpoint
-> ChannelRuntime.accept_inbound()
-> MessageBus
-> AgentService
-
-AgentService
-> MessageBus outbound
-> ExternalConnectorChannel.send()
-> Weixin sidecar connector /send
-```
-
-Group delivery remains best-effort. The connector must surface group capability separately from direct message capability.
-
-### Telegram
-
-Telegram can be implemented as an internal connector plus internal adapter.
-
-Recommended first implementation:
-
- setup mode: bot token entry.
- validation: call Telegram `getMe`.
- runtime mode: polling by default, webhook optional.
- no external process required.
-
-The UI still treats it as a connector so users can add, validate, revoke, and restart it without editing JSON.
-
-### QQBot
-
-QQBot should start as an internal connector with official gateway credentials.
-
-Recommended first implementation:
-
- setup fields: app id, client secret, intents or permission hints.
- runtime mode: WebSocket gateway.
- validation: token exchange or gateway auth dry run when available.
-
-If SDK/runtime behavior later becomes easier outside Python, this connector can move to an external process without changing the runtime message contract.
-
-### Terminal
-
-Terminal should move from raw `peer_id` to pairing.
-
-Recommended first implementation:
-
- UI creates a terminal pairing session.
- Beaver displays a command or QR/setup code.
- device connects with one-time pairing token.
- Beaver binds a stable device identity to a `ChannelConnection`.
- subsequent WebSocket `connect` frames authenticate as the bound device.
-
-The message protocol can keep `connect`, `connected`, `message`, `ack`, and assistant `message`, but production connections must include an authenticated device token.
-
-## Message Flow After Pairing
-
-Once a connection is paired, the message path stays unchanged:
-
-```text
-platform or device
-> connector transport
-> ChannelAdapter
-> ChannelRuntime.accept_inbound()
-> MessageBus.inbound
-> AgentService.handle_inbound_message()
-> MessageBus.outbound
-> ChannelManager.dispatch_outbound()
-> ChannelAdapter.send()
-> connector transport
-> platform or device
-```
-
-This is intentionally conservative. Pairing changes how a channel becomes trusted and running; it does not change the agent loop.
-
-## Access Control
-
-Connection setup requires a Beaver user or backend owner identity. The connector layer decides who may create, view, revoke, or start a connection.
-
-Inbound platform messages still use adapter-level policy:
-
- `open`: accept platform scope.
- `allowlist`: accept only known users/groups.
- `disabled`: ignore that scope.
-
-The important change is that allowlists belong to the connection settings, not only to ad hoc adapter config.
-
-## Error Handling
-
-Pairing errors:
-
- expired pairing token.
- QR not scanned before timeout.
- OAuth callback state mismatch.
- platform permission missing.
- credentials validation failed.
-
-Runtime errors:
-
- adapter startup failed.
- connector process unavailable.
- heartbeat missed.
- inbound normalization failed.
- outbound delivery failed.
-
-Each event should be recorded against `connection_id` and, when available, `channel_id` and `session_id`.
-
-## Security Requirements
-
- Pairing tokens are short-lived, one-time, and scoped to one connector kind.
- QR codes never embed permanent backend credentials.
- External connector processes do not receive broad backend admin tokens.
- Revoking a connection invalidates connector tokens and stops the runtime channel.
- Stored platform credentials are referenced by `credentials_ref`.
- Event logs must not include raw secrets, tokens, QR payloads, or full platform credential responses.
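The redaction requirement could be approached with a key-based filter applied before an event is written. This is only a sketch: `SENSITIVE_MARKERS`, `redact`, and the `[REDACTED]` placeholder are assumptions, and a real implementation may also need to scrub secrets embedded in free-text values.

```python
# Hypothetical marker list: any key containing one of these is masked.
SENSITIVE_MARKERS = ("token", "secret", "password", "qr", "credential")


def _is_sensitive(key) -> bool:
    lowered = str(key).lower()
    return any(marker in lowered for marker in SENSITIVE_MARKERS)


def redact(value):
    """Return a copy of value with sensitive dictionary entries masked."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if _is_sensitive(key) else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value
```

The function builds new containers rather than mutating its input, so the caller can still use the original payload after logging.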
-
-## Relationship To Existing Channel Specs
-
-The terminal WebSocket spec remains valid as a development transport spec, but production terminal setup must add pairing.
-
-The chat platform adapter spec remains valid as a runtime adapter spec, but these statements should be revised before implementation:
-
- "Do not introduce a Node sidecar as the default channel architecture" should become "Use internal adapters by default, but allow external connector processes where platform SDK or login state requires them."
- "Pairing is out of scope for this phase" should become "Pairing is owned by the connector layer; adapters assume a materialized connection."
- Static `BeaverConfig.channels` should become a development override and backward-compatible import path, not the only source of runtime channels.
-
-## Rollout
-
-Implement in this order:
-
-1. `ChannelConnectionStore`, connector registry, and connection status APIs.
-2. Telegram connector as the simplest token-based setup path.
-3. Terminal pairing to remove raw unauthenticated `peer_id` usage.
-4. Feishu/Lark connector with WebSocket long-connection mode and credential validation.
-5. Weixin external connector bridge with QR pairing.
-6. QQBot connector after the common credential and gateway patterns are stable.
-
-This order proves the common connector lifecycle with a low-risk token channel before adding QR and external process complexity.
-
-## Testing
-
-Add unit tests for:
-
- connection store create/update/revoke.
- pairing token expiry and one-time use.
- connector registry dispatch by kind.
- materializing runtime specs from connections.
- secret redaction in events.
- adapter runtime still receiving normalized `InboundMessage`.
-
-Add integration-style tests with fake connectors for:
-
- successful token setup.
- QR expired and QR completed.
- external connector heartbeat loss.
- revoke stops runtime dispatch.
-
-Live platform tests remain manual or gated behind explicit environment variables.
-
-## Acceptance Criteria
-
- A user can add a channel connection without editing backend JSON.
- Beaver can show setup state separately from runtime adapter state.
- Telegram can validate a bot token and materialize a runtime channel.
- Terminal can bind through a one-time pairing flow.
- Feishu/Lark design allows official SDK or Node connector use when needed.
- Weixin design requires an external connector and QR login state.
- Existing channel runtime message flow remains bus-first and adapter-mediated.
--- a/docs/superpowers/specs/2026-06-02-chat-platform-channel-adapters-design.md
+++ b/docs/superpowers/specs/2026-06-02-chat-platform-channel-adapters-design.md
@ -1,307 +0,0 @@
-# Chat Platform Channel Adapters Design
-
-Date: 2026-06-02
-
-## Goal
-
-Add first-class Beaver channel adapters for four messaging platforms:
-
- `FeishuAdapter`
- `QQBotAdapter`
- `TelegramAdapter`
- `ExternalConnectorChannel` for Weixin personal-account sidecars
-
-Each runtime channel must plug into the existing `ChannelRuntime`, normalize inbound platform messages into `InboundMessage` with `ChannelIdentity`, and deliver `OutboundMessage` replies back to the original platform conversation. Feishu, QQBot, and Telegram use Beaver-owned protocol adapters. Weixin personal-account support uses a docker-compose predeclared sidecar connector, so Beaver exposes it as an `ExternalConnectorChannel` rather than a Beaver-owned `WeixinAdapter`.
-
-## Non-Goals
-
- Do not make external sidecars the default channel architecture: use internal adapters by default, but allow external connector processes where platform SDK or login state requires them.
- Do not implement WhatsApp in this phase.
- Do not replace `ChannelRuntime`, `MessageBus`, or `ChannelManager`.
- Do not move platform access policy into `AgentService`.
- Do not implement streaming token deltas for these channels in this phase.
- Do not promise stable Weixin group support; Weixin group delivery is best-effort only.
-
-## Architecture
-
-Keep Beaver's channel runtime as the owner of lifecycle, dedupe, event logging, and agent dispatch.
-
-```text
-platform SDK/API or sidecar connector
-> {Channel}Adapter or ExternalConnectorChannel bridge endpoint
-> ChannelRuntime.accept_inbound()
-> MessageBus.inbound
-> ChannelRuntime bridge
-> AgentService.handle_inbound_message()
-> MessageBus.outbound
-> ChannelManager.dispatch_outbound()
-> {Channel}Adapter.send() or ExternalConnectorChannel.send()
-> platform SDK/API or sidecar connector API
-```
-
-Adapters own platform-specific transport and delivery details when Beaver directly integrates a platform API. For Weixin, the sidecar owns the platform protocol, QR login, receive loop, send behavior, and login-state persistence. The runtime owns Beaver session identity, dedupe, event logging, and run dispatch in both cases.
-
-## Shared Adapter Contract
-
-Each runtime channel implements the existing `ChannelAdapter` protocol:
-
-```python
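-from typing import Protocol
-
-
-class ChannelAdapter(Protocol):
-    channel_id: str
-    kind: str
-    mode: str
-
-    async def start(self) -> None: ...
-    async def stop(self) -> None: ...
-    async def send(self, message: "OutboundMessage") -> None: ...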
-```
-
-Each Beaver-owned adapter receives a `ChannelInboundSink` and calls `accept_inbound()` for every normalized user message. `ExternalConnectorChannel` receives inbound Weixin messages through Beaver's connector bridge endpoint, then submits normalized messages to `ChannelRuntime.accept_inbound()`.
-
-For all four runtime channels:
-
- `kind` is one of `feishu`, `qqbot`, `telegram`, `weixin`.
- `account_id` comes from channel config.
- inbound messages must include `ChannelIdentity`.
- outbound replies route by `message.channel_identity` when present, falling back to `message.session_id`.
- unsupported media is represented as text metadata in phase one rather than dropped silently.
-
-## Channel Configuration
-
-All channels use the existing `BeaverConfig.channels` map.
-
-```json
-{
-  "channels": {
-    "telegram-main": {
-      "enabled": true,
-      "kind": "telegram",
-      "mode": "polling",
-      "accountId": "bot-main",
-      "displayName": "Telegram Main",
-      "secrets": {
-        "botToken": "..."
-      },
-      "config": {
-        "requireMentionInGroups": true,
-        "maxMessageChars": 4096
-      }
-    }
-  }
-}
-```
-
-Config keys stay channel-specific inside `config` and `secrets`. The factory chooses the adapter by `kind` and `mode`.
-
-For sidecar-backed channels, config also includes the connector base URL and bridge settings. Beaver must call the already-running connector HTTP API and must not dynamically create containers or require Docker socket access.
-
-## Identity Mapping
-
-All adapters map platform identity into `ChannelIdentity`:
-
- `channel_id`: configured Beaver channel id, such as `telegram-main`
- `kind`: platform kind
- `account_id`: configured account id
- `peer_id`: platform chat, group, openid, or user conversation id
- `thread_id`: platform topic/thread id when applicable
- `peer_type`: `dm`, `group`, `channel`, or platform-specific value
- `user_id`: platform sender id when available
- `message_id`: platform message id or event id
-
-The runtime continues to derive sessions as:
-
-```text
-<channel_id>:<account_id>:<peer_id>[:<thread_id>]
-```
-
-Group sessions can later become per-user or per-thread by adding adapter-level `thread_id` rules without changing `ChannelRuntime`.
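The session key format above can be expressed as a small helper. `derive_session_id` is a hypothetical name; the spec only fixes the resulting string layout.

```python
from typing import Optional


def derive_session_id(channel_id: str, account_id: str, peer_id: str,
                      thread_id: Optional[str] = None) -> str:
    """Join identity fields as <channel_id>:<account_id>:<peer_id>[:<thread_id>]."""
    parts = [channel_id, account_id, peer_id]
    if thread_id:  # the thread segment is appended only when one applies
        parts.append(thread_id)
    return ":".join(parts)
```

For example, `derive_session_id("telegram-main", "bot-main", "12345")` yields `telegram-main:bot-main:12345`. The sketch does not escape `:` inside the fields, so the key should be treated as opaque rather than split back into parts.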
-
-## Adapter Scope
-
-### FeishuAdapter
-
-Supports:
-
- WebSocket long connection as the preferred mode.
- Optional webhook mode if configured.
- Direct messages.
- Group messages gated by mention or config.
- Text outbound replies.
- Basic inbound media metadata and cached local file paths when available.
-
-Configuration:
-
- `secrets.appId`
- `secrets.appSecret`
- `config.domain`: `feishu` or `lark`
- `config.connectionMode`: `websocket` or `webhook`
- `config.requireMentionInGroups`
- `config.allowFrom`
- `config.groupAllowFrom`
-
-### QQBotAdapter
-
-Supports:
-
- Official QQ Bot WebSocket gateway for inbound events.
- Official REST API for outbound text replies.
- Private C2C messages.
- Group messages.
- Guild/channel messages when the platform event provides them.
- Basic rich media intake as cached local files or text metadata.
-
-Configuration:
-
- `secrets.appId`
- `secrets.clientSecret`
- `config.markdownSupport`
- `config.dmPolicy`
- `config.allowFrom`
- `config.groupPolicy`
- `config.groupAllowFrom`
-
-### TelegramAdapter
-
-Supports:
-
- Bot API long polling as the default mode.
- Optional webhook mode if configured.
- Direct messages.
- Group messages gated by mention or config.
- Text replies with platform-safe formatting and chunking.
- Photo/document/audio/video intake as cached local files or metadata.
-
-Configuration:
-
- `secrets.botToken`
- `config.mode`: `polling` or `webhook`
- `config.webhookUrl`
- `config.webhookSecret`
- `config.requireMentionInGroups`
- `config.allowFrom`
- `config.groupAllowFrom`
- `config.maxMessageChars`
-
-### ExternalConnectorChannel For Weixin
-
-Supports:
-
- Docker-compose predeclared sidecar connector.
- QR-login sessions started and observed through the sidecar HTTP API.
- Direct messages.
- Text replies sent through the sidecar `/send` API.
- Media send/receive when the sidecar provides normalized metadata.
- Group delivery as best-effort only.
-
-Configuration:
-
- `secrets.connectionToken`
- `config.accountId`
- `config.baseUrl`
- `config.bridgeSecret`
- `config.dmPolicy`
- `config.allowFrom`
- `config.groupPolicy`
- `config.groupAllowFrom`
- `config.maxMessageChars`
-
-Inbound flow:
-
-```text
-Weixin sidecar connector
-> Beaver connector bridge endpoint
-> ChannelRuntime.accept_inbound()
-> MessageBus
-> AgentService
-```
-
-Outbound flow:
-
-```text
-AgentService
-> MessageBus outbound
-> ExternalConnectorChannel.send()
-> Weixin sidecar connector /send
-```
-
-The sidecar is the Weixin protocol adapter. Beaver's `ExternalConnectorChannel` only validates bridge calls, normalizes the sidecar event boundary, preserves runtime dedupe/session semantics, and forwards outbound sends to the sidecar HTTP API.
-
-## Access Control
-
-Adapters may block inbound messages before calling `accept_inbound()` when the platform has channel-native allowlist settings. Runtime dedupe still applies after adapter admission.
-
-Initial policy values:
-
- `open`: allow matching platform scope.
- `allowlist`: require `allowFrom` or `groupAllowFrom`.
- `disabled`: ignore inbound messages for that scope.
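The three policy values can be read as a simple admission check that an adapter runs before calling `accept_inbound()`. The function name and signature are assumptions for illustration; `allow_from` stands in for either `allowFrom` or `groupAllowFrom`, depending on scope.

```python
def admit_inbound(policy: str, sender_id: str, allow_from=frozenset()) -> bool:
    """Decide whether an inbound message passes adapter-level policy."""
    if policy == "open":
        return True  # accept anything in this platform scope
    if policy == "allowlist":
        return sender_id in allow_from  # only known users or groups
    if policy == "disabled":
        return False  # ignore this scope entirely
    raise ValueError(f"unknown policy: {policy!r}")
```

Raising on an unknown value, rather than silently admitting or dropping, is a design choice of this sketch so that a misconfigured policy shows up as an error.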
-
-Pairing is owned by the connector layer. Platform adapters assume a materialized `ChannelConnection` and adapter-ready runtime config. For Weixin personal-account support, the runtime channel is an `ExternalConnectorChannel`, not a Beaver-owned `WeixinAdapter`.
-
-## Delivery Semantics
-
-Inbound:
-
- validate required routing fields before submitting to runtime.
- preserve raw platform payload in metadata only when useful for debugging.
- keep metadata small enough for event logs.
- include media paths in metadata and text summaries in `content` when the agent needs to know an attachment exists.
-
-Outbound:
-
- send only final assistant replies in phase one.
- chunk messages to platform limits.
- mark `delivery_status = "unclaimed"` when a target cannot be resolved.
- raise or return delivery failures so `ChannelManager` records `outbound_delivery_failed`.
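Chunking to platform limits might look like the following sketch, which prefers to break at a newline and otherwise cuts at the limit. `chunk_message` is a hypothetical helper, and it counts Python characters; a given platform may measure message length differently.

```python
def chunk_message(text: str, max_chars: int) -> list:
    """Split text into pieces of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks = []
    remaining = text
    while len(remaining) > max_chars:
        # Prefer the last newline inside the window, else make a hard cut.
        cut = remaining.rfind("\n", 0, max_chars + 1)
        if cut <= 0:
            cut = max_chars
        chunks.append(remaining[:cut])
        # Drop the newline(s) that the break consumed.
        remaining = remaining[cut:].lstrip("\n")
    if remaining:
        chunks.append(remaining)
    return chunks
```

With `config.maxMessageChars` set to 4096, as in the Telegram example configuration, a reply shorter than the limit is returned as a single chunk.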
-
-## Runtime Status
-
-`ChannelRuntime.statuses()` should report platform channels with:
-
- `channel_id`
- `kind`
- `mode`
- `display_name`
- `enabled`
- `state`
- `account_id`
- `last_error`
- `last_event_at`
- `capabilities`
-
-Capabilities are conservative:
-
- Feishu: `receive_text`, `send_text`, `receive_media`, `groups`
- QQBot: `receive_text`, `send_text`, `receive_media`, `groups`
- Telegram: `receive_text`, `send_text`, `receive_media`, `groups`
- Weixin: `receive_text`, `send_text`, `receive_media`, `direct_messages`
-
-## Error Handling
-
- Adapter startup failure sets channel state to `error` and does not stop other channels.
- Runtime shutdown calls every adapter `stop()`.
- Platform transient errors should retry inside the adapter only when retrying cannot duplicate user-visible sends.
- Fatal credential/config errors should surface in channel status.
- Inbound duplicates are handled by existing `ChannelDedupeStore`.
-
-## Testing
-
-Add tests in small layers:
-
- factory tests for `kind` and `mode` adapter selection.
- identity normalization tests for each platform.
- inbound adapter tests using fake platform payloads.
- outbound adapter tests with fake platform clients.
- runtime status tests for configured enabled/disabled/error channels.
-
-Live network tests are out of scope for unit tests. Adapter constructors should accept injectable clients or lightweight transport functions so tests do not call real platform APIs.
-
-## Rollout
-
-Implement one adapter at a time:
-
-1. Telegram
-2. Feishu
-3. QQBot
-4. Weixin
-
-Telegram is first because its bot-token flow and text path are the simplest proof of the shared adapter pattern. Weixin is last because QR/login state, context tokens, and media handling are more specialized.
--- a/docs/superpowers/specs/2026-06-02-external-sidecar-connectors-design.md
+++ b/docs/superpowers/specs/2026-06-02-external-sidecar-connectors-design.md
@ -1,592 +0,0 @@
-# External Sidecar Connectors Design
-
-Date: 2026-06-02
-
-## Goal
-
-Add real Weixin personal-account QR login and Feishu/Lark plugin onboarding to Beaver through a docker-compose predeclared sidecar service, without binding Beaver's connector layer to one vendor runtime. Beaver must not dynamically create containers or require Docker socket access.
-
-This design implements the next connector layer after `docs/superpowers/plans/2026-06-02-channel-connectors-foundation.md`.
-
-## Design Corrections
-
-This design intentionally fixes six architecture constraints before implementation:
-
- The sidecar is generic. Beaver depends on a connector HTTP contract, not on one vendor runtime.
- Pairing is modeled as a broader `ConnectorSession`, because Feishu/Lark install/link flows are not only QR pairing.
- Bridge events include `eventId`, `timestamp`, and `deliveryAttempt`, and Beaver dedupes bridge events before they can trigger duplicate agent replies.
- Bridge authentication is service-level in the first version. The shared connector token lives in environment variables, not per-connection credentials.
- Outbound sidecar sends include a required `requestId` so sidecar retries are idempotent.
- Connected sessions dynamically register runtime channels. A successful Weixin or Feishu/Lark connection must not require a Beaver restart.
-
-## Scope
-
-Included:
-
- A repo-local `external-connector` sidecar service.
- A docker-compose service declaration for the sidecar.
- A sidecar `ConnectorProvider` abstraction.
- A production `VendorCliProvider` that runs the real vendor CLI/plugin commands required for Weixin personal-account QR login and Feishu/Lark plugin onboarding.
- Sidecar HTTP API for health, connector metadata, connector sessions, logout/remove, outbound send, and inbound event forwarding.
- Beaver `WeixinConnector` and `FeishuConnector` objects registered in `ChannelConnectorRegistry`.
- Beaver connector bridge endpoints that accept normalized sidecar inbound events and submit them to `ChannelRuntime.accept_inbound()`.
- `MessageDedupeStore` for connector bridge event idempotency.
- `ExternalConnectorChannel` runtime object for sidecar-backed outbound sends.
- `ChannelRuntime.add_channel()` and `ChannelRuntime.remove_channel()` for dynamic runtime activation.
- Web UI connection wizard for Weixin QR login and Feishu/Lark plugin onboarding.
- Unit tests using fake sidecar providers and bridge events.
-
-Excluded:
-
- Dynamic Docker container creation from Beaver.
- Docker socket mounts in Beaver.
- Reimplementing Weixin iLink or Feishu/Lark plugin protocols inside Beaver.
- Building a generic plugin marketplace.
- Multi-user enterprise permission governance beyond local connector ownership and bridge token validation.
-
-## Architecture
-
-Use one predeclared sidecar for external connector providers:
-
-```text
-Beaver backend
-> Connector HTTP client
-> external-connector sidecar
-> ConnectorProvider
-> provider-specific runtime or CLI
-> Weixin / Feishu / future platform
-```
-
-Beaver owns:
-
- connection state in `ChannelConnectionStore`
- credential references in `CredentialStore`
- connector session state exposed to the web UI
- service-level connector authentication
- bridge event dedupe
- normalized runtime message admission
- runtime channel lifecycle
- runtime dedupe/session identity
- outbound dispatch into sidecar `/send`
-
-The sidecar owns:
-
- provider runtime state
- provider install/update commands
- Weixin QR login and login-state persistence
- Feishu/Lark plugin install, bot creation/linking, and provider-side verification
- platform receive loops
- sidecar-to-Beaver inbound event delivery
-
-## ConnectorProvider
-
-The sidecar must isolate provider-specific behavior behind a provider contract. Beaver must not know which provider implementation is active.
-
-```ts
-interface ConnectorProvider {
-  providerId: string;
-  connectors(): ConnectorDescriptor[];
-  health(): Promise<ProviderHealth>;
-  startSession(input: StartConnectorSessionInput): Promise<ConnectorSessionView>;
-  getSession(sessionId: string): Promise<ConnectorSessionView>;
-  cancelSession(sessionId: string): Promise<void>;
-  logout(connectionId: string): Promise<void>;
-  send(input: SendMessageInput): Promise<SendMessageResult>;
-}
-```
-
-Initial provider:
-
- `VendorCliProvider`: runs the real CLI/plugin commands required by the current Weixin and Feishu/Lark vendor flows.
-
-`VendorCliProvider` command execution is intentionally constrained:
-
- Command templates are read only from sidecar startup environment variables.
- Frontend requests and sidecar HTTP request bodies cannot provide command strings.
- Command working directory is fixed to `CONNECTOR_HOME`.
- Per-connection state paths may be passed to commands as formatted arguments.
- Every command has a hard timeout.
- stdout and stderr are redacted before storage or API responses.
-
-Future providers can be added without changing Beaver runtime code:
-
- `WechatyProvider`
- `NapcatProvider`
- `OneBotProvider`
- `EnterpriseWeixinProvider`
-
-Provider choice is sidecar configuration, not Beaver architecture. `ExternalConnectorChannel` only calls the sidecar HTTP contract.
-
-## Runtime Flow
-
-Inbound:
-
-```text
-platform event
-> ConnectorProvider inside sidecar
-> sidecar normalized bridge event
-> POST Beaver /api/channel-connector-bridge/events
-> MessageDedupeStore
-> ChannelRuntime.accept_inbound()
-> MessageBus
-> AgentService
-```
-
-Outbound:
-
-```text
-AgentService
-> MessageBus outbound
-> ChannelManager.dispatch_outbound()
-> ExternalConnectorChannel.send()
-> POST sidecar /send
-> ConnectorProvider.send()
-> platform
-```
-
-`ExternalConnectorChannel` implements the existing runtime channel protocol:
-
-```python
-class ExternalConnectorChannel:
-    channel_id: str
-    kind: str
-    mode: str
-
-    async def start(self) -> None: ...
-    async def stop(self) -> None: ...
-    async def send(self, message: "OutboundMessage") -> None: ...
-```
-
-It is not a platform protocol adapter. It is a generic HTTP bridge to a sidecar.
-
-Runtime materialization for sidecar-backed connections always emits:
-
-```python
-ChannelConfig(
-    enabled=True,
-    kind="external_connector",
-    mode="http",
-    account_id=spec.account_id,
-    display_name=spec.display_name,
-    config={
-        "platformKind": "weixin",
-        "connectionId": "conn_...",
-        "sidecarBaseUrl": "http://external-connector:8787",
-    },
-    secrets={},
-)
-```
-
-The original `ChannelConnection.kind` remains `weixin` or `feishu`; only the runtime transport kind is generic.
-
-`ExternalConnectorChannel` authenticates outbound calls with the service-level connector token configured in Beaver's process environment, not with a per-channel secret. A first-version deployment may use one shared token value for both directions, exposed as `EXTERNAL_CONNECTOR_TOKEN` to Beaver and `BEAVER_BRIDGE_TOKEN` to the sidecar.
-
-## Dynamic Runtime Activation
-
-A connected connector session must activate without restarting Beaver.
-
-Add runtime methods:
-
-```python
-async def add_channel(self, channel_id: str, config: ChannelConfig) -> None:
-    ...
-
-async def remove_channel(self, channel_id: str) -> None:
-    ...
-```
-
-`add_channel()` must run under a runtime lifecycle lock and has deterministic duplicate semantics:
-
- Same `channel_id` and same effective `ChannelConfig`: no-op.
- Same `channel_id` and changed effective `ChannelConfig`: build and start the replacement adapter before swapping it into the manager; after the swap succeeds, stop the old adapter.
- Replacement start failure: keep the old adapter registered and running, and return the failure to the caller.
- First registration after runtime start: build the adapter, register it, then start only that adapter.
-
-`remove_channel()` must also run under the lifecycle lock. Missing channel ids are no-op; existing channels are stopped and unregistered.
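The duplicate semantics described above can be sketched with an `asyncio.Lock` guarding both methods. `RuntimeSketch` and `FakeAdapter` are hypothetical stand-ins, and plain dictionaries replace `ChannelConfig`; the real `ChannelRuntime` and `ChannelManager` are not shown in this spec. What happens when the first start of a new channel fails is not stated in the text, so the sketch simply lets the error propagate.

```python
import asyncio


class FakeAdapter:
    """Hypothetical stand-in for a real channel adapter."""

    def __init__(self, channel_id, config):
        self.channel_id = channel_id
        self.config = config
        self.running = False

    async def start(self):
        if self.config.get("fail"):
            raise RuntimeError("start failed")
        self.running = True

    async def stop(self):
        self.running = False


class RuntimeSketch:
    def __init__(self, build=FakeAdapter):
        self._build = build
        self._lock = asyncio.Lock()  # the runtime lifecycle lock
        self.adapters = {}

    async def add_channel(self, channel_id, config):
        async with self._lock:
            old = self.adapters.get(channel_id)
            if old is None:
                # First registration: build, register, then start only this adapter.
                new = self._build(channel_id, config)
                self.adapters[channel_id] = new
                await new.start()
                return
            if old.config == config:
                return  # same effective config: no-op
            new = self._build(channel_id, config)
            await new.start()  # if this raises, the old adapter stays registered
            self.adapters[channel_id] = new  # swap in the replacement
            await old.stop()  # stop the old adapter only after the swap

    async def remove_channel(self, channel_id):
        async with self._lock:
            adapter = self.adapters.pop(channel_id, None)
            if adapter is not None:  # missing ids are a no-op
                await adapter.stop()
```

Starting the replacement before touching the registry is what keeps the old adapter running when the new one fails to start.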
-
-When a connector session reaches `connected`:
-
-```text
-Connector session connected
-> connector updates ChannelConnection
-> registry materializes ChannelConfig
-> ChannelRuntime.add_channel(channel_id, config)
-> ChannelManager.register(adapter)
-> adapter.start()
-> channel status becomes running
-```
-
-This is a hard requirement for Weixin and Feishu/Lark onboarding. Manual backend restart is not an acceptable success path for this feature.
-
-`remove_channel()` is used when a user logs out or revokes a sidecar connection:
-
-```text
-logout / revoke
-> sidecar logout
-> ChannelRuntime.remove_channel(channel_id)
-> connection status revoked or disconnected
-```
-
-## Sidecar Deployment
-
-Add a sidecar service that can be enabled in deployment:
-
-```yaml
-services:
-  external-connector:
-    build: ./external-connector
-    restart: unless-stopped
-    environment:
-      BEAVER_BRIDGE_BASE_URL: http://app-instance:8080
-      BEAVER_BRIDGE_TOKEN: ${BEAVER_BRIDGE_TOKEN}
-      CONNECTOR_API_TOKEN: ${EXTERNAL_CONNECTOR_TOKEN}
-      CONNECTOR_HOME: /var/lib/external-connector
-      CONNECTOR_PROVIDER: vendor_cli
-      CONNECTOR_COMMAND_TIMEOUT_SECONDS: 120
-    volumes:
-      - external-connector-state:/var/lib/external-connector
-```
-
-For the current `create-instance.sh`-style deployment, the implementation adds:
-
- `docker-compose.external-connectors.yml` for local/development sidecar tests.
- documentation for attaching `external-connector` to the same Docker network as the target app instance.
- instance environment `EXTERNAL_CONNECTOR_BASE_URL=http://external-connector:8787`.
- instance environment `EXTERNAL_CONNECTOR_TOKEN=<service-level shared secret>`.
-
-The implementation must not depend on Beaver mounting `/var/run/docker.sock`.
-
-## Sidecar HTTP API
-
-All sidecar requests and responses are JSON. The sidecar listens on port `8787`.
-
-```text
-GET  /health
-GET  /connectors
-POST /connector-sessions
-GET  /connector-sessions/{session_id}
-POST /connector-sessions/{session_id}/cancel
-POST /connections/{connection_id}/logout
-POST /send
-```
-
-`GET /connectors` returns:
-
-```json
-[
-  {
-    "kind": "weixin",
-    "displayName": "Weixin",
-    "authType": "qr",
-    "providerId": "vendor_cli",
-    "capabilities": ["receive_text", "send_text", "receive_media", "direct_messages"]
-  },
-  {
-    "kind": "feishu",
-    "displayName": "Feishu/Lark",
-    "authType": "plugin_install",
-    "providerId": "vendor_cli",
-    "capabilities": ["receive_text", "send_text", "receive_media", "groups"]
-  }
-]
-```
-
-`POST /connector-sessions` request:
-
-```json
-{
-  "kind": "weixin",
-  "connectionId": "conn_...",
-  "channelId": "weixin-main",
-  "displayName": "Weixin Main",
-  "callbackBaseUrl": "http://app-instance:8080",
-  "options": {}
-}
-```
-
-The sidecar authenticates the connector-session request with `Authorization: Bearer <EXTERNAL_CONNECTOR_TOKEN>`. It already has `BEAVER_BRIDGE_TOKEN` from its environment, so Beaver does not send bridge tokens in connector-session bodies.
-
-For Feishu/Lark, `kind` is `feishu` and `options` may include `domain`, `mode`, and optional app credentials when linking an existing bot. If using the official plugin installer to create a bot, the sidecar starts that installer flow and reports QR, instruction, or action status back to Beaver.
-
-`GET /connector-sessions/{session_id}` response:
-
-```json
-{
-  "sessionId": "cs_...",
-  "kind": "weixin",
-  "status": "qr_ready",
-  "qrCode": "weixin://...",
-  "qrImage": "data:image/png;base64,...",
-  "instructions": [],
-  "accountId": null,
-  "displayName": null,
-  "error": null,
-  "metadata": {}
-}
-```
-
-Allowed connector session statuses:
-
- `pending`
- `qr_ready`
- `scanned`
- `confirmed`
- `installing`
- `waiting_for_user`
- `connected`
- `expired`
- `error`
- `cancelled`
-
-`POST /send` request:
-
-```json
-{
-  "requestId": "out_...",
-  "connectionId": "conn_...",
-  "channelId": "weixin-main",
-  "kind": "weixin",
-  "target": {
-    "peerId": "wx_user_or_chat_id",
-    "peerType": "dm",
-    "threadId": null
-  },
-  "content": "reply text",
-  "metadata": {
-    "contextToken": "optional"
-  }
-}
-```
-
-`requestId` is required. Beaver must generate a stable request id for each outbound delivery and must reuse the same `requestId` whenever that delivery is retried. The first-version rule is:
-
-```text
-out_{channel}:{session_id}:{message_id or sha256(content + inbound_message_id + peer_id + finish_reason)}
-```
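The rule above could be computed as in this sketch. The function name, the UTF-8 encoding, and the separator-free concatenation inside the hash are assumptions; the spec gives only the template string.

```python
import hashlib
from typing import Optional


def outbound_request_id(channel: str, session_id: str, content: str,
                        inbound_message_id: str, peer_id: str,
                        finish_reason: str,
                        message_id: Optional[str] = None) -> str:
    """Build a request id that stays the same across retries of one delivery."""
    if message_id:
        tail = message_id
    else:
        # Fall back to a digest of the fields named in the rule.
        material = content + inbound_message_id + peer_id + finish_reason
        tail = hashlib.sha256(material.encode("utf-8")).hexdigest()
    return f"out_{channel}:{session_id}:{tail}"
```

Because the id is derived only from properties of the delivery, a retry recomputes the same value and the sidecar can recognise it as a duplicate.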
-
-The sidecar dedupes `connectionId + requestId`:
-
- `completed`: return the original send result and do not send a second platform message.
- `processing` updated less than 60 seconds ago: return `409 Conflict` with `{"retryAfterSeconds": 5}` so Beaver retries later.
- `processing` updated 60 seconds or more ago: treat as stale and retry the provider send.
-
-## Beaver Bridge API
-
-Add a backend bridge endpoint for sidecar inbound messages:
-
-```text
-POST /api/channel-connector-bridge/events
-```
-
-The sidecar must authenticate every bridge request using the service-level bearer token from `BEAVER_BRIDGE_TOKEN`. Beaver rejects missing or invalid bridge tokens. Bridge tokens are deployment secrets, not connection records.
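A bearer-token check for the bridge endpoint might be written as follows. `bridge_token_is_valid` is a hypothetical helper; how the header reaches it depends on the web framework, which this spec does not name. `hmac.compare_digest` is used so that the comparison does not leak timing information.

```python
import hmac
from typing import Optional


def bridge_token_is_valid(authorization_header: Optional[str],
                          expected_token: str) -> bool:
    """Check an Authorization header against the service-level bridge token."""
    if not authorization_header or not expected_token:
        return False  # missing header or unconfigured token: reject
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not presented:
        return False
    return hmac.compare_digest(presented.encode("utf-8"),
                               expected_token.encode("utf-8"))
```

Rejecting when the expected token is empty avoids accepting every request in a deployment where `BEAVER_BRIDGE_TOKEN` was never set.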
-
-Bridge event body:
-
-```json
-{
-  "eventId": "provider-event-id",
-  "timestamp": "2026-06-02T09:30:00Z",
-  "deliveryAttempt": 1,
-  "connectionId": "conn_...",
-  "channelId": "weixin-main",
-  "kind": "weixin",
-  "accountId": "weixin:...",
-  "peerId": "wx_user_or_chat_id",
-  "peerType": "dm",
-  "userId": "wx_sender",
-  "threadId": null,
-  "messageId": "platform-message-id",
-  "messageType": "text",
-  "content": "hello",
-  "metadata": {
-    "contextToken": "optional"
-  }
-}
-```
-
-The bridge endpoint must:
-
-1. validate bearer token
-2. load `ChannelConnection`
-3. reject unknown or revoked connections
-4. dedupe by `connectionId + eventId` through `MessageDedupeStore`
-5. construct `ChannelIdentity`
-6. construct `InboundMessage`
-7. call `ChannelRuntime.accept_inbound()`
-8. mark bridge event completed or failed
-
-## MessageDedupeStore
-
-Add a JSON-backed `MessageDedupeStore` under:
-
-```text
-<workspace>/state/channel_connections/message_dedupe.json
-```
-
-It stores:
-
-```python
-@dataclass
-class ConnectorMessageDedupeRecord:
-    dedupe_key: str
-    connection_id: str
-    event_id: str
-    status: str
-    first_seen_at: str
-    updated_at: str
-    delivery_attempts: int
-    message_id: str | None
-    last_error: str | None
-```
-
-`status` values:
-
- `processing`
- `completed`
- `failed`
-
-Duplicate handling:
-
- `completed`: return idempotent success and do not call `ChannelRuntime.accept_inbound()` again.
- `processing` updated less than 60 seconds ago: return `409 Conflict` with `{"retryAfterSeconds": 5}` so the sidecar retries later.
- `processing` updated 60 seconds or more ago: treat the record as stale, increment `delivery_attempts`, update `updated_at`, and reprocess the event.
- `failed`: allow reprocessing on the next delivery attempt, increment `delivery_attempts`, and clear `last_error` before calling runtime.
-
-This store is separate from runtime session dedupe. Runtime dedupe still protects platform message identity, while bridge dedupe protects connector retries.
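
The duplicate-handling rules can be read as one decision function. The sketch below is an illustration under stated assumptions: `DedupeRecord` is a trimmed, hypothetical subset of `ConnectorMessageDedupeRecord`, `decide` and its return labels are invented names, timestamps are assumed to be ISO 8601 strings with an explicit offset, and moving a `failed` record back to `processing` is an assumption the spec does not state.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

PROCESSING_TTL = timedelta(seconds=60)


@dataclass
class DedupeRecord:  # trimmed, hypothetical subset of ConnectorMessageDedupeRecord
    status: str
    updated_at: str  # e.g. "2026-06-02T09:30:00+00:00"
    delivery_attempts: int = 1
    last_error: Optional[str] = None


def decide(record: Optional[DedupeRecord], now: datetime) -> str:
    """Return "process", "skip", or "conflict"; mutates the record on reprocess."""
    if record is None:
        return "process"  # first delivery; the caller creates a `processing` record
    if record.status == "completed":
        return "skip"  # idempotent success, runtime is not called again
    if record.status == "processing":
        age = now - datetime.fromisoformat(record.updated_at)
        if age < PROCESSING_TTL:
            return "conflict"  # maps to 409 with {"retryAfterSeconds": 5}
    elif record.status == "failed":
        record.last_error = None
    # Stale `processing` or `failed`: count the attempt and reprocess.
    record.status = "processing"  # assumption: failed records re-enter `processing`
    record.delivery_attempts += 1
    record.updated_at = now.isoformat()
    return "process"


# Usage: a `processing` record that is exactly 60 seconds old counts as stale.
now = datetime(2026, 6, 2, 9, 31, tzinfo=timezone.utc)
stale = DedupeRecord("processing", "2026-06-02T09:30:00+00:00")
outcome = decide(stale, now)
```

The comparison is `age < PROCESSING_TTL`, so the boundary follows the wording "60 seconds or more ago" for stale records.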
-
-## Beaver Connectors
-
-### WeixinConnector
-
-Responsibilities:
-
- discover sidecar health
- start Weixin connector session through sidecar `/connector-sessions`
- poll sidecar connector session status
- create or update `ChannelConnection`
- store sidecar connection state reference in `CredentialStore` when the provider returns one
- validate by checking sidecar connection status
- materialize runtime config for `ExternalConnectorChannel`
- activate runtime via `ChannelRuntime.add_channel()` when connected
- revoke/logout by calling sidecar `/connections/{connection_id}/logout`
- deactivate runtime via `ChannelRuntime.remove_channel()` on logout/revoke
-
-### FeishuConnector
-
-Responsibilities:
-
- discover sidecar health
- start Feishu/Lark plugin install/link connector session
- optionally pass appId/appSecret/domain/mode for existing bot linking
- poll installer/session status
- create or update `ChannelConnection`
- validate by sidecar session or connection status
- materialize runtime config for `ExternalConnectorChannel`
- activate runtime via `ChannelRuntime.add_channel()` when connected
- revoke/remove plugin connection by calling sidecar logout/remove API
- deactivate runtime via `ChannelRuntime.remove_channel()` on logout/revoke
-
-Feishu is sidecar-backed in this design because the Feishu article supplied by the user describes the official plugin flow, not just a static bot-credential runtime adapter.
-
-## Frontend
-
-Replace the old static Weixin and Feishu fields with connector-driven UI:
-
- fetch `GET /api/channel-connectors`
- show Telegram, Weixin, and Feishu/Lark as connector options
- for Weixin:
-  - start connector session
-  - show QR image
-  - poll status until connected/expired/error
-  - show connected account and logout
- for Feishu/Lark:
-  - choose create bot or link existing bot
-  - collect domain and optional app credentials
-  - start sidecar connector session
-  - show QR/instructions/status returned by sidecar
-  - show connected account and logout
-
-The old `/api/channels` static config editor may remain for advanced runtime config, but connector onboarding should not rely on manual JSON editing or direct token entry for Weixin/Feishu.
-
-## Error Handling
-
- Sidecar unavailable: show connector as `unavailable`; do not create a running connection.
- Provider install command fails: status `error`, with redacted stderr summary.
- QR expired: status `expired`, user can start a new connector session.
- Bridge token invalid: reject with `401` and record the event without platform secret values.
- Unknown connection id in bridge event: reject with `404`.
- Duplicate completed bridge event: return idempotent success and do not call runtime again.
- Duplicate in-flight bridge event: return `409 Conflict` until the 60-second processing TTL expires, then allow one reprocess.
- Outbound send failure: mark outbound delivery failed and record connector error.
- Duplicate completed outbound send `requestId`: sidecar returns the original send result and does not send a second platform message.
- Duplicate in-flight outbound send `requestId`: sidecar returns `409 Conflict` until the 60-second processing TTL expires, then allows one retry.
- Sidecar restart: persisted provider state should survive through sidecar volume.
-
-## Security
-
- Beaver never logs raw tokens, app secrets, bridge tokens, or sidecar connection tokens.
- Bridge authentication uses a service-level token from environment variables. It is not stored per connection and is never returned by APIs.
- Sidecar can only call bridge endpoints with the service-level bridge token.
- Beaver can only call sidecar control and send endpoints with the service-level connector token.
- Sidecar state volume contains login state and must be treated as sensitive.
- Vendor command strings are deployment configuration, not user input.
- Feishu user-identity mode has stronger privacy risk than bot-identity mode; UI must label it clearly if exposed.
-
-## Testing
-
-Backend unit tests:
-
- sidecar client fake for Weixin connector session start/status/logout/send
- sidecar client fake for Feishu connector session start/status/logout/send
- `ExternalConnectorChannel.send()` target mapping
- `ExternalConnectorChannel.send()` includes stable `requestId` and connector bearer auth
- `ChannelRuntime.add_channel()` dynamically starts and registers a channel
- `ChannelRuntime.add_channel()` no-ops for identical config, replaces changed config, and keeps the old channel if replacement start fails
- `ChannelRuntime.remove_channel()` stops and unregisters a channel
- bridge endpoint accepts valid events
- bridge endpoint rejects invalid token and unknown connection id
- bridge endpoint dedupes repeated `eventId` and calls runtime once
- bridge endpoint returns `409 Conflict` for non-stale `processing` duplicates and reprocesses stale records
- registry lists `telegram`, `weixin`, and `feishu`
- materialized sidecar connections produce `ChannelConfig(kind="external_connector", mode="http")` compatible with runtime factory
-
-Sidecar tests:
-
- HTTP API shape for health/connectors/connector-sessions/send
- fake provider status transitions
- provider command runner error redaction
- send idempotency for duplicate `connectionId + requestId`
- send `processing` TTL returns `409 Conflict` before stale retry
-
-Frontend tests:
-
- Weixin connector option opens QR modal
- polling reaches connected state
- expired/error states are visible
- Feishu flow starts install/link and shows returned instructions/status
-
-Manual verification:
-
- Build app and sidecar Docker images.
- Start docker-compose sidecar setup.
- In `terminaltest`, open Weixin connector, scan QR, observe connected status without restarting Beaver.
- Send a Weixin text message and verify Beaver receives it once.
- Force sidecar retry of the same event and verify Beaver does not produce a duplicate agent reply.
- Send a Beaver reply and verify sidecar `/send` path.
- Start Feishu connector flow using the official Feishu/Lark plugin install path and verify the vendor-provided start command.
-
-## Rollout
-
-Implement in this order:
-
-1. Sidecar HTTP contract with fake provider.
-2. `MessageDedupeStore`.
-3. Beaver `ExternalConnectorChannel` and bridge endpoint.
-4. `ChannelRuntime.add_channel()` and `ChannelRuntime.remove_channel()`.
-5. Weixin connector against fake sidecar client.
-6. Feishu connector against fake sidecar client.
-7. Frontend connector UI.
-8. Production `VendorCliProvider` that shells out to real vendor CLI/plugin commands.
-9. Docker build/compose integration.
-10. Manual live verification.
-
-The fake provider is test-only. The production provider must use the real vendor CLI/plugin commands for Weixin and Feishu/Lark; the fake provider only makes Beaver and frontend tests deterministic while the live provider handles non-deterministic external login and install flows.
--- a/docs/superpowers/specs/2026-06-04-auto-accept-on-new-topic-design.md
+++ b/docs/superpowers/specs/2026-06-04-auto-accept-on-new-topic-design.md
@ -1,60 +0,0 @@
-# Auto-Accept Task When a New Topic Starts
-
-## Goal
-
-Prevent unrelated follow-up conversation from being appended to the previous
-Task. When the Intent Agent decides that the user's current message starts a
-new topic, Beaver should silently accept the previous Task before processing
-the current message.
-
-## User Experience
-
- No confirmation dialog or extra assistant message is shown.
- A related follow-up or requested change continues the existing Task.
- An unrelated lightweight message is handled as `simple_chat`.
- Unrelated work that needs Task capabilities is handled as `new_task`.
- Before either new-topic path continues, the previous Task is formally
-  accepted.
-
-## Routing Rules
-
-The existing Intent Agent actions remain unchanged:
-
- `continue_task` and `revise_task` belong to the active Task.
- `close_task` and `abandon_task` keep their existing explicit semantics.
- With an active Task, `simple_chat` means an unrelated lightweight new topic.
- With an active Task, `new_task` means unrelated work that needs a separate
-  Task.
-
-The Intent Agent guidance must explicitly distinguish unrelated lightweight
-conversation from revisions. A message must not be classified as
-`revise_task` merely because an active Task is awaiting acceptance.
-
-## State Transition
-
-Before processing a `simple_chat` or `new_task` decision:
-
-1. Check whether the active Task is `awaiting_acceptance`.
-2. Find its latest completed run.
-3. Record a normal `accept` acceptance against that run.
-4. Continue processing the current message using the original routing
-   decision.
-
-The normal acceptance path must be reused so that the Task becomes `closed`,
-`final_accepted_run_id` is recorded, acceptance events are persisted, run
-memory is updated, and skill-learning candidates can be generated.
-
-Tasks without an acceptance-eligible completed run are left unchanged. Router
-failures retain the existing conservative `continue_task` fallback and must
-not auto-accept a Task.
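
A minimal sketch of the transition above, assuming tasks and runs are plain dicts. The field names (`status`, `runs`, `completed_at`) and the `record_acceptance` callback are hypothetical stand-ins for the normal acceptance path, not Beaver's real API.

```python
def maybe_auto_accept(active_task, decision, record_acceptance):
    """Return the accepted run id, or None when nothing was auto-accepted."""
    # Only the two new-topic decisions trigger auto-accept. A router failure
    # falls back to `continue_task`, so it never reaches the acceptance call.
    if decision not in ("simple_chat", "new_task"):
        return None
    # 1. The active Task must be awaiting acceptance.
    if active_task is None or active_task["status"] != "awaiting_acceptance":
        return None
    # 2. Find its latest completed run; leave the Task unchanged if none exists.
    completed = [run for run in active_task["runs"] if run["status"] == "completed"]
    if not completed:
        return None
    latest = max(completed, key=lambda run: run["completed_at"])
    # 3. Reuse the normal acceptance path so closing, events, memory updates,
    #    and skill-learning candidates all happen as usual.
    record_acceptance(active_task["id"], latest["id"], "accept")
    # 4. The caller then continues with the original routing decision.
    return latest["id"]
```

Returning the run id rather than a boolean lets the caller log which run became the final accepted run without a second lookup.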
-
-## Testing
-
-Backend tests must cover:
-
- An unrelated `simple_chat` message accepts the previous Task and is not
-  appended as another Task run.
- A `new_task` decision accepts the previous Task and creates a separate Task.
- `continue_task` and `revise_task` do not auto-accept the active Task.
- Router failure fallback does not auto-accept the active Task.
- Auto-accept records the final accepted run and normal acceptance history.
--- a/docs/superpowers/specs/2026-06-04-chat-task-timeline-consistency-design.md
+++ b/docs/superpowers/specs/2026-06-04-chat-task-timeline-consistency-design.md
@ -1,59 +0,0 @@
-# Chat Current-Task Timeline Consistency
-
-## Goal
-
-Make the chat page's current-session progress panel show the same timeline
-content as the active Task's detail page.
-
-## Visibility
-
- Show the chat-side timeline only while the current session has an active
-  Task.
- Hide the panel immediately when the Task is accepted, auto-accepted,
-  abandoned, closed, or when the user switches sessions.
- Do not show the most recently completed Task after it is no longer active.
-
-## Shared Data Model
-
-The Task detail page remains the canonical timeline behavior.
-
-Both surfaces must:
-
-1. Load the full `BackendTask` payload from `/api/tasks/{task_id}`.
-2. Combine the task's persisted process data with matching live process data.
-3. Use one shared task-process filtering helper.
-4. Build cards with `buildTaskTimelineCards()`.
-5. Render cards with `TaskTimeline`.
-
-This keeps card types, ordering, fallback milestones, result history,
-acceptance history, tool status, and deduplication consistent.
-
-## Chat Panel
-
-`CurrentSessionProgressSidebar` becomes a responsive wrapper around
-`TaskTimeline`.
-
- Desktop keeps the existing right sidebar.
- Smaller viewports keep the existing floating open button and drawer.
- The panel title remains "当前会话的运行进度" ("Run progress for the current session").
- Timeline cards match the Task detail timeline.
- Chat does not render duplicate acceptance controls inside the sidebar,
-  because acceptance controls already exist on chat result messages.
-
-## Data Refresh
-
- Whenever the active Task changes, the chat page loads its full Task detail.
- Existing message, process, feedback, and WebSocket refresh paths reload both
-  the active Task identity and its full detail.
- If the active-task endpoint returns `null`, the cached active Task detail is
-  cleared immediately and the sidebar disappears.
- A task-detail load failure hides the sidebar rather than showing stale data.
-
-## Testing
-
- Shared process filtering returns the same task-scoped runs, events, and
-  artifacts for both surfaces.
- The chat-side timeline cards are produced by `buildTaskTimelineCards()`.
- No active Task produces no chat-side timeline.
- Switching to a closed/no-active Task clears the chat-side timeline.
- Frontend unit tests, typecheck, and production build pass.
--- a/docs/superpowers/specs/2026-06-08-skill-replay-eval-design.md
+++ b/docs/superpowers/specs/2026-06-08-skill-replay-eval-design.md
@ -1,219 +0,0 @@
-# Skill Replay Eval Design
-
-Related product planning artifacts:
-
- [Product Discovery Report](../../product-discovery/skill-replay-eval/product-discovery-report.md)
- [PRD](../../product-discovery/skill-replay-eval/PRD-skill-replay-eval.md)
- [Launch And Maintenance Runbook](../../product-discovery/skill-replay-eval/launch-maintenance-runbook.md)
-
-## Goal
-
-Improve skill draft evaluation so it measures real task behavior instead of relying on heuristic draft scoring. The new evaluation must cover every tool involved in a skill, while separating tools that can be executed safely from tools that require LLM surrogate judgment.
-
-This design also fixes revision draft generation dropping important content from the original skill by making base skill preservation an explicit contract.
-
-## Current State
-
-`SkillDraftEvaluator` currently builds a lightweight report from `candidate.source_run_ids`. It scores each historical run from `validation_result.score` or success fallback, then estimates candidate score from draft text. It does not replay the task, does not execute tools, and does not compare old skill behavior with draft skill behavior.
-
-`SkillDraftSynthesizer` currently receives candidate reason, related skill names, tool names, task summaries, and session excerpts. For revision and merge drafts, it does not receive the full base skill frontmatter and body, so generated drafts can accidentally omit important original instructions.
-
-## Design Principles
-
- All tools are part of evaluation coverage.
- Safe tools execute in an isolated replay environment.
- Unsafe or unavailable tools are not ignored; they are evaluated through an LLM surrogate using intended tool calls, schema, arguments, historical evidence, and expected effects.
- Evaluation reports must disclose execution coverage and surrogate coverage separately.
- Revision drafts must preserve original skill content unless a change is explicitly justified.
- Replay runs must not write to production workspace, user files, memory, third-party accounts, or external systems by default.
-
-## Evaluation Model
-
-Each draft eval selects up to 10 historical cases. If fewer than 10 eligible cases exist, use as many as available. If more than 10 exist, select the 10 most relevant cases.
-
-For `revise_skill`, select accepted historical runs that activated the target skill/version. Prefer recent accepted runs, then diversify by task and session.
-
-For `new_skill`, select candidate source runs and accepted runs with similar task themes.
-
-For `merge_skills`, select accepted runs where the related skills co-activated.
-
-Each case runs two arms:
-
- Baseline arm: no skill for `new_skill`, old skill for `revise_skill`, or old related skills for `merge_skills`.
- Candidate arm: draft skill injected as pinned draft guidance.
-
-Both arms use the same task text, same bounded historical context, same model settings, same max tool iterations, and same replay policy.
-
-## Tool Execution Modes
-
-Each tool call in replay resolves to one of these modes:
-
- `executed`: Tool was safely executed in replay environment.
- `surrogate`: Tool was not executed, but the intended call and expected effect were evaluated by LLM.
- `blocked`: Tool could not be executed or judged reliably.
-
-The goal is not to exclude third-party tools. It is to include them with the strongest safe evaluation method available.
-
-Examples:
-
- Filesystem reads and writes run against a temporary workspace clone.
- User file writes run against a temporary user-file namespace when available.
- Web/search reads can execute and cache outputs.
- Email/calendar/message sending to production systems does not execute by default. The replay records the intended call and evaluates it through surrogate judgment unless a sandbox/test connector is configured.
- Destructive actions such as delete, payment, permission changes, or irreversible external writes default to surrogate or blocked.
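
The mode resolution and the separate coverage figures could be sketched as follows. The tool kind labels, `resolve_mode`, `coverage`, and the `sandbox_connector` and `can_judge` flags are hypothetical names chosen for illustration; the spec does not define how tools are classified internally.

```python
from collections import Counter

# Hypothetical tool kinds that can run against temporary replay state.
SAFE_KINDS = {"filesystem_read", "filesystem_write", "user_file_write", "web_read"}
MODES = ("executed", "surrogate", "blocked")


def resolve_mode(kind, *, sandbox_connector=False, can_judge=True):
    """Pick the strongest safe evaluation method for one tool call."""
    if kind in SAFE_KINDS:
        return "executed"
    if kind == "external_send" and sandbox_connector:
        return "executed"  # a sandbox/test connector is configured
    # Everything else (production sends, destructive actions) is never executed.
    return "surrogate" if can_judge else "blocked"


def coverage(modes):
    """Fraction of tool calls per mode, reported separately."""
    counts = Counter(modes)
    total = len(modes)
    return {mode: (counts[mode] / total if total else 0.0) for mode in MODES}
```

Keeping the three fractions separate, rather than folding them into one score, is what lets the report disclose execution coverage and surrogate coverage independently.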
-
-## Replay Environment
-
-The replay runner creates isolated state per case and arm:
-
- Temporary session id.
- Temporary workspace root.
- Temporary task id or replay id.
- Tool call trace.
- Output artifacts.
- Side-effect journal.
- Captured final answer.
-
-This follows the OfficeBench MCP pattern: run tools in an isolated testbed where possible, pull artifacts and state after execution, then evaluate outside the runner. Beaver should reuse this shape without depending on OfficeBench's fixed benchmark functions.
-
-## Surrogate Evaluation
-
-When a tool cannot be safely executed, the agent should still be allowed to plan or attempt the tool call. The replay layer records:
-
- Tool name.
- Tool schema.
- Arguments.
- Tool classification reason.
- Historical accepted evidence.
- Expected side effect inferred from the task.
- Any assistant rationale around the call.
-
-The surrogate evaluator compares baseline and candidate intended effects. It scores whether the intended tool use would satisfy the task, whether arguments are complete and correct, and whether the call is risky, missing, duplicated, or unnecessary.
-
-Surrogate scoring contributes to the final candidate score, but lowers confidence compared with real execution.
-
-## Scoring
-
-Each case produces:
-
- `baseline_score`
- `candidate_score`
- `delta`
- `execution_coverage`
- `surrogate_coverage`
- `blocked_tool_count`
- `confidence`
- `tool_calls`
- `artifacts`
- `side_effects`
- `validator_notes`
-
-The draft report aggregates:
-
- Baseline mean.
- Candidate mean.
- Score delta.
- Improved count.
- Regression count.
- Unchanged count.
- Execution coverage.
- Surrogate coverage.
- Blocked coverage.
- Confidence.
-
-Publish gates should consider both score and confidence. A passing score with low confidence should require stronger human review, not automatic trust.
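
As a rough sketch of the aggregation and the score-plus-confidence gate: `aggregate`, `review_level`, both thresholds, and the rule that report confidence is the mean of per-case confidence are assumptions made for this example, not values taken from the spec. The case list is assumed to be non-empty.

```python
from statistics import mean


def aggregate(cases):
    """Roll per-case results up into the draft-level figures."""
    deltas = [c["candidate_score"] - c["baseline_score"] for c in cases]
    baseline_mean = mean(c["baseline_score"] for c in cases)
    candidate_mean = mean(c["candidate_score"] for c in cases)
    return {
        "baseline_mean": baseline_mean,
        "candidate_mean": candidate_mean,
        "score_delta": candidate_mean - baseline_mean,
        "improved_count": sum(d > 0 for d in deltas),
        "regression_count": sum(d < 0 for d in deltas),
        "unchanged_count": sum(d == 0 for d in deltas),
        # Assumption: report confidence is the mean of per-case confidence.
        "confidence": mean(c["confidence"] for c in cases),
    }


def review_level(report, *, min_delta=0.0, min_confidence=0.8):
    """Gate on score and confidence together (thresholds are illustrative)."""
    if report["score_delta"] <= min_delta:
        return "failed"
    if report["confidence"] < min_confidence:
        return "stronger_review"  # passing score, low confidence
    return "standard_review"
```

A passing delta with low confidence lands in `stronger_review` instead of passing outright, which mirrors the rule that low confidence should raise the review bar rather than grant automatic trust.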
-
-## Draft Preservation
-
-Revision and merge synthesis must include base skill snapshots:
-
- Base skill name.
- Base version.
- Full base frontmatter.
- Full base content.
- Tool hints.
- Current published summary.
-
-The synthesis prompt must require the model to preserve existing instructions unless it explicitly changes them. The output remains a full proposed skill body, but it should also include:
-
- `preserved_sections`
- `changed_sections`
- `dropped_sections`
- `change_reason`
-
-After generation, a preservation checker compares base content and draft content. If critical sections disappear without explanation, the draft eval should mark preservation risk and require revision before approval.
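
One simple way to approximate the preservation check is to compare section headings. This sketch assumes skill bodies are Markdown and treats every heading as a "critical section"; both are assumptions, and `section_titles` and `preservation_risk` are hypothetical names.

```python
import re

HEADING = re.compile(r"^#{1,6}[ \t]+(.+)$", flags=re.MULTILINE)


def section_titles(markdown):
    """Return Markdown heading titles in document order."""
    return [match.group(1).strip() for match in HEADING.finditer(markdown)]


def preservation_risk(base, draft, dropped_sections=()):
    """List base sections missing from the draft without an explanation.

    `dropped_sections` is the list the synthesis output declares; any base
    heading that is absent from the draft and not declared there is a risk.
    """
    draft_titles = set(section_titles(draft))
    explained = set(dropped_sections)
    return [
        title
        for title in section_titles(base)
        if title not in draft_titles and title not in explained
    ]
```

A non-empty result would mark preservation risk on the draft eval. A real checker would likely also compare section bodies, since a heading can survive while its instructions are dropped.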
-
-## API And Storage
-
-The existing `SkillDraftEvalReport` should be extended rather than replaced.
-
-Add fields for:
-
- `eval_version`
- `mode`, with values such as `heuristic` or `replay`
- `execution_coverage`
- `surrogate_coverage`
- `blocked_coverage`
- `confidence`
- `case_reports`
- `tool_mode_summary`
- `preservation_report`
-
-The existing simple fields remain for UI compatibility: `passed`, `baseline_score_avg`, `candidate_score_avg`, `score_delta`, `regression_count`, `improved_count`, `unchanged_count`, `cases`, and `status`.
-
-## UI
-
-The Skills draft review page should continue to show a concise summary first:
-
- Passed or failed.
- Baseline mean.
- Candidate mean.
- Delta.
- Execution coverage.
- Surrogate coverage.
- Confidence.
-
-Detailed sections show:
-
- Replay cases.
- Tool calls by mode.
- Blocked or surrogate reasons.
- Artifacts and side effects.
- Preservation report for revision drafts.
- Raw eval payload.
-
-The user should not need to configure per-tool policies for normal use. The report should explain coverage and uncertainty after the fact.
-
-## Error Handling
-
-If replay infrastructure fails before any case runs, eval status is `replay_error` and the draft cannot rely on a replay pass.
-
-If some cases fail but others complete, eval status is `partial` and confidence is reduced.
-
-If a provider is unavailable, keep the current skipped-provider behavior but mark the report as having no replay evidence.
-
-If all important tool calls become `blocked`, the draft should not pass automatically even if surrogate scoring is high.
-
-## Testing
-
-Unit tests should cover:
-
- Historical case selection for `new_skill`, `revise_skill`, and `merge_skills` candidates.
- Baseline and candidate arm construction.
- Tool mode classification and aggregation.
- Surrogate scoring payload construction.
- Preservation checker behavior.
- Publish gate behavior for low-confidence or blocked reports.
-
-Integration-style tests should use stub tools:
-
- A safe filesystem write tool that writes to temp workspace.
- An external write tool that is intercepted into surrogate mode.
- A mixed case where candidate improves one real artifact and one surrogate side effect.
-
-## Out Of Scope
-
- Real production third-party writes during automatic replay.
- Full Docker orchestration for all Beaver replay cases in the first implementation.
- Per-tool user policy UI.
- Replacing human review. Replay improves evidence but does not remove review gates.
--- a/docs/ui-ux/pages/auth-login.md
+++ b/docs/ui-ux/pages/auth-login.md
@ -1,220 +0,0 @@
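-# Auth Portal: Login Page UI/UX
-
-## 1. Page Definition
-
-| Item | Details |
-| --- | --- |
-| Page name | Auth portal login page |
-| Actual page route | `/login?next=<target path>` |
-| Page implementation | `auth-portal/src/app/login/page.tsx` |
-| Style implementation | `auth-portal/src/app/globals.css` |
-| API client | `auth-portal/src/lib/auth-client.ts` |
-| Relationship to main app entry | The main app's `/login` only shows a redirect notice and redirects to the auth portal; it does not host the login form |
-| Core task | Enter a username and password, then return to the target workspace page through handoff after authentication |
-| Test status | Fixes completed and retest passed; across all tested viewports there is no horizontal overflow, no undersized click target, and no first-screen control outside the viewport |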
-
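-## 2. Information Architecture and Component Hierarchy
-
-```text
-portal-page
-├── portal-toolbar                         z-index: 10, top-right corner
-│   └── LanguageSwitcher
-│       ├── Lang label
-│       ├── ZH button
-│       └── EN button
-└── auth-page                              full-screen background layer
-    └── portal-panel                       positioning container for the login card
-        └── auth-card.login-card           main login container, compact in both phone portrait and landscape
-            ├── Boardware Logo
-            ├── Page title
-            ├── auth-form
-            │   ├── Username field
-            │   │   ├── Visible label
-            │   │   ├── User icon
-            │   │   └── Username input
-            │   ├── Password field
-            │   │   ├── Visible label
-            │   │   ├── Lock icon
-            │   │   ├── Password input
-            │   │   └── Show/hide password button
-            │   ├── Error message area
-            │   └── Login submit button
-            ├── "or" divider
-            └── Sign-up prompt and sign-up link
-```
-
-### Layer Relationships
-
- The background image is the lowest visual layer and conveys the product brand and the feel of capabilities such as Agent, Memory, and Tools.
- The login card is the only primary task container, with a translucent white surface, a border, and a shadow.
- The language switcher sits outside the login card, absolutely positioned in the top-right corner of the page with `z-index: 10`.
- The form error sits between the password field and the login button and is inline feedback for the current submit action.
- The page has no modal, drawer, or secondary detail layer.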
-
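-## 3. Layout and Approximate Positions
-
-### Desktop and widescreen, wider than 920px
-
- The page fills the viewport; the background image is centered and uses `cover`.
- The login card sits on the right side of the page, vertically centered.
- `portal-panel` width uses `clamp(360px, 34vw, 560px)`.
- The whitespace on the right side of the page is controlled by `clamp(24px, 8vw, 128px)`.
- The language switcher is fixed in the top-right corner, about `20px` from the top and about `24px` from the right.
- Tested at `1365×900` and `1920×1080`: no horizontal overflow and no overlapping components.
-
-### Tablet and medium widths, 920px or narrower
-
- The login card switches from the right-side layout to horizontally centered, near the bottom of the page.
- The container's maximum width is about `520px`.
- The top of the page keeps space for the brand background.
- Tested at `768×1024` and `1024×768`: no horizontal overflow and no overlapping components.
-
-### Phone, 640px or narrower
-
- Left and right page margins are about `16px`.
- On short phone screens the login card switches to a compact rhythm, shrinking the logo, title, and form spacing.
- Input height is about `54px` with a `16px` font size, which avoids iOS auto-zoom on input.
- `320×568`, `375×667`, and `390×844` all show the main content fully on the first screen.
-
-### Phone landscape
-
- At the tested `844×390`, the compact landscape layout is used and the main controls sit fully inside the viewport.
- The page has no horizontal overflow and no double-layer scrolling.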
-
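-## 4. Page States
-
-| State | Current behavior | UX purpose | Test result |
-| --- | --- | --- | --- |
-| Initial | Shows the logo, title, empty username and password fields, login button, and sign-up entry | Clearly establishes the brand and the single primary task | Pass |
-| Typing | Inputs show the typed text; the password is masked by default | Prevents the password from being read by onlookers | Pass |
-| Password visible | After the eye button is clicked, the password input type switches to `text` and the button name changes to "Hide password" (隐藏密码) | Reduces password entry errors | Pass |
-| Browser required validation | Submitting an empty form is blocked by the browser, which focuses the username field | Avoids useless network requests | Pass |
-| Submitting | The login button is disabled and shows "Logging in..." (登录中...) | Prevents duplicate submits and signals that the request is in progress | Pass |
-| Login failed | A friendly error appears below the password field, the button becomes usable again, and the error area is announced and receives focus | Lets the user correct the credentials and retry | Pass |
-| Login succeeded | Builds `/handoff?code=...&next=...` and navigates with `location.replace` | Safely hands the authentication result to the target workspace and prevents going back to the submitted login page | Pass |
-| Language switch | ZH/EN immediately updates fields, errors, and helper copy, and writes to cookie/localStorage | Supports Chinese and English users and keeps the choice | Pass; persists after refresh |
-| Sign-up navigation | Navigates to `/register` and keeps the `next` parameter | The user still returns to the original target page after signing up | Pass |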
-
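-## 5. Actions and UX Logic
-
-| Action | Trigger | State change and feedback | UX purpose | Current result |
-| --- | --- | --- | --- | --- |
-| Switch to Chinese or English | Click ZH/EN in the top-right corner | The current language button is highlighted; page copy updates immediately; the choice persists after refresh | Lowers the cost of understanding the language | OK |
-| Focus username or password | Click or Tab | The input border, background, and outer shadow change | Makes the current input position clear | OK |
-| Enter username | Keyboard input | Controlled input updates; supports `autocomplete=username` | Reduces repeated typing | OK |
-| Enter password | Keyboard input | Masked by default; supports `autocomplete=current-password` | Protects sensitive information and supports password managers | OK |
-| Show or hide password | Click the eye icon | Input type toggles between `password` and `text`; the accessible name toggles with it | Helps the user verify the password | OK |
-| Submit empty form | Click submit or press Enter | Native browser required validation blocks the request and focuses the username field | Stops invalid actions early | OK |
-| Submit valid form | Click submit or press Enter in the password field | Clears the old error; disables the button; shows loading copy; sends the login request | Gives clear progress and prevents duplicate submits | OK |
-| Login failed | API returns a failure | Shows a localized error; the button becomes usable again; the error area has `role="alert"`/`aria-live` and receives focus | Supports correcting and retrying | OK |
-| Login succeeded | API returns a token and a handoff code | Uses `location.replace` to go to the target frontend handoff page, keeping `next` | Completes authentication and returns to the original task | OK |
-| Go to sign-up | Click the sign-up link | Goes to the sign-up page and keeps `next` | Gives users without an account a clear alternative path | OK |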
-
-## 6. Responsive Test Matrix
-
-Test date: 2026-06-04. Browser: Playwright Chromium.
-
-| Viewport | Horizontal overflow | Page vertical scroll | Card internal scroll | Card fully within viewport | Result |
-| --- | --- | --- | --- | --- | --- |
-| `320×568` | None | None | None | Yes | Pass |
-| `375×667` | None | None | None | Yes | Pass |
-| `390×844` | None | None | None | Yes | Pass |
-| `844×390` landscape | None | None | None | Yes | Pass |
-| `768×1024` | None | None | None | Yes | Pass |
-| `1024×768` | None | None | None | Yes | Pass |
-| `1365×900` | None | None | None | Yes | Pass |
-| `1920×1080` | None | None | None | Yes | Pass |
-
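-## 7. Accessibility and Touch Checks
-
-### Passed
-
- The page has one clear `h1`.
- The logo has `alt="Boardware logo"`.
- Both the username and password have a label associated with their input.
- Tab order matches the DOM and visual order: ZH → EN → username → password → show password → login → sign up.
- The show-password button has a dynamic accessible name.
- The username and password support browser autofill.
- The actual hit areas of the inputs, language buttons, show-password button, submit button, and sign-up link are all at least `44×44px`.
- The login button has a localized accessible name.
- The login failure error uses `role="alert"` and `aria-live`, and the error area is focused after a failure.
-
-### Still to Watch
-
- This round did not use a real screen reader for an end-to-end read-through; it was verified only through the DOM, focus, and Playwright accessibility information.
- The login background image can still be optimized with WebP/AVIF and responsive loading.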
-
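-## 8. Fixed Issues and Remaining Optimizations
-
-### Fixed: double vertical scrolling in phone landscape
-
- Reproduce: open the login page at `844×390`.
- What the user saw: the bottom of the card extended past the first screen; the page could scroll, and the card could also scroll internally.
- Impact: users could not easily tell whether to scroll the page or the card; scrolling only the card still did not reveal the sign-up link.
- Related implementation:
-  - `.auth-page` keeps top and bottom padding at `<=920px`.
-  - `.auth-card.login-card` sets both a `100vh`-based `max-height` and `overflow-y:auto`.
- Retest result: at `844×390` there is no horizontal overflow and no double-layer scrolling; both submit and the sign-up entry are reachable on the first screen.
-
-### Fixed: login button had no accessible name
-
- Reproduce: Tab to the login button with the keyboard, or inspect the accessibility tree.
- User impact: screen readers could only identify it as an unnamed button.
- Related implementation: the submit button rendered only an arrow SVG by default, and the SVG is `aria-hidden`.
- Retest result: the button has a localized `aria-label`, and the loading state still conveys "logging in".
-
-### Fixed: error feedback lacked accessible announcement and focus recovery
-
- Reproduce: submit incorrect credentials.
- Feedback at the time: showed "接口错误 401: 用户名或密码错误" ("API error 401: incorrect username or password"), with no `role="alert"`, no `aria-live`, and focus not moved to the error or a field after the failure.
- User impact: the technical error code made the message harder to understand; screen reader and keyboard users might not know the submit had failed.
- Retest result:
-  - User-facing copy does not expose the HTTP status or the "API error" prefix.
-  - The copy states the cause and how to recover, for example "Incorrect username or password. Please check and try again."
-  - The error uses `role="alert"`/`aria-live`, and the error summary is focused.
-
-### Fixed: click areas of key secondary actions were too small
-
- Scope: language buttons, show-password button, sign-up link.
- Retest result: the actual hit areas of the language buttons, show-password button, submit button, and sign-up link are all at least `44×44px`.
-
-### Remaining optimization: login background asset is large
-
- `login-background.png` is about `1.3MB`, and no WebP/AVIF or responsive sizes are currently provided.
- User impact: the first-screen background appears slowly on weak and mobile networks.
- Suggested acceptance criteria: provide WebP/AVIF and load a reasonable size per viewport; reserve the background space to avoid layout shift.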
-
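-## 9. Positive UX in the Current Implementation
-
- The page has a single primary action and a clear hierarchy.
- Desktop uses the background brand visual with a right-side card, keeping the primary task in focus.
- The phone input font size is `16px`, which avoids iOS auto-zoom.
- The submit button is disabled during the request, which prevents duplicate logins.
- The error area reserves a minimum height, so the appearance of an error does not noticeably push the following content.
- The `next` parameter is preserved correctly in both sign-up navigation and the successful handoff.
- The language choice persists after refresh.
- No tested viewport has horizontal overflow.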
-
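-## 10. Follow-up Acceptance Checklist
-
- [x] After fixing the landscape double scrolling, retest `844×390` and shorter viewports.
- [x] Add a localized accessible name to the submit button.
- [x] Improve the login failure copy, error announcement, and focus recovery.
- [x] Enlarge the touch areas of the language buttons, show-password button, and sign-up link.
- [x] Evaluate and implement persistently visible field labels.
- [ ] Optimize the login background image format and responsive loading.
- [ ] Verify dynamic address bar and `100vh` behavior in Safari, iOS Safari, and Android Chrome.
- [ ] Complete one end-to-end login test with a screen reader.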
-
-## 11. Test Evidence for This Round
-
- Automated results: `/tmp/beaver-login-qa-results.json`
- Screenshot directory: `/tmp/beaver-login-qa-shots`
- Temporary test script: `/tmp/beaver-ui-qa-tests/login-page-qa.spec.js`
- Test command:
-
-```bash
-./node_modules/.bin/playwright test login-page-qa.spec.js \
-  --config=/tmp/beaver-ui-qa-tests/pw.config.js \
-  --workers=1
-```