Beaver Product Discovery Report
Date: 2026-06-09
Product stage: existing product
Scope: the whole Beaver product, including deployment, runtime, UI, Agent execution, tasks, files, tools, skills, memory, connectors, scheduled work, governance, validation, launch, and maintenance.
Executive Summary
Beaver is an enterprise Agent sandbox and execution platform. Its product promise is to move AI from "chat that gives answers" to "controlled Agent work that creates deliverables, records evidence, asks for acceptance, and turns accepted work into reusable capability."
The strongest product wedge is not another chatbot UI. It is the full execution loop:
user request
-> task recognition
-> Agent/team execution
-> tool and file work
-> evidence timeline
-> user acceptance or revision
-> skill and memory learning
-> future reuse
The current codebase already supports major parts of this loop: multi-instance Docker deployment, auth portal, app instances, chat workbench, task center, task details, user acceptance, files, tools, skills, skill learning, marketplace, settings, connectors, scheduled jobs, and backend Agent team orchestration. The next product challenge is packaging these capabilities into a clear buyer story, validating the highest-value use cases, hardening operational reliability, and making governance understandable to non-engineer stakeholders.
Recommended product strategy:
- Position Beaver as "enterprise Agent execution and governance," not as a general AI chat app.
- Focus first on repeatable knowledge work that is high-frequency, cross-tool, evidence-sensitive, and review-heavy.
- Treat task acceptance, evidence, skills, and memory as the core product loop.
- Productize deployment and operations enough for pilots before broad feature expansion.
- Validate value through real workflows, not opinions about AI.
Product Summary
Product Description
Beaver is a private-deployable Agent workspace for teams that need AI to perform work, not only answer questions. A user can chat, upload files, trigger tasks, review execution evidence, accept or revise results, manage tools, install or publish skills, configure model providers, connect external systems, and run scheduled work.
Target Users
| Segment | Primary Need | Why Beaver Fits |
|---|---|---|
| Enterprise AI platform owner | Provide controlled Agent capability to teams | Private deployment, per-instance boundaries, tools, skills, governance |
| Knowledge workflow team | Finish recurring multi-step work faster | Task execution, files, tools, acceptance, scheduled work |
| Project / delivery team | Produce and revise deliverables with traceability | Task timeline, artifacts, evidence, revision loop |
| Engineering / support team | Use AI with files, commands, logs, and review | Tool execution, task evidence, multi-agent planning |
| Operations / admin | Configure models, users, connectors, and instances | Auth portal, deploy control, settings, status, logs |
| Skill owner / reviewer | Turn successful work into reusable methods | Skill candidates, drafts, safety/eval reports, review, publish |
Current Feature Map
| Domain | Current State | Product Meaning |
|---|---|---|
| Auth and onboarding | Auth portal, register/login, model provider onboarding | Users can enter a controlled workspace |
| Multi-instance deployment | Deploy control creates isolated app-instance containers; router proxy routes by host | Enables per-user or per-team sandboxing |
| Chat workbench | Conversations, attachments, task cards, current task progress, acceptance controls | Main user workspace |
| Task runtime | Auto task recognition, task creation, runs, timeline, status, acceptance | Converts chat into managed work |
| Agent execution | Unified engine, main agent, sub-agent/team execution, sequence/parallel/DAG coordinator | Handles complex work beyond one response |
| Tools | Built-in tools, MCP tools, tool management UI | Lets Agents act on files, web, terminal, integrations |
| Files | Workspace file browser, upload, preview, download, delete | Gives AI and users a shared working surface |
| Skills | Published skills, candidates, drafts, safety/eval, review, publish | Turns accepted work into reusable methods |
| Marketplace | Skill discovery/install flow | Foundation for capability distribution |
| Memory | Backend long-term memory foundation exists; product integration is still incomplete | Future compounding personalization and organization knowledge |
| Scheduled work | Cron jobs, notifications, scheduled task flows | Moves from reactive chat to proactive work |
| Connectors | Outlook and external connector architecture; Feishu/Weixin-related sidecar paths | Brings Agent into real business channels |
| Settings/status/logs | Provider config, agent config, channel config, runtime status, restart | Admin control and troubleshooting |
Current Value Proposition
For enterprise teams:
Beaver provides a private Agent workspace where AI work is executed, tracked, reviewed, and reused. It gives teams the speed of AI assistance with the control needed for real business workflows.
For product pilots:
Beaver is strongest when a team has recurring knowledge work that crosses files, tools, systems, and reviews.
Current Challenges
| Challenge | Why It Matters |
|---|---|
| Product breadth is large | Buyers may not understand what to adopt first |
| Memory is partly backend-ready but not fully productized | The "understands you better the more you use it" story needs visible user controls |
| Connector maturity varies by channel | Customer demos must avoid overpromising |
| Multi-instance deployment is powerful but operationally sensitive | Pilot success depends on stable setup and clear runbooks |
| Skill learning needs strong governance | Reuse can become risk if publishing is weak |
| Customer research is not yet captured | Current roadmap is inferred from implementation and product judgment |
User Segments
Segment 1: Enterprise AI Platform Owner
They need to safely introduce Agent capability into an organization. Their concern is not whether an LLM can answer a question; it is whether teams can use it without losing control of data, tools, cost, and quality.
Segment 2: Workflow Owner
They own a recurring process such as weekly reporting, project status, proposal drafting, research, operations follow-up, support triage, or document review. They want less manual coordination and more repeatable output.
Segment 3: Individual Knowledge Worker
They want one workspace where they can chat, upload files, run tools, generate artifacts, and continue a task until the output is usable.
Segment 4: Admin / Operator
They need to create instances, configure models, monitor status, debug logs, manage connectors, and keep deployment safe.
Segment 5: Skill Maintainer
They curate reusable skills, review drafts, evaluate safety, publish stable versions, and prevent low-quality automation from spreading.
JTBD
| User | Job Story | Current Alternative | Beaver Outcome |
|---|---|---|---|
| Platform owner | When teams ask for AI tools, I want a controlled Agent workspace so they can experiment without unmanaged SaaS sprawl | ChatGPT accounts, custom scripts, internal demos | Private, governed Agent workspace |
| Workflow owner | When a recurring process takes many manual steps, I want AI to execute and track it so my team can review the result | Manual docs, spreadsheets, Slack/email coordination | Task with timeline, artifacts, acceptance |
| Knowledge worker | When I ask AI to produce something, I want to revise and accept it as work, not just receive a message | Chat thread and copy/paste | Task lifecycle and deliverable loop |
| Admin | When a user registers, I want a workspace created and routed automatically so onboarding is repeatable | Manual container setup | Auth portal + deploy control + router proxy |
| Skill maintainer | When a task succeeds, I want to turn its method into a reusable skill so future tasks improve | Prompt docs, tribal knowledge | Skill candidate/draft/review/publish |
| Security reviewer | When Agents use tools, I want evidence and boundaries so I can audit behavior | Opaque model/tool calls | Tool traces, task evidence, instance sandbox |
Opportunity Areas
Opportunity scores are qualitative estimates from current docs and product context. They need validation with customer interviews and pilot data.
| Opportunity | Importance | Current Satisfaction | Opportunity Score | Notes |
|---|---|---|---|---|
| I need AI outputs to become reviewable tasks, not loose chat replies | 0.95 | 0.30 | 0.67 | Core wedge |
| I need evidence of what the Agent did | 0.90 | 0.35 | 0.59 | Governance driver |
| I need repeatable workflows to become reusable skills | 0.85 | 0.40 | 0.51 | Learning moat |
| I need private deployment and instance boundaries | 0.90 | 0.45 | 0.50 | Enterprise adoption |
| I need AI to work across files, tools, and external systems | 0.85 | 0.45 | 0.47 | Workflow depth |
| I need proactive scheduled work, not only reactive chat | 0.70 | 0.45 | 0.39 | Expansion opportunity |
| I need memory that I can inspect and control | 0.80 | 0.25 | 0.60 | High future leverage |
Top opportunities:
- Make AI work reviewable and acceptable.
- Make process evidence and governance visible.
- Turn accepted work into reusable skills and memory.
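The scores in the opportunity table are consistent with importance multiplied by the unmet share of satisfaction, i.e. importance × (1 − current satisfaction), rounded half-up to two decimals. That formula is inferred from the numbers, not stated in the report, so the sketch below should be read as an assumption. The short row labels are abbreviations introduced here for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

# (short label, importance, current satisfaction) copied from the opportunity table.
# Values are kept as strings so Decimal parses them exactly.
ROWS = [
    ("Reviewable tasks", "0.95", "0.30"),
    ("Evidence of Agent actions", "0.90", "0.35"),
    ("Reusable skills", "0.85", "0.40"),
    ("Private deployment", "0.90", "0.45"),
    ("Cross-tool work", "0.85", "0.45"),
    ("Scheduled work", "0.70", "0.45"),
    ("Inspectable memory", "0.80", "0.25"),
]


def opportunity_score(importance: str, satisfaction: str) -> Decimal:
    """Assumed formula: importance * (1 - satisfaction), rounded half-up."""
    raw = Decimal(importance) * (Decimal("1") - Decimal(satisfaction))
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def ranked(rows):
    """Return (label, score) pairs sorted from highest to lowest score."""
    scored = [(name, opportunity_score(imp, sat)) for name, imp, sat in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, score in ranked(ROWS):
        print(f"{score}  {name}")
```

Decimal arithmetic is used instead of floats so that borderline values such as 0.665 round the same way the table does. Sorted this way, inspectable memory (0.60) ranks second, just ahead of evidence (0.59).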
Product Positioning
Recommended primary positioning:
Beaver is an enterprise Agent execution and governance platform for repeatable knowledge work.
Supporting message:
It gives teams a private Agent sandbox where AI can use tools, manage files, execute tasks, record evidence, ask for acceptance, and learn reusable skills from approved work.
Avoid positioning Beaver as:
- A generic chatbot.
- A pure model gateway.
- A standalone RPA replacement.
- A developer-only Agent framework.
- A marketplace-only skill product.
Competitive Frame
| Category | Strength | Gap Beaver Addresses |
|---|---|---|
| AI chat apps | Fast answers and content generation | Weak task lifecycle, evidence, acceptance, and reuse |
| RPA / automation | Repeatable process execution | Rigid flows, harder natural-language adaptation |
| Agent frameworks | Developer flexibility | Missing complete user workspace and governance surface |
| Internal scripts | Fast local automation | Poor product UX, auditability, onboarding, and scaling |
| Enterprise AI platforms | Governance and admin | Often weak on task-level execution and skill learning loop |
Product Ideas
Generated from PM, design, and engineering perspectives.
PM Ideas
- Pilot Workflow Templates: package 3-5 high-value workflows such as weekly report, project brief, support triage, document review.
- Team Workspace Mode: group multiple users under one organization workspace with shared skills and controlled memory.
- Governance Scorecard: show evidence coverage, accepted tasks, skill reuse, failed runs, and tool risk.
- Skill Quality Lifecycle: strengthen candidate -> draft -> safety -> eval -> review -> publish -> version rollback.
- ROI Dashboard: measure time saved, accepted tasks, revision rounds, reusable skill adoption.
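The Skill Quality Lifecycle idea names an ordered chain of stages. A minimal sketch of how that chain could be enforced as a transition table follows. The stage names come from the list above; the `Stage` enum, the `advance` helper, the "send back to draft on failure" transitions, and treating rollback as terminal are all illustrative assumptions, not Beaver's actual implementation.

```python
from enum import Enum


class Stage(Enum):
    CANDIDATE = "candidate"
    DRAFT = "draft"
    SAFETY = "safety"
    EVAL = "eval"
    REVIEW = "review"
    PUBLISHED = "published"
    ROLLED_BACK = "rolled_back"


# Allowed next stages. Returning to DRAFT after a failed gate is an assumption.
ALLOWED = {
    Stage.CANDIDATE: {Stage.DRAFT},
    Stage.DRAFT: {Stage.SAFETY},
    Stage.SAFETY: {Stage.EVAL, Stage.DRAFT},
    Stage.EVAL: {Stage.REVIEW, Stage.DRAFT},
    Stage.REVIEW: {Stage.PUBLISHED, Stage.DRAFT},
    Stage.PUBLISHED: {Stage.ROLLED_BACK},
    Stage.ROLLED_BACK: set(),  # terminal in this sketch
}


def advance(current: Stage, target: Stage) -> Stage:
    """Return the target stage if the transition is allowed, else raise."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

Because `PUBLISHED` is reachable only from `REVIEW`, a draft cannot be published without passing the safety, eval, and review gates, which matches the report's position that human review stays mandatory.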
Design Ideas
- Work Inbox: unify tasks, scheduled runs, notifications, and pending reviews.
- Task Evidence Narrative: convert raw events into a readable "what happened" timeline.
- Memory Control Center: show what Beaver remembers, why, source, confidence, and edit/delete controls.
- First-Run Product Tour: guide a new user from provider setup to first accepted task.
- Admin Health Console: one page for instance, provider, connector, queue, and runtime health.
Engineering Ideas
- Tenant/Workspace Policy Profiles: control allowed tools, connectors, memory behavior, and publish gates per deployment.
- Connector Sandbox Layer: test external channel actions without touching production systems.
- Unified Evidence Schema: normalize task, tool, artifact, skill, memory, and connector events.
- Replay-Based Skill Evaluation: evaluate skill drafts against historical accepted runs.
- Instance Lifecycle Automation: suspend, resume, backup, restore, rotate secrets, inspect health.
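To make the Unified Evidence Schema idea concrete, here is a minimal sketch of one normalized event record plus a helper that renders events as a readable timeline. The six event kinds come from the idea description above; the `EvidenceEvent` class, its field names, and the `narrative` helper are hypothetical and do not reflect Beaver's existing data model.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any

# Event kinds named in the Unified Evidence Schema idea.
KINDS = {"task", "tool", "artifact", "skill", "memory", "connector"}


@dataclass(frozen=True)
class EvidenceEvent:
    kind: str                 # one of KINDS
    task_id: str              # task the event belongs to (hypothetical field)
    occurred_at: datetime     # must be timezone-aware
    summary: str              # one-line human-readable description
    payload: dict[str, Any] = field(default_factory=dict)  # raw details

    def __post_init__(self) -> None:
        if self.kind not in KINDS:
            raise ValueError(f"unknown evidence kind: {self.kind}")
        if self.occurred_at.tzinfo is None:
            raise ValueError("occurred_at must be timezone-aware")


def narrative(events: list[EvidenceEvent]) -> list[str]:
    """Render events in time order as 'HH:MM [kind] summary' lines."""
    ordered = sorted(events, key=lambda e: e.occurred_at)
    return [f"{e.occurred_at:%H:%M} [{e.kind}] {e.summary}" for e in ordered]
```

Keeping the raw details in `payload` and the reviewer-facing text in `summary` would let one record serve both audit export and the Task Evidence Narrative.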
Top 5 product ideas to pursue:
| Rank | Idea | Why Selected | Assumptions |
|---|---|---|---|
| 1 | Pilot Workflow Templates | Gives customers a concrete starting point | Initial buyers share common workflows |
| 2 | Task Evidence Narrative | Makes governance understandable | Reviewers value readable evidence |
| 3 | Memory Control Center | Unlocks long-term differentiation | Users trust memory if they can inspect/control it |
| 4 | Governance Scorecard | Helps buyers justify adoption | Platform owners need measurable proof |
| 5 | Instance Lifecycle Automation | Reduces pilot operational risk | Deployments will grow beyond a few instances |
Key Assumptions
| Assumption | Category | Impact | Uncertainty |
|---|---|---|---|
| Enterprise teams feel enough pain with chat-only AI to adopt an Agent workspace | Value | High | Medium |
| Task acceptance is a meaningful quality signal | Value | High | Medium |
| Users will tolerate a task workflow instead of expecting instant chat only | Usability | High | Medium |
| Per-instance deployment is operationally acceptable for early customers | Feasibility | High | Medium |
| Workflow owners can identify repeatable tasks worth piloting | Value | High | Low |
| Skill reuse creates visible productivity gains | Business Viability | High | High |
| Memory control is required before customers trust long-term memory | Trust | High | Medium |
| Connectors are necessary for customer stickiness | Value | Medium | Medium |
| Admins can manage model provider configuration without heavy support | Usability | Medium | Medium |
| The team can maintain broad product surface without quality drift | Team Capability | High | High |
Prioritized Assumptions
P0 Validate Immediately
| Assumption | Why It Matters | What Could Go Wrong | Validation |
|---|---|---|---|
| Customers prefer task-based AI execution over chat-only for real work | Core product wedge | Users see tasks as overhead | Run 3 workflow pilots and compare chat-only vs task loop |
| Evidence timeline increases trust | Governance story depends on it | Evidence is too technical or noisy | Reviewer usability test with task timelines |
| Private multi-instance deployment is acceptable | Adoption depends on ops fit | Setup too fragile or expensive | Deploy pilot on fresh Linux host and measure time/errors |
| Accepted tasks can generate reusable skills that users value | Learning loop depends on this | Skills are low quality or unused | Track reuse of skills from accepted pilot tasks |
P1 Important
| Assumption | Why It Matters | Validation |
|---|---|---|
| Memory control center is required before broad rollout | Trust and differentiation | Interview pilot admins and users |
| Connectors drive retention | External systems make workflows real | Compare pilot workflows with and without Outlook/IM connectors |
| Scheduled work creates high-value usage | Moves Beaver from reactive to proactive | Test weekly report and reminder workflows |
| Marketplace/skill distribution is a buyer requirement | Scaling reuse across teams | Ask platform owners during procurement |
P2 Later
| Assumption | Why It Matters | Validation |
|---|---|---|
| Multi-user team workspace is required for first paid pilots | Could reshape architecture | Validate with buyer interviews |
| Fine-grained per-tool policies are needed in UI | Admin complexity | Observe support requests |
| Cross-instance organization analytics is needed early | Enterprise reporting | Validate after 2-3 pilots |
Opportunity Solution Tree
Desired outcome:
Within 90 days, prove that a pilot team can complete repeatable AI-assisted work with acceptance, evidence, and reuse: at least 30 accepted tasks, 5 reusable skills, 2 recurring workflows, and 0 critical deployment/security incidents.
Outcome: Trusted repeatable Agent work in pilot teams
- Opportunity 1: I need AI outputs to become reviewable deliverables.
  - Solution 1.1: Task lifecycle with acceptance and revision.
    - Experiment: Run a project brief workflow and measure accepted output rate.
  - Solution 1.2: Task details page with evidence narrative.
    - Experiment: Ask reviewers to reconstruct what happened from timeline.
  - Solution 1.3: Work Inbox for pending reviews and scheduled outputs.
    - Experiment: Fake-door navigation item and measure clicks/asks.
- Opportunity 2: I need confidence that Agent tool use is controlled.
  - Solution 2.1: Tool traces and artifact timeline.
    - Experiment: Security review of 5 real tasks.
  - Solution 2.2: Admin health and policy console.
    - Experiment: Operator performs setup/debug checklist on fresh instance.
  - Solution 2.3: Connector sandbox and side-effect journals.
    - Experiment: Test external send/reply flows in sandbox mode.
- Opportunity 3: I need successful work to become reusable.
  - Solution 3.1: Skill candidate -> draft -> review -> publish.
    - Experiment: Convert 5 accepted tasks into skills and track reuse.
  - Solution 3.2: Memory Control Center.
    - Experiment: Prototype memory review UI and test trust/comprehension.
  - Solution 3.3: Pilot workflow templates.
    - Experiment: Package 3 templates and measure first-task success rate.
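The 90-day desired outcome is stated as four measurable thresholds, so it can be checked mechanically once pilot counts are collected. The sketch below assumes those counts are available as plain integers; `PilotMetrics` and `unmet_targets` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PilotMetrics:
    accepted_tasks: int
    reusable_skills: int
    recurring_workflows: int
    critical_incidents: int  # deployment or security incidents rated critical


# Minimum counts taken from the desired outcome statement.
TARGETS = {"accepted_tasks": 30, "reusable_skills": 5, "recurring_workflows": 2}


def unmet_targets(metrics: PilotMetrics) -> list[str]:
    """Return a description of every threshold the pilot has not met."""
    gaps = []
    for name, minimum in TARGETS.items():
        actual = getattr(metrics, name)
        if actual < minimum:
            gaps.append(f"{name}: {actual} < {minimum}")
    if metrics.critical_incidents > 0:
        gaps.append(f"critical_incidents: {metrics.critical_incidents} > 0")
    return gaps
```

An empty list means the pilot met the stated outcome. Returning the specific gaps rather than a single boolean keeps the result usable in a pilot scorecard.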
Validation Experiments
| Assumption | Hypothesis | Experiment | Duration | Success Criteria |
|---|---|---|---|---|
| Task loop beats chat-only | Users complete more usable work with task acceptance than plain chat | Same workflow performed in chat-only and Beaver task loop | 1 week | Beaver output accepted in fewer revision rounds |
| Evidence creates trust | Reviewers can understand and audit what happened | Give 5 timelines to reviewers | 2 days | >=80% of reviewers correctly identify the tools, artifacts, result, and risk |
| Deployment is pilot-ready | Fresh host setup is repeatable | Deploy on clean Linux/WSL2 machine using docs | 1 day | Setup under 2 hours with no undocumented step |
| Skills create reuse | Accepted tasks can become useful skills | Convert 5 pilot tasks into skills | 2 weeks | 3 skills reused at least twice |
| Memory needs control UI | Users trust memory more with inspect/edit/delete | Clickable prototype or simple page | 3 days | >=80% of participants say they would enable memory with controls |
| Scheduled work matters | Recurring workflows create repeat usage | Weekly report or reminder pilot | 2-4 weeks | At least 2 recurring jobs run and get accepted outputs |
Feature Prioritization
Must Have
| Feature | Impact | Effort | Risk | Reason |
|---|---|---|---|---|
| Auth portal and instance onboarding | High | High | Medium | Required for any user to start |
| Provider configuration flow | High | Medium | Medium | Model access is prerequisite |
| Chat workbench | High | High | Medium | Primary user surface |
| Task lifecycle and acceptance | High | High | Medium | Core differentiation |
| Task timeline/evidence | High | High | Medium | Governance and review |
| Files workspace | High | Medium | Medium | Most real workflows need files |
| Tool management | High | Medium | High | Agents need controlled action surface |
| Skills review/publish | High | High | High | Reuse loop |
| Settings/status/logs | High | Medium | Medium | Operational support |
| Basic deployment guide/runbook | High | Medium | Medium | Pilot readiness |
Should Have
| Feature | Impact | Effort | Risk | Reason |
|---|---|---|---|---|
| Pilot workflow templates | High | Medium | Low | Creates adoption path |
| Evidence narrative layer | High | Medium | Medium | Makes audit readable |
| Memory Control Center | High | High | Medium | Unlocks long-term trust |
| Skill replay/eval hardening | High | High | High | Makes learning safer |
| Scheduled workflow polish | Medium | Medium | Medium | Supports proactive use cases |
| Connector onboarding polish | Medium | High | High | Needed for real systems |
| Admin health console | Medium | Medium | Medium | Reduces support load |
Could Have
| Feature | Reason |
|---|---|
| Multi-user organization workspace | Valuable, but changes scope and permissions |
| Cross-instance analytics | Useful after multiple deployments |
| Fine-grained policy UI | Policy demand should be proven before adding UI complexity |
| Audit export | Strong sales support, not first pilot blocker |
| Cost/quality model router | Useful after usage volume grows |
Not Yet
| Feature | Reason |
|---|---|
| Broad public SaaS launch | Product and ops need pilot hardening first |
| Fully autonomous publish of skills | Human review should remain mandatory |
| Production writes through connectors without review | Trust risk |
| Complex enterprise RBAC before pilot validation | May overbuild before segment clarity |
Customer Research Plan
No direct interview transcripts were provided. Research should start immediately, before the roadmap is locked.
Participants
- 5 knowledge workers with recurring document/report/research workflows.
- 3 workflow owners or team leads.
- 3 enterprise AI platform/admin stakeholders.
- 2 security or IT reviewers.
- 2 engineers/operators who would deploy and maintain Beaver.
Questions
- What recurring work is painful enough to delegate to an Agent?
- What would make an AI output "acceptable" instead of just "interesting"?
- What evidence do you need to trust Agent work?
- What systems must the Agent connect to for the workflow to matter?
- What would make you stop a pilot?
- What memory or reuse behavior feels helpful vs risky?
- What does a successful 30-day pilot need to prove?
Interview Guide
Opening
We are studying how teams move AI from chat into real work. We are not asking whether you like an idea. We want examples of work you recently did.
Current Behavior
- Walk me through the last time you used AI for a real work deliverable.
- What happened after the AI gave an answer?
- What did you copy, edit, verify, or redo manually?
- Who reviewed the result?
Pain
- What was the slowest or most annoying part?
- What made the output hard to trust?
- What tools or files were involved?
- What evidence did you need but did not have?
Reuse
- Have you repeated a similar workflow since then?
- Did you reuse prompts, templates, scripts, or notes?
- What would make that reuse safe for a team?
Governance
- What AI actions would need approval?
- What data or tools should be off limits?
- Who needs to see the history of what happened?
Pilot
- Which one workflow would you test first?
- What result would make you expand usage?
- What failure would make you stop?
Recommended Next 30 Days
- Pick 2-3 pilot workflows: project brief, weekly report, document review, support triage, or file processing.
- Run a fresh deployment rehearsal from the README and deployment guide, and record gaps.
- Define pilot learning questions and instrument the events needed to answer them.
- Create a task evidence narrative prototype on top of existing timeline data.
- Package pilot workflow templates as skills or documented demos.
- Validate provider onboarding with 3 non-engineer users.
- Run security review for file boundaries, tool execution, connectors, and deploy-control exposure.
- Decide which connector(s) are pilot-safe.
Recommended Next 90 Days
- Complete Memory Control Center MVP.
- Harden skill learning with replay/eval and publish gates.
- Add Admin Health Console for provider, instance, connector, task queue, and runtime status.
- Improve instance lifecycle: suspend, resume, backup, restore, rotate secrets.
- Add customer-facing pilot scorecard.
- Formalize tool/connector policy profiles.
- Expand pilot from one workflow to one department.
- Build audit export after evidence narrative stabilizes.
Biggest Risks
| Risk | Severity | Mitigation |
|---|---|---|
| Product appears too broad and hard to adopt | High | Lead with pilot workflows and task loop |
| Deployment complexity blocks pilots | High | Rehearsed runbook, health checks, support checklist |
| Agent actions cause unintended side effects | Critical | Conservative tool policy, explicit connector sandboxing, evidence logs |
| Task evidence is too technical | High | Evidence narrative and reviewer testing |
| Skill learning publishes poor methods | High | Human review, safety/eval, replay validation |
| Memory feels creepy or uncontrollable | High | Memory control UI before broad enablement |
| Team spreads effort across too many modules | High | Prioritize task loop, evidence, skills, deployment reliability |
Recommended Immediate Actions
- Reframe all main product docs around Beaver as an Agent execution and governance platform.
- Treat Skill Replay Eval as a subfeature under the skill governance loop.
- Build the next roadmap around pilot workflows, not isolated modules.
- Make accepted tasks the main success metric.
- Productize memory and evidence before adding many new connectors.
- Prove deployment repeatability before selling broad private deployments.