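steven_li fc9fd93c36 feat: support multilingual prompt localization and UI improvements

- Add a prompt_locale parameter supporting prompt localization in Simplified Chinese, Traditional Chinese, and English
- Remove the built-in agents configuration to simplify the system architecture
- Update ContextBuilder to use dynamic prompt templates instead of hard-coded content
- Pass the locale parameter through AgentLoop, the Web interface, and AgentService
- Add an output-language instruction so that user-interface content is generated in the specified language
- Extend the frontend LanguageSwitcher component to support three language options
- Improve responsive layout and text truncation in the Header and sidebar components
- Update test cases to verify prompt correctness under different locales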

2026-06-10 16:11:05 +08:00

Skill Replay Eval Product Discovery Report

Date: 2026-06-09

Product stage: existing product

Primary feature: Skill Replay Eval for Beaver skill learning and publishing

Source context:

Existing product and deployment: README.md, 部署指南.md
Feature design: docs/superpowers/specs/2026-06-08-skill-replay-eval-design.md
Delivery plan: docs/superpowers/plans/2026-06-08-skill-replay-eval.md
Current implementation signals: beaver/skills/learning/{case_selection,preservation,replay,surrogate,eval}.py, Skills page replay report UI, publish gate checks
Customer positioning: docs/presentations/skill-replay-eval/index.html

Executive Summary

Beaver is positioned as an enterprise Agent execution and governance platform. Its core value is not only running tasks, but also making AI work traceable, acceptable, reusable, and governable. Skill Replay Eval is the quality gate that makes the "reusable skill" promise credible: before a skill draft is published, Beaver should test whether it improves or preserves real task behavior.

The current design correctly identifies the product risk: heuristic-only skill scoring is not enough for enterprise trust. A draft skill can look complete in text while causing tool misuse, dropping safety instructions, or regressing accepted workflows. Replay evaluation closes this gap by comparing baseline and candidate behavior on accepted historical tasks, classifying tool calls into executed, surrogate, or blocked modes, and adding a preservation check for revised skills.

The product direction should be: ship replay eval as a staged trust feature, not as a perfect benchmark system. The first release should make evaluation coverage and uncertainty visible, block obvious regressions, and give reviewers enough evidence to approve or reject drafts. The next releases should improve case quality, sandbox isolation, surrogate judgment quality, and operational dashboards.

Product Summary

Product Description

Skill Replay Eval is a review and publishing gate for Beaver skills. It evaluates skill drafts against prior accepted task runs and shows whether the draft improves, preserves, or harms real task outcomes. It separates safe tool execution from surrogate evaluation for unsafe or unavailable tools, and it checks whether revised skill drafts preserve important original instructions.

Target Users

Segment	Job To Be Done	Success Looks Like
Enterprise AI platform owner	Govern reusable Agent capabilities before they spread across teams	No risky skill is published without evidence, review, and audit trail
Skill reviewer / admin	Decide whether a skill draft is good enough to approve	Replay report explains score, coverage, regressions, and preservation risks
Internal workflow owner	Convert accepted tasks into repeatable team methods	Similar future tasks become faster and more reliable
Engineer / implementer	Build and debug the eval pipeline	Replay failures are reproducible, scoped, and observable
Security / IT reviewer	Understand side effects and tool risk	Production writes are not executed during automatic replay

Current Features

Existing Beaver product capabilities relevant to this feature:

Task lifecycle: route, plan, execute, track, accept, modify, or abandon.
Evidence and timeline: tool calls, artifacts, task status, and validation signals.
Skill learning: candidates, drafts, safety report, eval report, review, publish.
Multi-instance deployment: isolated app-instance per user/team via Docker.
Tool and connector framework: local tools, MCP tools, external connectors, files, web/search, scheduled tasks.

Current Skill Replay Eval implementation signals:

SkillDraftEvalReport has replay fields: mode, eval version, execution coverage, surrogate coverage, blocked coverage, confidence, case reports, tool mode summary, and preservation report.
select_replay_cases() selects up to 10 accepted historical runs by candidate type.
ReplayToolExecutor classifies tool calls as executed, surrogate, or blocked.
ReplayRunner runs baseline and candidate arms through AgentLoop with a replay tool executor.
SurrogateToolEvaluator scores non-executed calls through deterministic intended-call heuristics.
Publish gates block low-confidence replay reports, fully blocked replay reports, and failed preservation reports.
Skills UI exposes execution coverage, surrogate coverage, confidence, replay cases, raw case reports, and preservation reports.
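
The publish gate behavior listed above can be sketched as follows. This is a minimal illustration, not Beaver's implementation: `ReplayGateInput`, its field names, and the `MIN_CONFIDENCE` value are hypothetical stand-ins for the replay fields on SkillDraftEvalReport, and the real thresholds are still to be decided.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReplayGateInput:
    """Hypothetical, reduced view of the replay fields on an eval report."""
    mode: str                                   # e.g. "replay" or "skipped"
    confidence: float                           # 0.0 to 1.0
    blocked_coverage: float                     # blocked calls / all replay calls
    preservation_passed: Optional[bool] = None  # None when no check was run


# Assumed value for illustration only.
MIN_CONFIDENCE = 0.5


def publish_gate_failures(report: ReplayGateInput) -> list[str]:
    """Return the reasons to hold a draft back; an empty list means the gate passes."""
    failures = []
    if report.mode == "replay":
        if report.confidence < MIN_CONFIDENCE:
            failures.append("low_confidence")
        if report.blocked_coverage >= 1.0:
            failures.append("fully_blocked")
    if report.preservation_passed is False:
        failures.append("preservation_failed")
    return failures
```

Returning a list of reasons rather than a single boolean keeps the gate explainable: the Skills UI could show each failure next to the evidence that triggered it.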

Current Architecture

Accepted task runs
  -> SkillLearningCandidate
  -> SkillDraft
  -> case selection
  -> baseline arm and candidate arm
  -> replay tool executor
       -> executed tools for safe toolsets
       -> surrogate traces for external writes or unsafe integrations
       -> blocked traces for destructive calls
  -> surrogate scoring and coverage aggregation
  -> preservation checker for revise/merge
  -> SkillDraftEvalReport
  -> Skills review UI
  -> publish gate
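
The classification step in the flow above can be illustrated with a small lookup policy. The toolset names, action names, and the `classify_tool_call` helper are hypothetical; the actual ReplayToolExecutor may rely on different signals.

```python
from enum import Enum


class ToolMode(Enum):
    EXECUTED = "executed"
    SURROGATE = "surrogate"
    BLOCKED = "blocked"


# Hypothetical policy tables for illustration only.
SAFE_TOOLSETS = {"filesystem_read", "web_search"}
DESTRUCTIVE_ACTIONS = {"delete", "drop", "revoke"}


def classify_tool_call(toolset: str, action: str) -> ToolMode:
    """Classify a replayed tool call into executed, surrogate, or blocked."""
    # Destructive actions are checked first so they stay blocked even in a safe toolset.
    if action in DESTRUCTIVE_ACTIONS:
        return ToolMode.BLOCKED
    if toolset in SAFE_TOOLSETS:
        return ToolMode.EXECUTED
    # Anything unknown or external is never executed; it is only surrogate-evaluated.
    return ToolMode.SURROGATE
```

The default branch is the important design choice in this sketch: a tool that nobody has classified yet falls into surrogate mode instead of being executed.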

Product boundary:

Replay Eval should evaluate skill behavior, not replace human review.
Replay Eval should never write to production workspace, user files, external accounts, third-party systems, credentials, permissions, or payments by default.
Low confidence should increase review burden instead of creating false certainty.
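
One way to enforce the "never write to production" boundary for file tools is to re-root every requested path inside a per-arm sandbox directory. The sketch below assumes such a directory exists for each replay arm; `sandbox_path` and the example paths are hypothetical and not part of Beaver's code.

```python
from pathlib import Path
from typing import Optional


def sandbox_path(sandbox_root: Path, requested: str) -> Optional[Path]:
    """Map a tool-requested path into the sandbox; return None if it would escape."""
    root = sandbox_root.resolve()
    # Strip leading slashes so absolute paths are re-rooted inside the sandbox.
    candidate = (root / requested.lstrip("/")).resolve()
    if candidate == root or root in candidate.parents:
        return candidate
    # Callers should treat None as a blocked call, not fall back to the raw path.
    return None
```

For example, `sandbox_path(Path("/tmp/replay-arm-a"), "/reports/out.md")` stays under the arm's directory, while `"../prod/data.txt"` resolves outside it and yields `None`.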

Current Value Proposition

For enterprise users, Beaver can say: "Accepted work can become reusable skills, and those skills are checked against real task behavior before they are published." This directly supports Beaver's larger promise of controlled, traceable, reusable Agent execution.

Current Challenges

Challenge	Product Impact	Current Risk
Historical accepted runs may be sparse or low quality	Replay evidence can be weak	Medium
Surrogate scoring is currently simple	Unsafe tool calls may be judged with low fidelity	High
Replay environment isolation must be enforceable	Enterprise trust depends on no accidental production side effects	High
Reviewers need clear explanations	Raw case reports can overwhelm non-engineers	Medium
Publish gates may be too strict or too loose	Either slows adoption or lets regressions through	Medium
Skill preservation is section-based	Important instruction changes inside a section may be missed	Medium
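
The last challenge in the table can be made concrete. The sketch below assumes skills are Markdown-like text with `#` headings and that preservation only compares section titles; both are assumptions for illustration, and `section_titles` and `dropped_sections` are hypothetical helpers.

```python
import re


def section_titles(skill_text: str) -> set[str]:
    """Collect heading titles from lines that start with one or more '#'."""
    return {
        match.group(1).strip().lower()
        for match in re.finditer(r"^#+\s+(.+)$", skill_text, flags=re.MULTILINE)
    }


def dropped_sections(original: str, revised: str) -> set[str]:
    """Sections present in the original skill but missing from the revision."""
    return section_titles(original) - section_titles(revised)


original = "# Steps\nRun tests.\n# Safety\nNever delete user files.\n"
removed = "# Steps\nRun tests.\n"                         # Safety section removed
weakened = "# Steps\nRun tests.\n# Safety\nBe careful.\n"  # Safety body weakened

print(dropped_sections(original, removed))   # {'safety'}
print(dropped_sections(original, weakened))  # set(): the weakened rule is not detected
```

The second call shows the gap: the section still exists, so a title-level check passes even though the instruction inside it changed.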

Missing Information And Ambiguities

No real customer interview data is provided for skill reviewers, enterprise admins, or workflow owners.
No baseline metrics exist for current heuristic eval false positives or false negatives.
No defined quality threshold exists for minimum acceptable replay coverage per skill category.
No clear operational owner is assigned for replay failures, low confidence reports, or blocked tool classifications.
No explicit policy matrix exists per toolset, customer deployment mode, or tenant risk tier.
No customer-facing language has been finalized for explaining surrogate evaluation limitations.

User Segments

Segment 1: Skill Governance Admin

This user owns skill approval. They need a reliable way to decide whether a skill should be published. Their main pain is that a skill draft can appear well-written but still fail on real tasks.

Segment 2: Enterprise AI Platform Buyer

This user evaluates Beaver as an internal AI platform. They care about risk, adoption, cost, governance, and operational control. They need to see that reusable Agent capabilities are not published blindly.

Segment 3: Workflow Owner

This user has repeatable work such as weekly reports, project delivery, technical support, or file processing. They want accepted workflows to become faster and more consistent over time.

Segment 4: Beaver Engineer / Operator

This user debugs replay failures, expands safe tool coverage, adjusts publish gates, and keeps the eval pipeline reliable.

JTBD

User	Job Story	Current Alternative	Desired Outcome
Skill reviewer	When a skill draft is ready, I want to see whether it works on prior accepted tasks, so I can approve it with evidence	Read the draft manually	Approve, reject, or revise with confidence
Admin	When a skill touches tools, I want to know what would execute, what is simulated, and what is blocked, so I can manage risk	Trust reviewer judgment	Clear coverage and side-effect evidence
Workflow owner	When my accepted task becomes a reusable skill, I want it to preserve what made the original task successful	Rewrite prompts manually	Similar future work gets better
Operator	When replay fails, I want to know whether the issue is provider, tool policy, case data, or candidate behavior	Read logs manually	Fast diagnosis and recovery

Alternative Product Positioning

Positioning	Strength	Weakness	Recommendation
"Skill unit tests for Agents"	Easy for engineers to understand	Too narrow; suggests deterministic tests only	Use in engineering docs
"Replay-based skill quality gate"	Accurate and product-relevant	Needs explanation for non-technical buyers	Primary internal positioning
"Enterprise Agent governance evidence"	Strong for buyers	Less precise for builders	Use in sales and customer docs
"A/B testing for skill drafts"	Captures baseline vs candidate	May imply live user traffic experiments	Use carefully

Recommended positioning:

Skill Replay Eval is Beaver's evidence-based quality gate for reusable Agent skills. It replays accepted historical tasks, compares baseline and candidate behavior, and exposes execution coverage, surrogate coverage, regressions, and preservation risk before publication.

Opportunity Areas

Opportunity	Importance	Current Satisfaction	Opportunity Score	Notes
I need proof that a skill draft improves real task behavior	0.95	0.25	0.71	Core opportunity
I need automatic replay to avoid unsafe side effects	0.95	0.35	0.62	Required for enterprise trust
I need reports that are understandable to reviewers	0.85	0.35	0.55	Key adoption driver
I need preservation of existing skill instructions	0.80	0.45	0.44	Important for revisions
I need replay failures to be diagnosable	0.75	0.40	0.45	Operational maturity
I need configurable policy per deployment	0.70	0.30	0.49	Later enterprise hardening

Top opportunities:

Evidence that a draft improves or preserves accepted task behavior.
Safe replay with explicit executed/surrogate/blocked coverage.
Reviewer-facing explanation that turns raw traces into decisions.
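
The scores in the table are consistent with Opportunity Score = Importance x (1 - Current Satisfaction), rounded to two decimals. The short sketch below recomputes them under that assumption; the short opportunity labels are abbreviations introduced here.

```python
def opportunity_score(importance: float, satisfaction: float) -> float:
    """Higher importance and lower current satisfaction give a larger score."""
    return importance * (1.0 - satisfaction)


# (opportunity, importance, current satisfaction) copied from the table above.
OPPORTUNITIES = [
    ("proof of improvement", 0.95, 0.25),
    ("safe replay", 0.95, 0.35),
    ("understandable reports", 0.85, 0.35),
    ("instruction preservation", 0.80, 0.45),
    ("diagnosable failures", 0.75, 0.40),
    ("configurable policy", 0.70, 0.30),
]

ranked = sorted(
    ((name, round(opportunity_score(imp, sat), 2)) for name, imp, sat in OPPORTUNITIES),
    key=lambda item: item[1],
    reverse=True,
)

for name, score in ranked[:3]:
    print(f"{name}: {score:.2f}")
# proof of improvement: 0.71
# safe replay: 0.62
# understandable reports: 0.55
```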

Product Expansion Ideas

Generated from PM, Designer, and Engineer perspectives.

Product Manager Ideas

Replay Readiness Score: show whether a draft has enough historical evidence before eval starts.
Skill Release Gate Levels: allow advisory, strict, and regulated gates per workspace.
Regression Triage Queue: collect failed cases and route them to skill authors.
Customer-facing Audit Export: export replay report as PDF/Markdown for security review.
Skill Quality Trend: show whether a skill improves or degrades across versions.

Product Designer Ideas

Reviewer Decision View: summarize "approve / revise / reject" with reasons before raw JSON.
Coverage Timeline: visualize executed, surrogate, and blocked calls per case.
Preservation Diff: show dropped or changed sections in a readable comparison.
Replay Case Drilldown: task text, expected behavior, baseline output, candidate output, and validator notes.
Confidence Language: translate low/medium/high confidence into concrete reviewer actions.

Engineer Ideas

Pluggable Tool Policy Registry: classify tools by toolset, transport, metadata, and deployment risk.
Deterministic Replay Fixtures: save replay inputs and traces for reproducible debugging.
Sandbox User File Namespace: isolate user-file writes per replay arm.
LLM Surrogate Provider: replace deterministic heuristics with structured model judgment when available.
Replay Telemetry: metrics for replay latency, failure mode, blocked coverage, and provider availability.

Top 5 selected ideas:

Rank	Idea	Why Selected	Assumptions To Validate
1	Reviewer Decision View	Converts technical eval into action	Reviewers trust summarized recommendations
2	Sandbox User File Namespace	Directly addresses production side-effect risk	Existing file tooling can be redirected cleanly
3	LLM Surrogate Provider	Improves unsafe tool judgment quality	LLM judgment is consistent enough for review support
4	Replay Readiness Score	Prevents weak reports from appearing authoritative	Enough metadata exists to estimate readiness
5	Preservation Diff	Makes revision risk visible and actionable	Section and body-level diffs catch meaningful drops

Key Assumptions

Assumption	Category	Impact	Uncertainty
Accepted historical runs are representative enough to evaluate future skill behavior	Value	High	High
Reviewers will use replay reports to make better publish decisions	Value	High	Medium
Safe tools can execute in isolation without leaking state or causing production side effects	Feasibility	High	High
Surrogate evaluation can judge unsafe tool calls well enough to support review	Feasibility	High	High
Coverage and confidence are understandable to non-engineer reviewers	Usability	Medium	High
Publish gates will reduce risky releases without blocking too many useful skills	Viability	High	Medium
Skill preservation can be detected with lightweight section checks in v1	Feasibility	Medium	Medium
Replay latency will be acceptable for review workflows	Usability	Medium	Medium
Customers will value replay eval enough to differentiate Beaver from generic Agent tools	Business Viability	High	Medium
The team can maintain tool policy as tools/connectors grow	Team Capability	High	Medium

Prioritized Assumptions

Priority = Impact x Uncertainty.

P0 Validate Immediately

Assumption	Why It Matters	What Could Go Wrong	Suggested Validation
Safe replay isolation is real, not only conceptual	One accidental external write can break trust	Replay calls production filesystem, connector, or credential paths	Technical isolation test with destructive and external-write tools
Replay reports help reviewers make better decisions	Product value depends on review decisions changing	Reports are too raw, ignored, or misunderstood	Reviewer usability test with 5 draft decisions
Surrogate evaluation is good enough for unsafe tools	Many enterprise tools cannot execute in replay	It rubber-stamps bad calls or flags good calls	Golden set of unsafe tool scenarios scored by humans vs surrogate
Historical accepted cases are adequate for eval	Weak cases create false confidence	Too few accepted runs or repetitive cases	Analyze real run store coverage across skills

P1 Important

Assumption	Why It Matters	Validation
Publish gate thresholds are calibrated	Prevents both overblocking and underblocking	Run shadow mode for 2 weeks and compare gate outcomes with human decisions
Preservation checker catches meaningful draft regressions	Revision safety depends on it	Compare section checker with manual diff review
Replay latency fits review workflow	Slow eval hurts adoption	Measure p50/p95 per case and per draft
Customers understand confidence and coverage language	Trust depends on clear communication	Customer-facing report comprehension test

P2 Later

Assumption	Why It Matters	Validation
Per-tool policy UI is needed	May not be needed in v1	Observe support/admin requests
Audit export becomes a buying requirement	Useful for enterprise sales	Ask pilot buyers during procurement review
Skill quality trend is a major retention driver	Useful after multiple versions exist	Measure repeat reviewer usage after v1

Opportunity Solution Tree

Desired outcome:

Increase trusted skill publication: at least 80% of approved skill drafts have replay or explicit skipped-provider evidence, zero known production side effects from replay, and reviewer decision time under 10 minutes for common drafts.

Outcome: Trusted skill publication

Opportunity 1: I need proof that a skill draft improves real task behavior.
  Solution 1.1: Baseline vs candidate replay on accepted historical tasks.
    Experiment: Run replay on 10 recent skill drafts and compare with manual reviewer judgment.
  Solution 1.2: Replay readiness score before evaluation starts.
    Experiment: Score existing candidates and check whether low-readiness reports are less useful.
  Solution 1.3: Regression triage queue.
    Experiment: Manually label failed cases for two weeks and measure fix rate.

Opportunity 2: I need replay to avoid unsafe side effects.
  Solution 2.1: Tool mode classification: executed, surrogate, blocked.
    Experiment: Golden tool policy test set covering filesystem, MCP, connectors, delete, send, publish.
  Solution 2.2: Isolated workspace and user-file namespace per arm.
    Experiment: Replay write task and verify no production paths change.
  Solution 2.3: Side-effect journal in each case report.
    Experiment: Security reviewer reads 5 reports and identifies all intended side effects.

Opportunity 3: I need reports I can act on.
  Solution 3.1: Reviewer decision summary with approve/revise/reject guidance.
    Experiment: First-click and decision-time test with reviewers.
  Solution 3.2: Coverage and confidence explanation.
    Experiment: Ask reviewers to explain report meaning after reading it.
  Solution 3.3: Preservation diff for revisions.
    Experiment: Seed dropped-instruction drafts and measure detection rate.

Validation Experiments

P0 Assumption	Hypothesis	Experiment	Cost	Duration	Success Criteria	Failure Criteria
Safe replay isolation	Replay can execute safe tools without touching production state	Build a replay fixture that writes, reads, sends, deletes, and publishes through classified tools	Medium	2-4 days	100% of production paths untouched; destructive calls blocked; external writes routed to surrogate	Any real external write or production path mutation
Reviewer decision value	Replay reports improve approval accuracy and speed	Give 5 reviewers 8 historical drafts with and without replay report	Low	2 days	Decision accuracy +25%; median decision time under 10 minutes	No improvement or reports misunderstood
Surrogate quality	Surrogate scoring agrees with human reviewer on unsafe tool calls	Create 30 unsafe-tool scenarios and compare human labels vs surrogate output	Medium	3-5 days	>=80% agreement on pass/fail; all high-risk bad calls flagged	High-risk false pass
Historical case adequacy	Accepted runs provide enough useful replay cases	Audit run store across top 10 skills/candidates	Low	1 day	>=70% candidates have >=3 meaningful accepted cases	Most candidates have no usable cases
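
The surrogate quality experiment can be scored with a few lines of code. This sketch assumes each scenario carries a human pass/fail label, the surrogate's pass/fail output, and a high-risk flag; the `Scenario` type and `surrogate_quality` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    human_pass: bool      # human reviewer judged the tool call acceptable
    surrogate_pass: bool  # surrogate evaluator judged the tool call acceptable
    high_risk: bool       # scenario involves a high-risk tool call


def surrogate_quality(scenarios: list[Scenario]) -> tuple[float, int, bool]:
    """Return (agreement rate, high-risk false passes, experiment passed)."""
    if not scenarios:
        raise ValueError("at least one scenario is required")
    agreements = sum(s.human_pass == s.surrogate_pass for s in scenarios)
    # A high-risk false pass: humans rejected the call but the surrogate accepted it.
    false_passes = sum(
        s.high_risk and not s.human_pass and s.surrogate_pass for s in scenarios
    )
    agreement = agreements / len(scenarios)
    # Success criteria from the table: >=80% agreement and no high-risk false pass.
    return agreement, false_passes, agreement >= 0.80 and false_passes == 0
```

A single high-risk false pass fails the experiment regardless of the overall agreement rate, which mirrors the failure criterion in the table.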

Feature Prioritization

Must Have

Feature	Impact	Effort	Risk	Strategic Alignment
Eval report compatibility fields	High	Low	Low	Required foundation
Historical accepted case selection	High	Medium	Medium	Required for behavior evidence
Baseline vs candidate replay arms	High	High	High	Core value
Tool mode classification	High	Medium	High	Core trust boundary
Replay coverage and confidence report	High	Medium	Medium	Reviewer decision support
Publish gates for failed/low-confidence replay	High	Low	Medium	Governance promise
Preservation check for revise/merge drafts	Medium	Medium	Medium	Prevents silent instruction loss
Skills UI report summary	High	Medium	Medium	Adoption requirement

Should Have

Feature	Impact	Effort	Risk	Strategic Alignment
Reviewer decision summary	High	Medium	Medium	Converts evidence to action
Preservation diff view	Medium	Medium	Low	Improves reviewer comprehension
Replay readiness score	Medium	Medium	Medium	Prevents false confidence
Operational metrics dashboard	Medium	Medium	Low	Needed for maintenance
Golden tool policy test suite	High	Medium	Medium	Needed for safety assurance

Could Have

Feature	Impact	Effort	Risk	Strategic Alignment
Audit export	Medium	Medium	Low	Enterprise sales support
Skill quality trend	Medium	Medium	Medium	Useful after version history grows
Per-tool admin policy UI	Medium	High	Medium	Enterprise customization
Replay fixtures download	Low	Medium	Low	Debugging convenience

Not Yet

Feature	Reason
Full Docker orchestration per replay case	Too heavy for first release; design explicitly scopes it out
Production third-party write replay	Violates trust boundary
Removing human review	Replay evidence should support review, not replace it
Fully customizable policy UI	Add after policy needs are observed

Features to cut from v1:

Per-tool policy UI.
Audit export.
Skill quality trend.
Full Docker-per-case orchestration.

Features likely over-engineered for v1:

Customer-configurable replay policies before default policy is proven.
Complex statistical scoring before case quality and surrogate accuracy are validated.
Automatic publish for high-scoring drafts.

Metrics Dashboard

North Star Metric

Trusted Skill Publish Rate:

(Approved skill drafts with usable eval evidence and no post-publish regression reports) / (total approved skill drafts), measured weekly.

Target for v1 pilot: >=80%.
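
A minimal sketch of the weekly calculation, assuming each approved draft is available as a record with two fields. The field names `has_usable_eval_evidence` and `post_publish_regressions` are hypothetical and chosen only to mirror the definition above.

```python
from typing import Optional


def trusted_publish_rate(approved_drafts: list[dict]) -> Optional[float]:
    """Share of approved drafts with usable eval evidence and no regression reports."""
    if not approved_drafts:
        return None  # no approvals this week, so the rate is undefined
    trusted = sum(
        1
        for draft in approved_drafts
        if draft["has_usable_eval_evidence"] and draft["post_publish_regressions"] == 0
    )
    return trusted / len(approved_drafts)


week = [
    {"has_usable_eval_evidence": True, "post_publish_regressions": 0},
    {"has_usable_eval_evidence": True, "post_publish_regressions": 0},
    {"has_usable_eval_evidence": True, "post_publish_regressions": 0},
    {"has_usable_eval_evidence": True, "post_publish_regressions": 0},
    {"has_usable_eval_evidence": False, "post_publish_regressions": 0},
]
print(trusted_publish_rate(week))  # 0.8, which meets the >=80% pilot target
```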

Input Metrics

Metric	Definition	Data Source	Visualization	Target	Alert Threshold
Replay Evidence Coverage	Draft eval reports with mode `replay` or explicit skipped-provider status / all eval reports	Skill eval store	Weekly line	>=80%	<60% for 2 weeks
Executed Tool Coverage	Executed tool calls / all replay tool calls	Case reports	Stacked bar	>=50% for safe-tool skills	<25% for safe-tool skills
Surrogate Coverage	Surrogate tool calls / all replay tool calls	Case reports	Stacked bar	Transparent, not necessarily low	Sudden +30% week over week
Blocked Coverage	Blocked tool calls / all replay tool calls	Case reports	Stacked bar	<10%	>=25% or any blocked_coverage=1.0
Reviewer Decision Time	Time from eval report created to approve/reject/revise	Review events	Median and p95	Median <10 min	p95 >30 min
Replay Regression Rate	Reports with regression_count > 0 / replay reports	Eval store	Weekly line	Investigate, not zero-forced	>20%
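
The three coverage metrics in the table share one denominator, so they can be derived together from the tool modes recorded in case reports. The sketch assumes each replayed tool call is logged as one of the strings `executed`, `surrogate`, or `blocked`; the helper names are hypothetical.

```python
from collections import Counter

MODES = ("executed", "surrogate", "blocked")


def coverage(tool_modes: list[str]) -> dict[str, float]:
    """Return each mode's share of all replay tool calls."""
    counts = Counter(tool_modes)
    total = sum(counts[mode] for mode in MODES)
    if total == 0:
        return {mode: 0.0 for mode in MODES}
    return {mode: counts[mode] / total for mode in MODES}


def blocked_alert(cov: dict[str, float]) -> bool:
    """Alert threshold from the table: blocked coverage of 25% or more."""
    return cov["blocked"] >= 0.25


calls = ["executed"] * 7 + ["surrogate"] * 2 + ["blocked"]
print(coverage(calls))  # {'executed': 0.7, 'surrogate': 0.2, 'blocked': 0.1}
```

Because the shares always sum to 1.0 when any calls exist, a stacked bar per draft or per week shows the full picture without a separate total.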

Leading Indicators

Number of accepted runs eligible for replay per skill.
Percentage of candidates with at least 3 replay cases.
Provider unavailable skip rate.
Replay error or partial status rate.
Preservation failures per revised skill draft.
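
The second indicator is simple to compute once replay case counts per candidate are known. The sketch below assumes a mapping from candidate name to the number of eligible replay cases; the candidate names are invented for illustration.

```python
def share_with_enough_cases(cases_per_candidate: dict[str, int], minimum: int = 3) -> float:
    """Fraction of candidates that have at least `minimum` replay cases."""
    if not cases_per_candidate:
        return 0.0
    enough = sum(1 for count in cases_per_candidate.values() if count >= minimum)
    return enough / len(cases_per_candidate)


# Hypothetical audit result: candidate name -> eligible accepted runs.
audit = {"weekly-report": 5, "file-cleanup": 1, "support-triage": 3, "release-notes": 0}
print(share_with_enough_cases(audit))  # 0.5
```

In this invented audit, half of the candidates reach three cases, which would fall short of the >=70% success criterion used in the historical case adequacy experiment.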

Guardrail Metrics

Guardrail	Definition	Alert
Production Side Effect Incidents	Any replay-caused write to production workspace, user files, credentials, or external systems	Immediate P0
False Pass Incidents	Published draft later confirmed to regress an accepted workflow despite passing replay	Weekly review; P1 if repeated
False Block Incidents	Useful draft blocked due to bad policy or low-confidence bug	Weekly review
Replay Latency	p95 replay completion time per draft	Alert if p95 >15 minutes in pilot
Report Comprehension	Reviewers correctly explain coverage/confidence in usability tests	Rework UI copy if <80%
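
The latency guardrail depends on how p95 is calculated. The sketch below uses the nearest-rank method, which is one common convention and an assumption here; the sample latencies are invented.

```python
import math


def percentile_nearest_rank(values: list[float], percent: int) -> float:
    """Nearest-rank percentile: the smallest value covering `percent` of the samples."""
    if not values:
        raise ValueError("at least one sample is required")
    ordered = sorted(values)
    rank = math.ceil(len(ordered) * percent / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical replay completion times per draft, in minutes.
latencies_min = [4, 6, 5, 9, 7, 12, 8, 5, 6, 18]
p50 = percentile_nearest_rank(latencies_min, 50)
p95 = percentile_nearest_rank(latencies_min, 95)
print(p50, p95, p95 > 15)  # 6 18 True
```

With only ten samples, p95 is driven by the single slowest draft, so the alert is best read alongside the sample count during an early pilot.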

Review Cadence

Daily during pilot: replay errors, side-effect alerts, provider skips.
Weekly: publish outcomes, regression rate, reviewer decision time, blocked/surrogate coverage.
Monthly: threshold calibration and customer feedback.
Quarterly: policy model, scoring model, and roadmap review.

Customer Research Plan

No customer interviews or support tickets were provided. Run research before treating demand and usability assumptions as validated.

Research Participants

3-5 internal skill reviewers or admins.
3 workflow owners who want accepted tasks converted into reusable skills.
2 enterprise/security stakeholders who review AI governance.
2 engineers/operators responsible for deployment and incident response.

Research Questions

What evidence do reviewers need before approving a reusable skill?
Which replay report fields are meaningful, and which are noise?
Do users understand executed vs surrogate vs blocked coverage?
What level of uncertainty is acceptable for publishing?
What customer-facing proof is needed for enterprise pilots?
Which tool categories must never execute during replay?

Recommended Actions

Run a moderated reviewer test with current Skills page report.
Create 5 seeded draft cases: clear improvement, clear regression, unsafe external write, preservation drop, provider unavailable.
Ask participants to approve/revise/reject each case and explain why.
Compare their decisions with current publish gate behavior.

Interview Guide

Objectives

Validate whether replay evidence changes approval behavior.
Identify confusing report language.
Understand risk tolerance for surrogate and blocked calls.
Learn what artifacts enterprise buyers need for adoption.

Warm-Up

Tell me about the last time you reviewed or approved reusable AI guidance, prompts, tools, or workflows.
What made the approval easy or hard?
What happened after it was approved?

JTBD Questions

Walk me through the last time an AI workflow worked well enough that you wanted to reuse it.
What evidence did you have that it would work again?
What would make you hesitate to publish it for others?
What does "safe to publish" mean in your environment?

Behavioral Questions

Show me how you would decide whether this draft should be approved.
Which part of this report would you read first?
What would you ignore?
What would you ask an engineer to explain?

Risk Validation Questions

If a replay report says 70% executed and 30% surrogate, what decision would you make?
If all important external writes were surrogate-evaluated, is that enough for review?
Which tools should always be blocked in your environment?
What kind of failure would make you disable replay eval?

Note Template

Participant:
Role:
Date:
Last relevant review:
Decision evidence needed:
Confusing report fields:
Risk tolerance:
Must-block tool categories:
Minimum publish evidence:
Unexpected insight:
Follow-up:

Recommended Next 30 Days

Validate replay isolation with a golden tool policy suite.
Run current backend unit tests around skill learning replay and publish gates.
Add a small reviewer decision summary above raw replay details.
Run 5-8 reviewer usability sessions using seeded draft cases.
Audit accepted run coverage for top skills and identify gaps.
Decide v1 gate thresholds for blocked coverage, confidence, and preservation failure.
Add operational logging and metrics for replay status, latency, and provider skips.

Recommended Next 90 Days

Replace or augment deterministic surrogate scoring with structured LLM judgment and human-labeled calibration cases.
Add replay readiness scoring before eval starts.
Improve preservation from section presence to diff-based critical instruction detection.
Add a customer-facing, exportable audit summary for enterprise pilot conversations.
Build a replay operations dashboard.
Introduce deployment-level policy profiles only after default policies produce stable data.
Track skill quality across versions and post-publish regression reports.

Biggest Risks

Risk	Severity	Mitigation
Replay accidentally mutates production state	Critical	Golden policy tests, isolated namespaces, external writes routed to surrogate by default, P0 alert
Surrogate scoring gives false confidence	High	Human-labeled calibration set, show low confidence clearly, no automatic publish
Reviewers ignore reports that are too complex	High	Decision summary, comprehension testing, action-oriented UI copy
Accepted run data is too sparse	High	Readiness score, fallback to explicit skipped/low-evidence state, collect more accepted cases
Publish gates block too many useful skills	Medium	Shadow mode calibration and override with explicit review rationale
Evaluation costs or latency grow quickly	Medium	Cap cases, cache web/search results, track p95 latency, run eval asynchronously in the background

Recommended Immediate Actions

Treat Skill Replay Eval as a v1 trust gate, not a complete benchmark.
Keep human review mandatory for publish.
Do not execute production third-party writes during automatic replay.
Add reviewer-facing explanations before adding more raw report data.
Validate isolation and surrogate quality before broad rollout.
Use the first pilot to learn threshold calibration, not to claim perfect quality measurement.
