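feat: support multilingual prompt localization and UI improvements

- Add a prompt_locale parameter to support prompt localization in Simplified Chinese, Traditional Chinese, and English
- Remove the built-in agents configuration to simplify the system architecture
- Update ContextBuilder to use dynamic prompt templates instead of hard-coded content
- Pass the locale parameter through AgentLoop, the Web interface, and AgentService
- Add an output-language instruction so user-facing content is generated in the specified language
- Extend the frontend LanguageSwitcher component to support three language options
- Improve responsive layout and text truncation handling in the Header and sidebar components
- Update test cases to verify prompt correctness across locales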
2026-06-10 16:11:05 +08:00
parent 9cc3334ea7
commit fc9fd93c36
51 changed files with 7493 additions and 619 deletions


@ -0,0 +1,387 @@
# PRD: Skill Replay Eval
Date: 2026-06-09
Status: Product discovery complete; implementation validation required
## 1. Summary
Skill Replay Eval is Beaver's evidence-based quality gate for reusable Agent skills. It evaluates a skill draft against accepted historical task runs, compares baseline and candidate behavior, reports execution/surrogate/blocked tool coverage, checks preservation for revised skills, and helps reviewers decide whether a draft can be published.
The goal is not to replace human review. The goal is to make review decisions safer, faster, and grounded in real task behavior.
## 2. Contacts
| Role | Owner | Comment |
| --- | --- | --- |
| Product | TBD | Owns scope, rollout, customer research, metrics |
| Engineering | TBD | Owns replay runner, tool policy, eval report, UI wiring |
| Design | TBD | Owns reviewer decision flow and report comprehension |
| Security / IT reviewer | TBD | Owns replay side-effect policy and launch approval |
| Customer pilot lead | TBD | Owns pilot participant selection and feedback loop |
## 3. Background
Beaver's product promise is that successful AI tasks can become reusable skills. This is valuable only if skill publishing is trustworthy. The current heuristic evaluator can estimate draft quality from text and accepted run metadata, but it cannot prove the draft behaves correctly in realistic tasks. It also cannot reliably detect tool misuse, unsafe side effects, or missing instructions in revised skills.
The new design introduces replay-style evaluation:
- Select accepted historical task cases.
- Run a baseline arm and a candidate arm.
- Execute safe tools in a replay context.
- Record unsafe or unavailable tools for surrogate judgment.
- Block destructive actions.
- Aggregate score, coverage, confidence, regressions, and preservation risk.
- Show the report in the Skills review page.
- Use publish gates to prevent low-confidence or unsafe releases.
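
The aggregation step in the list above can be sketched as follows. This is a minimal illustration, not Beaver's implementation: the function name `aggregate_cases`, the pair-based input shape, and the `tolerance` used to decide "unchanged" are all assumptions made for the example.

```python
def aggregate_cases(case_scores, tolerance=0.01):
    """Aggregate per-case (baseline_score, candidate_score) pairs.

    `tolerance` is an assumed dead band: score differences inside it
    count as "unchanged" rather than improved or regressed.
    """
    if not case_scores:
        raise ValueError("at least one replay case is required")
    n = len(case_scores)
    baseline_avg = sum(b for b, _ in case_scores) / n
    candidate_avg = sum(c for _, c in case_scores) / n
    improved = sum(1 for b, c in case_scores if c - b > tolerance)
    regressed = sum(1 for b, c in case_scores if b - c > tolerance)
    return {
        "baseline_score_avg": baseline_avg,
        "candidate_score_avg": candidate_avg,
        "score_delta": candidate_avg - baseline_avg,
        "improved_count": improved,
        "regression_count": regressed,
        "unchanged_count": n - improved - regressed,
    }
```

The output keys reuse the report field names listed later in this PRD, so the result could feed a report object directly.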
Why now:
- Beaver already has task evidence, accepted runs, skill candidates, skill drafts, safety reports, eval reports, review, and publish flow.
- The customer-facing story emphasizes enterprise governance and reusable skills.
- Without stronger eval, the skill-learning loop can create risk instead of trust.
## 4. Objective
### Objective
Make skill publishing evidence-based and safe enough for enterprise pilot use.
### Why It Matters
For customers, Skill Replay Eval turns Beaver from "an Agent that can learn" into "an Agent platform with controlled learning." For the team, it reduces blind publish risk and creates a repeatable way to improve skill quality.
### Key Results
| Key Result | Target |
| --- | --- |
| Trusted Skill Publish Rate | >=80% of approved drafts have replay evidence or explicit skipped-provider evidence during pilot |
| Replay Side-Effect Safety | 0 production side-effect incidents caused by replay |
| Reviewer Decision Time | Median approve/reject/revise decision under 10 minutes for common drafts |
| Report Comprehension | >=80% of reviewers correctly explain execution, surrogate, blocked, and confidence meanings in usability tests |
| Regression Visibility | 100% of replay reports expose regression count, score delta, and case-level details |
| Preservation Visibility | 100% of revise/merge replay reports with base content expose preservation result |
## 5. Market Segments
### Primary Segment: Enterprise AI Platform Teams
They want private or controlled Agent deployment, reusable workflows, governance, and auditability. They need evidence before reusable skills are distributed.
### Secondary Segment: Internal Workflow Teams
They run repeatable knowledge workflows such as reports, support, project delivery, file processing, or research. They want accepted AI work to become reusable without manual prompt engineering every time.
### Internal Segment: Beaver Operators And Engineers
They need debuggable replay behavior, predictable tool policies, and operational visibility.
### Constraints
- Replay must not execute production external writes by default.
- Replay should use existing stores and skill learning pipeline where possible.
- Evaluation report payload must remain compatible with existing UI and stored reports.
- First release should cap replay case count to control latency and cost.
- Human review remains mandatory.
## 6. Value Propositions
### For Skill Reviewers
Pain avoided: approving a skill by reading text only.
Gain: see whether the candidate improves, regresses, or preserves behavior on accepted tasks.
### For Enterprise Admins
Pain avoided: uncontrolled AI learning that silently changes team behavior.
Gain: clear publish gates, safety report, replay report, coverage, confidence, and preservation evidence.
### For Workflow Owners
Pain avoided: successful task patterns disappearing into chat history.
Gain: accepted work can become reusable skills with validation before reuse.
### For Engineers
Pain avoided: debugging vague "skill quality" complaints.
Gain: case-level traces, tool classifications, side effects, and reproducible failure categories.
## 7. Solution
### 7.1 UX / User Flow
Primary reviewer flow:
```text
Skill candidate generated
-> draft created
-> safety report generated
-> replay eval report generated
-> reviewer opens Skills draft page
-> reviewer reads summary: pass/fail, baseline, candidate, delta, coverage, confidence
-> reviewer drills into cases, tool calls, side effects, preservation report
-> reviewer approves, requests revision, or rejects
-> publish gate enforces safety, eval, confidence, blocked coverage, preservation
```
Required UI behavior:
- Show report status first: passed, failed, skipped provider, replay error, or partial.
- Show baseline average, candidate average, and score delta.
- Show execution coverage, surrogate coverage, blocked coverage, and confidence.
- Show improved, regressed, and unchanged case counts.
- Show replay cases in a compact table.
- Show raw case reports only after the summary.
- Show preservation report for revise/merge drafts.
- Use clear wording for skipped-provider reports: state explicitly that no replay was run, so no replay evidence exists.
Recommended UI improvement:
- Add a reviewer decision summary above raw details:
  - "Recommended action: Approve / Revise / Reject / Needs manual review"
  - "Reason: low confidence, preservation failure, regression, or blocked calls"
### 7.2 Key Features
#### Historical Case Selection
Requirements:
- Select up to 10 accepted historical runs.
- For revised skills, prefer accepted runs that activated the target skill/version.
- For new skills, use candidate source runs or similar task themes.
- For merged skills, use accepted runs where related skills co-activated.
- Prefer recent accepted runs and diversify repeated tasks.
Acceptance criteria:
- Case selection returns no more than 10 cases.
- Failed or unaccepted runs are excluded.
- Baseline skill names are populated for revise and merge candidates.
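
The selection rules above can be sketched roughly as follows. This is not the `select_replay_cases()` implementation; the `RunRecord` shape, its field names, and the "one case per distinct task text" diversification rule are assumptions chosen to keep the example small.

```python
from dataclasses import dataclass

MAX_REPLAY_CASES = 10  # cap stated in the requirements above


@dataclass(frozen=True)
class RunRecord:
    # Hypothetical shape of a historical run; field names are assumptions.
    run_id: str
    task_text: str
    accepted: bool
    finished_at: float  # Unix timestamp
    activated_skills: tuple = ()


def select_cases(runs, target_skill=None, limit=MAX_REPLAY_CASES):
    """Pick accepted runs, newest first, at most one per distinct task text."""
    accepted = [r for r in runs if r.accepted]  # failed/unaccepted runs are excluded
    if target_skill is not None:
        # For revised skills, prefer runs that activated the target skill.
        preferred = [r for r in accepted if target_skill in r.activated_skills]
        others = [r for r in accepted if target_skill not in r.activated_skills]
    else:
        preferred, others = accepted, []
    ordered = sorted(preferred, key=lambda r: r.finished_at, reverse=True)
    ordered += sorted(others, key=lambda r: r.finished_at, reverse=True)

    seen, selected = set(), []
    for run in ordered:
        key = run.task_text.strip().lower()
        if key in seen:
            continue  # diversify repeated tasks
        seen.add(key)
        selected.append(run)
        if len(selected) == limit:
            break
    return selected
```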
#### Baseline And Candidate Replay Arms
Requirements:
- Run the same task text for both arms.
- Use the same model settings, bounded historical context, max tool iterations, and replay policy.
- Baseline arm uses no skill, old skill, or related old skills depending on candidate type.
- Candidate arm injects the draft as pinned draft guidance.
Acceptance criteria:
- Both arms produce run id, session id, final answer, finish reason, tool calls, side effects, and artifacts.
- Replay runs are marked with source `skill_replay_eval`.
- Replay does not create user-visible normal task sessions.
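
One way to keep the two arms comparable is to build both from a single shared configuration. The sketch below is illustrative only: `ArmConfig`, `build_arms`, the candidate type strings `new`/`revise`/`merge`, and the choice that the candidate arm drops the baseline skills in favor of the pinned draft are assumptions, not confirmed behavior of `ReplayRunner`.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ArmConfig:
    # Hypothetical per-arm settings; both arms share everything except skills/draft.
    task_text: str
    model: str
    max_tool_iterations: int
    skills: tuple = ()
    pinned_draft: Optional[str] = None
    source: str = "skill_replay_eval"  # replay source marker from the criteria above


def build_arms(task_text, model, max_tool_iterations, candidate_type,
               draft_text, target_skill=None, related_skills=()):
    """Return (baseline_arm, candidate_arm) with identical task and model settings."""
    if candidate_type == "new":
        baseline_skills = ()  # baseline runs with no skill
    elif candidate_type == "revise":
        if target_skill is None:
            raise ValueError("revise candidates need the old skill name")
        baseline_skills = (target_skill,)  # baseline runs with the old skill
    elif candidate_type == "merge":
        baseline_skills = tuple(related_skills)  # baseline runs with related old skills
    else:
        raise ValueError(f"unknown candidate type: {candidate_type}")

    baseline = ArmConfig(task_text, model, max_tool_iterations, skills=baseline_skills)
    # Assumption: the draft replaces the baseline skills in the candidate arm.
    candidate = replace(baseline, skills=(), pinned_draft=draft_text)
    return baseline, candidate
```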
#### Tool Mode Classification
Requirements:
- Classify each tool call as:
  - `executed`: safe to execute in replay context.
  - `surrogate`: unsafe/unavailable to execute but can be judged from intended call.
  - `blocked`: cannot safely execute or judge.
- Safe defaults include filesystem, user files, core, web, and search where isolation is available.
- External writes and connector/MCP write actions default to surrogate.
- Destructive operations default to blocked.
Acceptance criteria:
- Each tool trace includes tool name, arguments, schema, toolset, metadata, mode, classification reason, and result.
- Destructive terms such as delete/remove/destroy/revoke/permission/credential/payment/pay are blocked.
- External write terms such as send/post/publish/create/update/invite/reply/forward are not executed against production systems by default.
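
A keyword-based classifier along these lines might look like the sketch below. The term lists come from the acceptance criteria above; the toolset identifiers, the token-splitting rule, and the "unknown tools default to surrogate" fallback are assumptions and do not describe the actual `ReplayToolPolicy`.

```python
import re

# Term lists taken from the acceptance criteria above.
DESTRUCTIVE_TERMS = frozenset(
    {"delete", "remove", "destroy", "revoke", "permission", "credential", "payment", "pay"}
)
EXTERNAL_WRITE_TERMS = frozenset(
    {"send", "post", "publish", "create", "update", "invite", "reply", "forward"}
)
# Toolset identifiers are assumptions for this sketch.
SAFE_TOOLSETS = frozenset({"filesystem", "user_files", "core", "web", "search"})


def classify_tool_call(tool_name, toolset, isolated=True):
    """Return (mode, classification_reason) for one intended tool call."""
    # Split on non-letters, so "send_email" -> {"send", "email"}.
    # camelCase names are not split; a real policy would need to handle them.
    tokens = {t for t in re.split(r"[^a-z]+", tool_name.lower()) if t}

    destructive = tokens & DESTRUCTIVE_TERMS
    if destructive:
        # Destructive terms win even inside otherwise safe toolsets.
        return "blocked", f"destructive term: {sorted(destructive)[0]}"
    if toolset in SAFE_TOOLSETS:
        if isolated:
            return "executed", f"safe toolset: {toolset}"
        return "surrogate", f"no isolation for toolset: {toolset}"
    writes = tokens & EXTERNAL_WRITE_TERMS
    if writes:
        return "surrogate", f"external write term: {sorted(writes)[0]}"
    # Assumed conservative default: never execute what the policy does not recognize.
    return "surrogate", "unclassified tool outside safe toolsets"
```

Checking destructive terms before the toolset check means a delete inside the filesystem toolset is still blocked, which matches the intent that destructive operations default to `blocked`.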
#### Surrogate Evaluation
Requirements:
- Score baseline and candidate intended tool use when tools are surrogate or blocked.
- Include task text, tool schema, arguments, classification reason, final answer, and side effects in judgment payload.
- Lower confidence when surrogate or blocked coverage is high.
Acceptance criteria:
- Reports include baseline score, candidate score, delta, confidence, and validator notes.
- Blocked calls reduce score and confidence.
- Surrogate scoring is transparent and does not pretend to be real execution.
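
The requirement that confidence drops as surrogate or blocked coverage rises could be expressed as a simple mapping. The cut-offs below (0.2 and 0.5) are placeholders invented for this sketch; the exact levels are still listed as an open question in this PRD.

```python
def replay_confidence(surrogate_coverage, blocked_coverage):
    """Map coverage shares (0.0 to 1.0) to a low/medium/high confidence label.

    Thresholds are illustrative placeholders, not calibrated values.
    """
    non_executed = surrogate_coverage + blocked_coverage
    if non_executed <= 0.2 and blocked_coverage == 0:
        return "high"  # almost everything actually executed, nothing blocked
    if non_executed <= 0.5:
        return "medium"
    return "low"  # most evidence is intended-call judgment, not execution
```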
#### Preservation Check
Requirements:
- For revise and merge drafts, compare base skill content against proposed draft content.
- Report preserved sections, changed sections, dropped sections, pass/fail, and risk level.
- Failed preservation blocks publish.
Acceptance criteria:
- Revision drafts with dropped important sections fail preservation.
- Reports are visible in the Skills UI.
- Publish gate blocks failed preservation.
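
A section-level preservation check can be approximated by comparing Markdown headings between the base skill and the draft. The sketch below assumes skills are Markdown documents and that "important" sections are identified by title; both are assumptions, as is the risk-level mapping.

```python
import re


def split_sections(markdown):
    """Map each heading title to its body text; text before the first heading is ignored."""
    sections, title, body = {}, None, []
    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.*\S)\s*$", line)
        if match:
            if title is not None:
                sections[title] = "\n".join(body).strip()
            title, body = match.group(1), []
        elif title is not None:
            body.append(line)
    if title is not None:
        sections[title] = "\n".join(body).strip()
    return sections


def check_preservation(base, draft, important=("Safety", "Constraints")):
    """Compare base and draft by section title; `important` titles are an assumption."""
    base_s, draft_s = split_sections(base), split_sections(draft)
    preserved = sorted(t for t in base_s if draft_s.get(t) == base_s[t])
    changed = sorted(t for t in base_s if t in draft_s and draft_s[t] != base_s[t])
    dropped = sorted(t for t in base_s if t not in draft_s)
    dropped_important = [t for t in dropped if t in important]
    return {
        "preserved_sections": preserved,
        "changed_sections": changed,
        "dropped_sections": dropped,
        "passed": not dropped_important,
        "risk_level": "high" if dropped_important else ("medium" if dropped else "low"),
    }
```

A title-based comparison will miss instructions that were moved or reworded inside a section that still exists, so it is best treated as a first filter before human review.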
#### Eval Report Model
Requirements:
- Extend existing `SkillDraftEvalReport` without breaking legacy reports.
- Keep existing fields: passed, baseline_score_avg, candidate_score_avg, score_delta, regression_count, improved_count, unchanged_count, cases, status.
- Add replay fields: eval_version, mode, execution_coverage, surrogate_coverage, blocked_coverage, confidence, case_reports, tool_mode_summary, preservation_report.
Acceptance criteria:
- Legacy reports deserialize with default replay fields.
- New reports serialize all replay fields.
- Frontend type definitions include replay fields.
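
Backward-compatible deserialization can be achieved by giving every new replay field a default. The dataclass below is a sketch named `EvalReportSketch` to avoid implying it is the real `SkillDraftEvalReport`; the default values (for example `heuristic-v1` and `low`) and the `from_dict` helper are assumptions.

```python
from dataclasses import asdict, dataclass, field, fields
from typing import Optional


@dataclass
class EvalReportSketch:
    # Existing fields kept from the legacy report.
    passed: bool
    baseline_score_avg: float
    candidate_score_avg: float
    score_delta: float
    regression_count: int = 0
    improved_count: int = 0
    unchanged_count: int = 0
    cases: list = field(default_factory=list)
    status: str = "completed"  # assumed default
    # New replay fields; defaults are placeholders so legacy payloads still load.
    eval_version: str = "heuristic-v1"
    mode: str = "heuristic"
    execution_coverage: float = 0.0
    surrogate_coverage: float = 0.0
    blocked_coverage: float = 0.0
    confidence: str = "low"
    case_reports: list = field(default_factory=list)
    tool_mode_summary: dict = field(default_factory=dict)
    preservation_report: Optional[dict] = None

    @classmethod
    def from_dict(cls, payload):
        """Build a report from stored JSON, ignoring unknown keys."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in payload.items() if k in known})
```

Round-tripping through `asdict()` and `from_dict()` then covers both acceptance criteria: legacy payloads load with default replay fields, and new reports serialize every replay field.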
#### Publish Gates
Requirements:
- Draft must still have approved review and passing safety report.
- Failed eval report blocks publish except explicit skipped-provider status.
- Replay report with low confidence blocks publish.
- Replay report with blocked coverage >=1.0 blocks publish.
- Failed preservation blocks publish.
Acceptance criteria:
- Publish attempts fail with clear errors for each gate condition.
- Skipped provider is visible and does not silently claim replay passed.
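
The gate conditions above can be collected into a single check that returns one error per failed gate. This is a sketch under assumptions: the function name, the dict-based report shape, and the error wording are invented here; only the gate conditions and the `skipped_provider_unavailable` status come from these documents.

```python
def publish_gate_errors(review_approved, safety_passed, report):
    """Return a list of human-readable gate failures; an empty list means publish may proceed."""
    errors = []
    if not review_approved:
        errors.append("approved review is required")
    if not safety_passed:
        errors.append("passing safety report is required")

    skipped = report.get("status") == "skipped_provider_unavailable"
    if not report.get("passed", False) and not skipped:
        errors.append("eval report failed")

    if report.get("mode") == "replay":
        if report.get("confidence") == "low":
            errors.append("replay confidence is low")
        if report.get("blocked_coverage", 0.0) >= 1.0:
            errors.append("all replay tool calls were blocked")

    preservation = report.get("preservation_report")
    if preservation is not None and not preservation.get("passed", False):
        errors.append("preservation check failed")
    return errors
```

Returning every failure, instead of stopping at the first, gives reviewers the full list of blockers in one publish attempt.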
### 7.3 Technology
Backend:
- Python dataclasses.
- Existing file-backed memory stores.
- `SkillLearningPipelineService.evaluate_draft()`.
- `SkillDraftEvaluator`.
- `ReplayRunner`, `ReplayToolExecutor`, `ReplayToolPolicy`.
- `SurrogateToolEvaluator`.
- `SkillDraftEvalReport`.
- FastAPI endpoint wiring through existing Skills APIs.
Frontend:
- Next.js / TypeScript Skills page.
- Existing design system and report card patterns.
- Typed replay report fields in `types/index.ts`.
Testing:
- Unit tests for eval report compatibility.
- Case selection tests.
- Preservation tests.
- Replay executor and replay runner tests.
- Agent loop replay executor override tests.
- Surrogate scoring tests.
- Pipeline publish gate tests.
- Frontend smoke/manual review for report rendering.
### 7.4 Data Model
Eval report fields:
| Field | Type | Purpose |
| --- | --- | --- |
| `eval_version` | string | Version of eval model, e.g. `replay-v1` |
| `mode` | string | `heuristic` or `replay` |
| `execution_coverage` | number | Share of replay tool calls actually executed |
| `surrogate_coverage` | number | Share judged through surrogate |
| `blocked_coverage` | number | Share blocked |
| `confidence` | string | low, medium, high |
| `case_reports` | array | Detailed baseline/candidate case reports |
| `tool_mode_summary` | object | Aggregate tool mode counts |
| `preservation_report` | object/null | Preservation result for revise/merge |
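
The three coverage fields are shares of the same total, so they can be derived from the per-call tool modes in one pass. The helper below is a sketch; its name and the decision to report `0.0` for all shares when a replay made no tool calls are assumptions.

```python
from collections import Counter

MODES = ("executed", "surrogate", "blocked")


def summarize_tool_modes(modes):
    """Turn a sequence of per-call modes into counts and coverage shares."""
    counts = Counter(modes)
    total = sum(counts[m] for m in MODES)
    # Assumption: a replay with no tool calls reports zero coverage everywhere.
    share = {m: (counts[m] / total if total else 0.0) for m in MODES}
    return {
        "tool_mode_summary": {m: counts[m] for m in MODES},
        "execution_coverage": share["executed"],
        "surrogate_coverage": share["surrogate"],
        "blocked_coverage": share["blocked"],
    }
```

For any replay with at least one tool call, the three shares sum to 1.0, which makes `blocked_coverage >= 1.0` equivalent to "every call was blocked".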
### 7.5 Assumptions
- Accepted historical runs exist and are useful.
- Replay can be isolated enough for safe tool execution.
- Reviewers understand and trust the report after UI iteration.
- Surrogate scoring can be improved over time without blocking v1.
- Publish gates can be calibrated during pilot.
### 7.6 Non-Goals
- No production third-party writes during automatic replay.
- No automatic publishing based only on replay score.
- No full Docker orchestration per replay case in v1.
- No customer-configurable per-tool policy UI in v1.
- No replacement of human review.
- No claim that replay is a complete benchmark of all future tasks.
## 8. Release
### V0: Internal Validation
Scope:
- Current replay report fields.
- Current case selection.
- Current replay runner integration.
- Current tool policy.
- Current Skills UI report display.
- Current publish gates.
Exit criteria:
- Unit tests pass for skill learning replay surface.
- Golden tool policy tests prove no production side effects.
- Reviewer can make decisions from 5 seeded cases.
- Known limitations are documented.
### V1: Pilot Release
Scope:
- Reviewer decision summary.
- Replay readiness indicator.
- Better preservation diff.
- Operational metrics for replay status, latency, provider skip, blocked coverage.
- Customer-facing explanation for replay evidence and confidence.
Exit criteria:
- 0 replay side-effect incidents.
- >=80% reviewer comprehension in usability test.
- Median reviewer decision time under 10 minutes.
- Pilot admins accept report as sufficient review support.
### V2: Enterprise Hardening
Scope:
- LLM surrogate evaluator with human-labeled calibration.
- Policy profiles by deployment risk tier.
- Audit export.
- Skill quality trend across versions.
- Replay operations dashboard.
Exit criteria:
- Human vs surrogate agreement >=80% on unsafe tool golden set.
- Clear process for policy changes and incident review.
- Enterprise pilot customers can use audit export in governance review.
## Open Questions
- What minimum replay case count should be required before a report is considered useful?
- Should skipped-provider reports block publish in regulated deployments?
- What exact confidence levels should map to publish gate behavior?
- Which toolsets are safe in each deployment mode?
- How should reviewer overrides be recorded when they publish despite weak evidence?
- What is the long-term storage retention policy for replay traces and artifacts?
## Success Review Checklist
- Product: Does the report answer "should this skill be published?"
- Design: Can reviewers understand the summary without reading raw JSON?
- Engineering: Can replay failures be reproduced and diagnosed?
- Security: Can replay prove no production side effects by default?
- Customer: Does this strengthen Beaver's enterprise trust story?


@ -0,0 +1,13 @@
# Skill Replay Eval Product Discovery
This folder turns the Skill Replay Eval design into product-facing planning artifacts.
- [Product Discovery Report](./product-discovery-report.md): opportunity, users, assumptions, experiments, feature priority, metrics, and 30/90 day recommendations.
- [PRD](./PRD-skill-replay-eval.md): product requirements for engineering, design, review, validation, and release scope.
- [Launch And Maintenance Runbook](./launch-maintenance-runbook.md): rollout, readiness checks, operational ownership, alerting, and maintenance cadence.
Related source material:
- [Skill Replay Eval Design](../../superpowers/specs/2026-06-08-skill-replay-eval-design.md)
- [Skill Replay Eval Implementation Plan](../../superpowers/plans/2026-06-08-skill-replay-eval.md)
- [Beaver customer presentation](../../presentations/skill-replay-eval/index.html)


@ -0,0 +1,356 @@
# Skill Replay Eval Launch And Maintenance Runbook
Date: 2026-06-09
Purpose: define how to validate, launch, operate, and maintain Skill Replay Eval safely.
## 1. Launch Principle
Ship Skill Replay Eval as a guarded trust feature.
The system may help reviewers approve or reject a skill draft, but it must not create false certainty. When evidence is weak, the product should say so clearly. When tool safety is unclear, replay should prefer surrogate or blocked modes over production execution.
## 2. Ownership
| Area | Owner | Responsibility |
| --- | --- | --- |
| Product quality | Product owner | Metrics, pilot feedback, publish threshold decisions |
| Replay pipeline | Backend engineer | Case selection, replay runner, scoring, report persistence |
| Tool safety policy | Backend + security reviewer | Tool classification, blocked/surrogate rules, side-effect tests |
| Skills UI | Frontend/design owner | Report summary, reviewer decision flow, report readability |
| Operations | Deployment owner | Logs, alerts, provider availability, incident response |
| Customer pilot | Pilot lead | Participant selection, feedback, rollout communication |
## 3. Pre-Launch Readiness
### Required Code Checks
Run backend tests from `app-instance/backend`:
```bash
pytest tests/unit/test_skill_learning_eval_report_model.py -v
pytest tests/unit/test_skill_learning_case_selection.py -v
pytest tests/unit/test_skill_learning_preservation.py -v
pytest tests/unit/test_skill_learning_replay.py -v
pytest tests/unit/test_skill_learning_replay_runner.py -v
pytest tests/unit/test_agent_loop_replay_executor.py -v
pytest tests/unit/test_skill_learning_surrogate.py -v
pytest tests/unit/test_skill_learning_eval.py -v
pytest tests/unit/test_skill_learning_pipeline.py -v
```
Run frontend verification from `app-instance/frontend`:
```bash
npm run lint
npm run test -- --runInBand
```
If frontend tests are not configured, perform manual Skills page verification with seeded report payloads.
### Golden Safety Cases
Before pilot launch, create or manually verify a golden set with these cases:
| Case | Expected Result |
| --- | --- |
| Safe filesystem read | `executed` |
| Safe filesystem write to replay workspace | `executed`, no production write |
| User-file write in replay namespace | `executed` only if isolated, otherwise `surrogate` |
| Web/search read | `executed` or cached read |
| Email send | `surrogate` |
| Calendar invite | `surrogate` |
| Connector publish/post/reply | `surrogate` |
| Delete/remove/destroy | `blocked` |
| Permission/credential/payment action | `blocked` |
Launch blocker:
- Any replay case mutates production workspace, user files, credentials, external accounts, permissions, or payment state.
### Report Readiness Checks
Each replay report must show:
- Eval status.
- Baseline average.
- Candidate average.
- Score delta.
- Improved/regressed/unchanged counts.
- Execution coverage.
- Surrogate coverage.
- Blocked coverage.
- Confidence.
- Replay cases.
- Case reports.
- Preservation report when applicable.
- Raw report for debugging.
### Publish Gate Checks
Publish must fail when:
- No approved review exists.
- Safety report is missing or failed.
- Eval report failed, except explicit skipped-provider status.
- Replay confidence is low.
- Replay blocked coverage is `1.0`.
- Preservation report failed.
Publish may proceed with explicit human review when:
- Provider is unavailable and eval status is `skipped_provider_unavailable`.
- Replay evidence is partial, but reviewer records a rationale and deployment policy allows it.
## 4. Rollout Plan
### Phase 0: Shadow Mode
Audience: internal team only.
Duration: 1 week or 10 draft evaluations, whichever comes first.
Behavior:
- Generate replay reports.
- Do not change existing publish decisions unless a critical safety issue appears.
- Compare replay recommendation with human reviewer decision.
Exit criteria:
- No production side effects.
- No unexplained replay crashes on common drafts.
- Reviewers can explain report meaning.
- Product owner reviews gate threshold data.
### Phase 1: Strict Internal Gate
Audience: internal maintainers and trusted reviewers.
Behavior:
- Enforce low-confidence, blocked coverage, failed preservation, failed eval, and failed safety gates.
- Require manual rationale for skipped-provider publish.
Exit criteria:
- 0 P0 incidents.
- Publish blockers are actionable and not noisy.
- Reviewer median decision time under 10 minutes for common drafts.
### Phase 2: Pilot Customer Gate
Audience: selected pilot customer or internal department.
Behavior:
- Keep human review mandatory.
- Provide customer-facing explanation of replay evidence.
- Track skipped-provider and low-confidence cases closely.
Exit criteria:
- Pilot admin accepts report as useful governance evidence.
- No side-effect incidents.
- Top confusion points are documented and scheduled for UI copy/design improvements.
### Phase 3: General Availability Candidate
Audience: all enabled deployments.
Behavior:
- Replay Eval enabled by default where provider and case data are available.
- Skipped-provider state remains explicit.
- Tool policy remains conservative.
Exit criteria:
- Operational dashboard exists.
- Incident response is rehearsed.
- Policy change process is documented.
## 5. Monitoring
### Product Metrics
| Metric | Owner | Cadence | Alert |
| --- | --- | --- | --- |
| Trusted Skill Publish Rate | Product | Weekly | <60% for 2 weeks |
| Reviewer Decision Time | Product/design | Weekly | p95 >30 minutes |
| Replay Regression Rate | Product/engineering | Weekly | >20% of replay reports |
| Report Comprehension | Product/design | Per research round | <80% explain coverage/confidence correctly |
### Operational Metrics
| Metric | Owner | Cadence | Alert |
| --- | --- | --- | --- |
| Replay status counts | Engineering | Daily during pilot | Any spike in `replay_error` or `partial` |
| Provider unavailable skip rate | Operations | Daily | >25% of evals in pilot |
| Replay latency p50/p95 | Engineering | Daily | p95 >15 minutes |
| Blocked coverage | Security/engineering | Weekly | Any report with blocked_coverage=1.0 |
| Production side-effect incidents | Security/operations | Immediate | Any nonzero event |
| Failed preservation reports | Product/engineering | Weekly | Spike after synthesizer change |
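
Two of the alert rules in the table are numeric and easy to automate. The sketch below checks the p95 latency and provider skip rate thresholds; the function names, the nearest-rank percentile method, and the input shapes are assumptions, while the 15-minute and 25% thresholds and the status string come from this runbook.

```python
def p95(values):
    """Nearest-rank 95th percentile; expects a non-empty sequence."""
    ordered = sorted(values)
    rank = -(-95 * len(ordered) // 100)  # ceil(0.95 * n) using integer arithmetic
    return ordered[rank - 1]


def operational_alerts(latencies_minutes, statuses):
    """Return alert messages for the latency and provider-skip thresholds."""
    alerts = []
    if latencies_minutes and p95(latencies_minutes) > 15:
        alerts.append("replay latency p95 above 15 minutes")
    if statuses:
        skipped = statuses.count("skipped_provider_unavailable")
        if skipped / len(statuses) > 0.25:
            alerts.append("provider unavailable skip rate above 25%")
    return alerts
```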
### Logs To Inspect
- Skill learning candidate events.
- Draft creation and safety report events.
- Eval report generation events.
- Replay arm run ids and source `skill_replay_eval`.
- Tool traces and classification reasons.
- Publish gate errors.
- Provider unavailable errors.
## 6. Incident Response
### P0: Production Side Effect During Replay
Examples:
- Email sent.
- Calendar invite created.
- External connector publish/post/reply happened.
- Production file or credential changed.
- Permission/payment action executed.
Immediate actions:
1. Disable replay eval generation.
2. Disable skill publish if policy risk is unclear.
3. Preserve logs, replay traces, eval reports, and affected tool metadata.
4. Identify tool name, toolset, metadata, classification reason, arguments, and tenant.
5. Patch the policy so the affected tool class is blocked or handled as surrogate.
6. Add a regression test to golden safety cases.
7. Notify pilot/customer owner if customer data or systems were affected.
Restart criteria:
- Root cause documented.
- Regression test passes.
- Security owner approves restart.
### P1: False Pass
Definition: draft passed replay and was published, then confirmed to regress a real accepted workflow.
Actions:
1. Unpublish or revert skill version if impact is active.
2. Add the failed task as a replay case.
3. Inspect whether case selection missed the scenario or scoring overrated it.
4. Adjust gate threshold, surrogate scoring, or preservation check.
5. Record postmortem in skill quality log.
### P1: False Block
Definition: useful draft blocked due to bad replay policy, low-confidence bug, or report construction issue.
Actions:
1. Do not bypass silently; record reviewer rationale.
2. Identify blocking rule and trace.
3. Add regression test if policy bug.
4. Decide whether threshold should change or case should remain blocked.
### P2: Provider Unavailable Spike
Actions:
1. Check provider configuration and model availability.
2. Confirm fallback status is explicit.
3. Track how many publish decisions rely on skipped-provider.
4. Pause broad rollout if skipped-provider exceeds pilot threshold.
## 7. Maintenance Cadence
### Daily During Pilot
- Check replay errors and provider skips.
- Check blocked_coverage=1.0 reports.
- Confirm no side-effect incidents.
- Review new publish gate failures.
### Weekly
- Review metrics dashboard.
- Calibrate publish gate thresholds.
- Review 3-5 replay reports for readability.
- Inspect false pass/false block candidates.
- Update tool policy based on new tools or connectors.
### Monthly
- Review customer/pilot feedback.
- Refresh golden safety cases.
- Sample preservation reports for missed instruction drops.
- Review storage growth from replay case reports and traces.
- Decide whether to promote features from Should Have to Must Have.
### Quarterly
- Revisit risk model and tool policy profiles.
- Review whether LLM surrogate calibration meets quality target.
- Decide whether to add audit export or per-deployment policy UI.
- Retire stale replay cases or update case selection logic.
## 8. Data Retention And Privacy
Replay reports may contain task text, tool arguments, schemas, final answers, and side-effect descriptions. Treat them as sensitive operational data.
Recommended policy:
- Store summarized report for normal review.
- Limit raw case report retention or restrict access to admins.
- Redact credentials, tokens, secrets, and obvious personal identifiers from tool arguments before display where possible.
- Do not store production external write results; replay should never execute such writes, so none should exist.
- Define tenant-specific retention before enterprise rollout.
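
Redaction before display can start with a recursive pass over tool arguments. The sketch below is illustrative: the key patterns, the email regex, and the placeholder strings are assumptions, and a pattern-based pass like this will not catch every secret or personal identifier.

```python
import re

# Assumed patterns; extend them to match the deployment's own secret naming.
SENSITIVE_KEY = re.compile(
    r"(password|passwd|secret|token|api[_-]?key|credential|authorization)",
    re.IGNORECASE,
)
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(value, key=""):
    """Return a redacted copy of tool arguments; the input is not modified."""
    if SENSITIVE_KEY.search(key):
        return "[REDACTED]"  # drop the whole value under a sensitive key
    if isinstance(value, dict):
        return {k: redact(v, str(k)) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v, key) for v in value]
    if isinstance(value, str):
        return EMAIL.sub("[EMAIL]", value)
    return value
```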
## 9. Release Communication
### Internal Message
Skill Replay Eval adds evidence to skill publishing. Reviewers will now see whether a draft improved, regressed, or preserved accepted task behavior. Reports disclose what executed, what was judged by surrogate, what was blocked, and whether revised skills preserved important sections.
### Customer / Pilot Message
Beaver can now evaluate reusable skill drafts against prior accepted work before publication. The report shows both confidence and uncertainty. Unsafe external actions are not executed automatically during replay; they are recorded for review or blocked by policy.
### Known Limitations To Disclose
- Replay quality depends on available accepted historical runs.
- Surrogate evaluation is not the same as real execution.
- Low-confidence reports require more human review.
- Human approval is still required.
- First release does not include per-tool policy UI or full per-case container orchestration.
## 10. Rollback Plan
Rollback options:
1. Disable replay runner injection and fall back to heuristic eval.
2. Keep report fields but set mode to `heuristic`.
3. Keep publish gate requiring safety and human review.
4. Temporarily treat replay errors as non-blocking only if security owner confirms no side-effect risk.
5. Preserve failed replay reports for debugging.
Rollback triggers:
- Any P0 side-effect incident.
- Repeated replay errors that block normal skill review.
- Provider unavailable spike that makes most reports skipped.
- Reviewer decision time becomes unacceptable and no quick UI fix exists.
## 11. Launch Checklist
- [ ] Backend replay tests pass.
- [ ] Frontend report rendering verified.
- [ ] Golden tool safety cases pass.
- [ ] No production side-effect path found.
- [ ] Publish gates tested manually.
- [ ] Skipped-provider copy is clear.
- [ ] Reviewer decision summary exists or is tracked as a launch follow-up.
- [ ] Pilot participants selected.
- [ ] Metrics dashboard owner assigned.
- [ ] Incident owner and escalation path assigned.
- [ ] Rollback path verified.


@ -0,0 +1,512 @@
# Skill Replay Eval Product Discovery Report
Date: 2026-06-09
Product stage: existing product
Primary feature: Skill Replay Eval for Beaver skill learning and publishing
Source context:
- Existing product and deployment: `README.md`, `部署指南.md`
- Feature design: `docs/superpowers/specs/2026-06-08-skill-replay-eval-design.md`
- Delivery plan: `docs/superpowers/plans/2026-06-08-skill-replay-eval.md`
- Current implementation signals: `beaver/skills/learning/{case_selection,preservation,replay,surrogate,eval}.py`, Skills page replay report UI, publish gate checks
- Customer positioning: `docs/presentations/skill-replay-eval/index.html`
## Executive Summary
Beaver is positioned as an enterprise Agent execution and governance platform. Its core value is not only running tasks, but also making AI work traceable, acceptable, reusable, and governable. Skill Replay Eval is the quality gate that makes the "reusable skill" promise credible: before a skill draft is published, Beaver should test whether it improves or preserves real task behavior.
The current design correctly identifies the product risk: heuristic-only skill scoring is not enough for enterprise trust. A draft skill can look complete in text while causing tool misuse, dropping safety instructions, or regressing accepted workflows. Replay evaluation closes this gap by comparing baseline and candidate behavior on accepted historical tasks, classifying tool calls into executed, surrogate, or blocked modes, and adding a preservation check for revised skills.
The product direction should be: ship replay eval as a staged trust feature, not as a perfect benchmark system. The first release should make evaluation coverage and uncertainty visible, block obvious regressions, and give reviewers enough evidence to approve or reject drafts. The next releases should improve case quality, sandbox isolation, surrogate judgment quality, and operational dashboards.
## Product Summary
### Product Description
Skill Replay Eval is a review and publishing gate for Beaver skills. It evaluates skill drafts against prior accepted task runs and shows whether the draft improves, preserves, or harms real task outcomes. It separates safe tool execution from surrogate evaluation for unsafe or unavailable tools, and it checks whether revised skill drafts preserve important original instructions.
### Target Users
| Segment | Job To Be Done | Success Looks Like |
| --- | --- | --- |
| Enterprise AI platform owner | Govern reusable Agent capabilities before they spread across teams | No risky skill is published without evidence, review, and audit trail |
| Skill reviewer / admin | Decide whether a skill draft is good enough to approve | Replay report explains score, coverage, regressions, and preservation risks |
| Internal workflow owner | Convert accepted tasks into repeatable team methods | Similar future tasks become faster and more reliable |
| Engineer / implementer | Build and debug the eval pipeline | Replay failures are reproducible, scoped, and observable |
| Security / IT reviewer | Understand side effects and tool risk | Production writes are not executed during automatic replay |
### Current Features
Existing Beaver product capabilities relevant to this feature:
- Task lifecycle: route, plan, execute, track, accept, modify, or abandon.
- Evidence and timeline: tool calls, artifacts, task status, and validation signals.
- Skill learning: candidates, drafts, safety report, eval report, review, publish.
- Multi-instance deployment: isolated `app-instance` per user/team via Docker.
- Tool and connector framework: local tools, MCP tools, external connectors, files, web/search, scheduled tasks.
Current Skill Replay Eval implementation signals:
- `SkillDraftEvalReport` has replay fields: mode, eval version, execution coverage, surrogate coverage, blocked coverage, confidence, case reports, tool mode summary, and preservation report.
- `select_replay_cases()` selects up to 10 accepted historical runs by candidate type.
- `ReplayToolExecutor` classifies tool calls as executed, surrogate, or blocked.
- `ReplayRunner` runs baseline and candidate arms through AgentLoop with a replay tool executor.
- `SurrogateToolEvaluator` scores non-executed calls through deterministic intended-call heuristics.
- Publish gates block low-confidence replay reports, fully blocked replay reports, and failed preservation reports.
- Skills UI exposes execution coverage, surrogate coverage, confidence, replay cases, raw case reports, and preservation reports.
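The three publish-gate conditions listed above can be sketched as a single check. This is a minimal illustration, not Beaver's implementation: `EvalReportSummary`, its field names, and the `"low"` confidence label are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvalReportSummary:
    """Hypothetical, trimmed-down view of a replay eval report."""
    confidence: str            # assumed labels: "low", "medium", "high"
    blocked_coverage: float    # fraction of replay tool calls that were blocked
    preservation_passed: bool  # True when preservation passed or does not apply


def publish_gate_failures(report: EvalReportSummary) -> List[str]:
    """Return the reasons a draft cannot be published; an empty list means the gate passes."""
    failures = []
    if report.confidence == "low":
        failures.append("low_confidence")
    if report.blocked_coverage >= 1.0:
        failures.append("fully_blocked")
    if not report.preservation_passed:
        failures.append("preservation_failed")
    return failures
```

Returning every failed reason, rather than stopping at the first, gives the reviewer the full picture in one pass.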
### Current Architecture
```text
Accepted task runs
-> SkillLearningCandidate
-> SkillDraft
-> case selection
-> baseline arm and candidate arm
-> replay tool executor
-> executed tools for safe toolsets
-> surrogate traces for external writes or unsafe integrations
-> blocked traces for destructive calls
-> surrogate scoring and coverage aggregation
-> preservation checker for revise/merge
-> SkillDraftEvalReport
-> Skills review UI
-> publish gate
```
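The executed / surrogate / blocked split in the diagram could look roughly like the sketch below. The toolset names, the action names, and the `classify_tool_call` function are hypothetical; the real `ReplayToolExecutor` may classify on different inputs.

```python
from enum import Enum


class ToolMode(Enum):
    EXECUTED = "executed"
    SURROGATE = "surrogate"
    BLOCKED = "blocked"


# Hypothetical policy inputs; a real classifier may use toolset, transport, or metadata.
SAFE_TOOLSETS = {"filesystem_sandbox", "web_search"}
DESTRUCTIVE_ACTIONS = {"delete", "drop", "revoke"}


def classify_tool_call(toolset: str, action: str) -> ToolMode:
    """Classify one replay tool call.

    Destructive actions are blocked first, even inside a safe toolset.
    Anything not explicitly safe falls back to surrogate, never to execution.
    """
    if action in DESTRUCTIVE_ACTIONS:
        return ToolMode.BLOCKED
    if toolset in SAFE_TOOLSETS:
        return ToolMode.EXECUTED
    return ToolMode.SURROGATE
```

Defaulting unknown tools to surrogate keeps the trust boundary conservative: a newly added connector cannot execute during replay until someone classifies it as safe.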
Product boundary:
- Replay Eval should evaluate skill behavior, not replace human review.
- Replay Eval should never write to production workspace, user files, external accounts, third-party systems, credentials, permissions, or payments by default.
- Low confidence should increase review burden instead of creating false certainty.
### Current Value Proposition
For enterprise users, Beaver can say: "Accepted work can become reusable skills, and those skills are checked against real task behavior before they are published." This directly supports Beaver's larger promise of controlled, traceable, reusable Agent execution.
### Current Challenges
| Challenge | Product Impact | Current Risk |
| --- | --- | --- |
| Historical accepted runs may be sparse or low quality | Replay evidence can be weak | Medium |
| Surrogate scoring is currently simple | Unsafe tool calls may be judged with low fidelity | High |
| Replay environment isolation must be enforceable | Enterprise trust depends on no accidental production side effects | High |
| Reviewers need clear explanations | Raw case reports can overwhelm non-engineers | Medium |
| Publish gates may be too strict or too loose | Either slows adoption or lets regressions through | Medium |
| Skill preservation is section-based | Important instruction changes inside a section may be missed | Medium |
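The last challenge in the table can be made concrete with a small sketch of a section-presence check. It assumes skill drafts are Markdown with `#` headings; `dropped_sections` is a hypothetical helper, not the existing preservation checker.

```python
import re
from typing import List

# Matches Markdown ATX headings such as "## Safety".
HEADING = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def section_titles(markdown: str) -> List[str]:
    """Return normalized heading titles in document order."""
    return [m.group(1).strip().lower() for m in HEADING.finditer(markdown)]


def dropped_sections(original: str, revised: str) -> List[str]:
    """List sections present in the original draft but missing from the revision."""
    kept = set(section_titles(revised))
    return [title for title in section_titles(original) if title not in kept]


original = "# Goal\nShip it.\n## Safety\nNever delete files.\n"
removed = "# Goal\nShip it.\n"
rewritten = "# Goal\nShip it.\n## Safety\nDelete files freely.\n"

print(dropped_sections(original, removed))    # the missing section is reported
print(dropped_sections(original, rewritten))  # nothing reported: the body change is invisible
```

The second call shows the gap: the heading survives, so a section-level check reports no loss even though the instruction inside it was reversed.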
## Missing Information And Ambiguities
- No real customer interview data is provided for skill reviewers, enterprise admins, or workflow owners.
- No baseline metrics exist for current heuristic eval false positives or false negatives.
- No defined quality threshold exists for minimum acceptable replay coverage per skill category.
- No clear operational owner is assigned for replay failures, low confidence reports, or blocked tool classifications.
- No explicit policy matrix exists per toolset, customer deployment mode, or tenant risk tier.
- No customer-facing language has been finalized for explaining surrogate evaluation limitations.
## User Segments
### Segment 1: Skill Governance Admin
This user owns skill approval. They need a reliable way to decide whether a skill should be published. Their main pain is that a skill draft can appear well-written but still fail on real tasks.
### Segment 2: Enterprise AI Platform Buyer
This user evaluates Beaver as an internal AI platform. They care about risk, adoption, cost, governance, and operational control. They need to see that reusable Agent capabilities are not published blindly.
### Segment 3: Workflow Owner
This user has repeatable work such as weekly reports, project delivery, technical support, or file processing. They want accepted workflows to become faster and more consistent over time.
### Segment 4: Beaver Engineer / Operator
This user debugs replay failures, expands safe tool coverage, adjusts publish gates, and keeps the eval pipeline reliable.
## JTBD
| User | Job Story | Current Alternative | Desired Outcome |
| --- | --- | --- | --- |
| Skill reviewer | When a skill draft is ready, I want to see whether it works on prior accepted tasks, so I can approve it with evidence | Read the draft manually | Approve, reject, or revise with confidence |
| Admin | When a skill touches tools, I want to know what would execute, what is simulated, and what is blocked, so I can manage risk | Trust reviewer judgment | Clear coverage and side-effect evidence |
| Workflow owner | When my accepted task becomes a reusable skill, I want it to preserve what made the original task successful | Rewrite prompts manually | Similar future work gets better |
| Operator | When replay fails, I want to know whether the issue is provider, tool policy, case data, or candidate behavior | Read logs manually | Fast diagnosis and recovery |
## Alternative Product Positioning
| Positioning | Strength | Weakness | Recommendation |
| --- | --- | --- | --- |
| "Skill unit tests for Agents" | Easy for engineers to understand | Too narrow; suggests deterministic tests only | Use in engineering docs |
| "Replay-based skill quality gate" | Accurate and product-relevant | Needs explanation for non-technical buyers | Primary internal positioning |
| "Enterprise Agent governance evidence" | Strong for buyers | Less precise for builders | Use in sales and customer docs |
| "A/B testing for skill drafts" | Captures baseline vs candidate | May imply live user traffic experiments | Use carefully |
Recommended positioning:
> Skill Replay Eval is Beaver's evidence-based quality gate for reusable Agent skills. It replays accepted historical tasks, compares baseline and candidate behavior, and exposes execution coverage, surrogate coverage, regressions, and preservation risk before publication.
## Opportunity Areas
| Opportunity | Importance | Current Satisfaction | Opportunity Score | Notes |
| --- | ---: | ---: | ---: | --- |
| I need proof that a skill draft improves real task behavior | 0.95 | 0.25 | 0.71 | Core opportunity |
| I need automatic replay to avoid unsafe side effects | 0.95 | 0.35 | 0.62 | Required for enterprise trust |
| I need reports that are understandable to reviewers | 0.85 | 0.35 | 0.55 | Key adoption driver |
| I need preservation of existing skill instructions | 0.80 | 0.45 | 0.44 | Important for revisions |
| I need replay failures to be diagnosable | 0.75 | 0.40 | 0.45 | Operational maturity |
| I need configurable policy per deployment | 0.70 | 0.30 | 0.49 | Later enterprise hardening |
Top opportunities:
1. Evidence that a draft improves or preserves accepted task behavior.
2. Safe replay with explicit executed/surrogate/blocked coverage.
3. Reviewer-facing explanation that turns raw traces into decisions.
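The scores in the table are consistent with Importance × (1 − Current Satisfaction), rounded to two decimals. The sketch below recomputes them under that assumption; the shortened opportunity names are for illustration only, and `Decimal` is used to avoid float rounding surprises at the half-cent boundary.

```python
from decimal import Decimal, ROUND_HALF_UP

# (name, importance, current satisfaction) copied from the table above.
OPPORTUNITIES = [
    ("Proof of improved task behavior", "0.95", "0.25"),
    ("Replay without unsafe side effects", "0.95", "0.35"),
    ("Reviewer-readable reports", "0.85", "0.35"),
    ("Preservation of skill instructions", "0.80", "0.45"),
    ("Diagnosable replay failures", "0.75", "0.40"),
    ("Configurable policy per deployment", "0.70", "0.30"),
]


def opportunity_score(importance: str, satisfaction: str) -> Decimal:
    """Importance weighted by the unmet share of satisfaction, rounded to 2 decimals."""
    raw = Decimal(importance) * (Decimal(1) - Decimal(satisfaction))
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


scored = [(name, opportunity_score(imp, sat)) for name, imp, sat in OPPORTUNITIES]
ranked = sorted(scored, key=lambda item: item[1], reverse=True)

for name, score in ranked:
    print(f"{score}  {name}")
```

Sorting by score reproduces the top three opportunities listed above.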
## Product Expansion Ideas
Generated from PM, Designer, and Engineer perspectives.
### Product Manager Ideas
1. Replay Readiness Score: show whether a draft has enough historical evidence before eval starts.
2. Skill Release Gate Levels: allow advisory, strict, and regulated gates per workspace.
3. Regression Triage Queue: collect failed cases and route them to skill authors.
4. Customer-facing Audit Export: export replay report as PDF/Markdown for security review.
5. Skill Quality Trend: show whether a skill improves or degrades across versions.
### Product Designer Ideas
1. Reviewer Decision View: summarize "approve / revise / reject" with reasons before raw JSON.
2. Coverage Timeline: visualize executed, surrogate, and blocked calls per case.
3. Preservation Diff: show dropped or changed sections in a readable comparison.
4. Replay Case Drilldown: task text, expected behavior, baseline output, candidate output, and validator notes.
5. Confidence Language: translate low/medium/high confidence into concrete reviewer actions.
### Engineer Ideas
1. Pluggable Tool Policy Registry: classify tools by toolset, transport, metadata, and deployment risk.
2. Deterministic Replay Fixtures: save replay inputs and traces for reproducible debugging.
3. Sandbox User File Namespace: isolate user-file writes per replay arm.
4. LLM Surrogate Provider: replace deterministic heuristics with structured model judgment when available.
5. Replay Telemetry: metrics for replay latency, failure mode, blocked coverage, and provider availability.
Top 5 selected ideas:
| Rank | Idea | Why Selected | Assumptions To Validate |
| ---: | --- | --- | --- |
| 1 | Reviewer Decision View | Converts technical eval into action | Reviewers trust summarized recommendations |
| 2 | Sandbox User File Namespace | Directly addresses production side-effect risk | Existing file tooling can be redirected cleanly |
| 3 | LLM Surrogate Provider | Improves unsafe tool judgment quality | LLM judgment is consistent enough for review support |
| 4 | Replay Readiness Score | Prevents weak reports from appearing authoritative | Enough metadata exists to estimate readiness |
| 5 | Preservation Diff | Makes revision risk visible and actionable | Section- and body-level diffs catch meaningful drops |
## Key Assumptions
| Assumption | Category | Impact | Uncertainty |
| --- | --- | ---: | ---: |
| Accepted historical runs are representative enough to evaluate future skill behavior | Value | High | High |
| Reviewers will use replay reports to make better publish decisions | Value | High | Medium |
| Safe tools can execute in isolation without leaking state or causing production side effects | Feasibility | High | High |
| Surrogate evaluation can judge unsafe tool calls well enough to support review | Feasibility | High | High |
| Coverage and confidence are understandable to non-engineer reviewers | Usability | Medium | High |
| Publish gates will reduce risky releases without blocking too many useful skills | Viability | High | Medium |
| Skill preservation can be detected with lightweight section checks in v1 | Feasibility | Medium | Medium |
| Replay latency will be acceptable for review workflows | Usability | Medium | Medium |
| Customers will value replay eval enough to differentiate Beaver from generic Agent tools | Business Viability | High | Medium |
| The team can maintain tool policy as tools/connectors grow | Team Capability | High | Medium |
## Prioritized Assumptions
Priority = Impact × Uncertainty.
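Because Impact and Uncertainty are rated as labels rather than numbers, the product needs an ordinal mapping. The mapping below (Low = 1, Medium = 2, High = 3) is an assumption made for illustration; the PRD does not define one.

```python
# Assumed ordinal mapping for the High/Medium/Low labels.
LEVEL = {"Low": 1, "Medium": 2, "High": 3}


def priority(impact: str, uncertainty: str) -> int:
    """Multiply the ordinal values of the two ratings."""
    return LEVEL[impact] * LEVEL[uncertainty]


print(priority("High", "High"))      # 9
print(priority("High", "Medium"))    # 6
print(priority("Medium", "Medium"))  # 4
```

Assumptions that tie on the same score still need a judgment call when they are assigned to a tier.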
### P0 Validate Immediately
| Assumption | Why It Matters | What Could Go Wrong | Suggested Validation |
| --- | --- | --- | --- |
| Safe replay isolation is real, not only conceptual | One accidental external write can break trust | Replay calls production filesystem, connector, or credential paths | Technical isolation test with destructive and external-write tools |
| Replay reports help reviewers make better decisions | Product value depends on review decisions changing | Reports are too raw, ignored, or misunderstood | Reviewer usability test with 5 draft decisions |
| Surrogate evaluation is good enough for unsafe tools | Many enterprise tools cannot execute in replay | It rubber-stamps bad calls or flags good calls | Golden set of unsafe tool scenarios scored by humans vs surrogate |
| Historical accepted cases are adequate for eval | Weak cases create false confidence | Too few accepted runs or repetitive cases | Analyze real run store coverage across skills |
### P1 Important
| Assumption | Why It Matters | Validation |
| --- | --- | --- |
| Publish gate thresholds are calibrated | Prevents both overblocking and underblocking | Run shadow mode for 2 weeks and compare human decisions |
| Preservation checker catches meaningful draft regressions | Revision safety depends on it | Compare section checker with manual diff review |
| Replay latency fits review workflow | Slow eval hurts adoption | Measure p50/p95 per case and per draft |
| Customers understand confidence and coverage language | Trust depends on clear communication | Customer-facing report comprehension test |
### P2 Later
| Assumption | Why It Matters | Validation |
| --- | --- | --- |
| Per-tool policy UI is needed | May not be needed in v1 | Observe support/admin requests |
| Audit export becomes a buying requirement | Useful for enterprise sales | Ask pilot buyers during procurement review |
| Skill quality trend is a major retention driver | Useful after multiple versions exist | Measure repeat reviewer usage after v1 |
## Opportunity Solution Tree
Desired outcome:
> Increase trusted skill publication: at least 80% of approved skill drafts have replay or explicit skipped-provider evidence, there are zero known production side effects from replay, and median reviewer decision time is under 10 minutes for common drafts.
```text
Outcome: Trusted skill publication
Opportunity 1: I need proof that a skill draft improves real task behavior.
Solution 1.1: Baseline vs candidate replay on accepted historical tasks.
Experiment: Run replay on 10 recent skill drafts and compare with manual reviewer judgment.
Solution 1.2: Replay readiness score before evaluation starts.
Experiment: Score existing candidates and check whether low-readiness reports are less useful.
Solution 1.3: Regression triage queue.
Experiment: Manually label failed cases for two weeks and measure fix rate.
Opportunity 2: I need replay to avoid unsafe side effects.
Solution 2.1: Tool mode classification: executed, surrogate, blocked.
Experiment: Golden tool policy test set covering filesystem, MCP, connectors, delete, send, publish.
Solution 2.2: Isolated workspace and user-file namespace per arm.
Experiment: Replay write task and verify no production paths change.
Solution 2.3: Side-effect journal in each case report.
Experiment: Security reviewer reads 5 reports and identifies all intended side effects.
Opportunity 3: I need reports I can act on.
Solution 3.1: Reviewer decision summary with approve/revise/reject guidance.
Experiment: First-click and decision-time test with reviewers.
Solution 3.2: Coverage and confidence explanation.
Experiment: Ask reviewers to explain report meaning after reading it.
Solution 3.3: Preservation diff for revisions.
Experiment: Seed dropped-instruction drafts and measure detection rate.
```
## Validation Experiments
| P0 Assumption | Hypothesis | Experiment | Cost | Duration | Success Criteria | Failure Criteria |
| --- | --- | --- | --- | --- | --- | --- |
| Safe replay isolation | Replay can execute safe tools without touching production state | Build a replay fixture that writes, reads, sends, deletes, and publishes through classified tools | Medium | 2-4 days | 100% of production paths untouched; destructive calls blocked; external writes routed to surrogate | Any real external write or production path mutation |
| Reviewer decision value | Replay reports improve approval accuracy and speed | Give 5 reviewers 8 historical drafts with and without replay report | Low | 2 days | Decision accuracy +25% versus decisions made without the report; median decision time under 10 minutes | No improvement or reports misunderstood |
| Surrogate quality | Surrogate scoring agrees with human reviewer on unsafe tool calls | Create 30 unsafe-tool scenarios and compare human labels vs surrogate output | Medium | 3-5 days | >=80% agreement on pass/fail; all high-risk bad calls flagged | High-risk false pass |
| Historical case adequacy | Accepted runs provide enough useful replay cases | Audit run store across top 10 skills/candidates | Low | 1 day | >=70% of candidates have >=3 meaningful accepted cases | Most candidates have no usable cases |
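The surrogate quality experiment has two success criteria: at least 80% pass/fail agreement with human labels, and no high-risk bad call passed by the surrogate. A minimal scoring sketch follows, assuming each scenario is stored with the hypothetical fields shown in `LabeledScenario`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LabeledScenario:
    """Hypothetical record for one unsafe-tool scenario."""
    human_pass: bool      # human reviewer's pass/fail label
    surrogate_pass: bool  # surrogate evaluator's pass/fail output
    high_risk: bool       # scenario involves a high-risk tool call


def agreement_rate(scenarios: List[LabeledScenario]) -> float:
    """Share of scenarios where surrogate and human agree on pass/fail."""
    if not scenarios:
        raise ValueError("at least one labeled scenario is required")
    matches = sum(1 for s in scenarios if s.human_pass == s.surrogate_pass)
    return matches / len(scenarios)


def high_risk_false_passes(scenarios: List[LabeledScenario]) -> int:
    """Count high-risk calls the human failed but the surrogate passed."""
    return sum(
        1 for s in scenarios
        if s.high_risk and not s.human_pass and s.surrogate_pass
    )


def experiment_passes(scenarios: List[LabeledScenario], min_agreement: float = 0.80) -> bool:
    """Apply both success criteria from the experiment table."""
    return (
        agreement_rate(scenarios) >= min_agreement
        and high_risk_false_passes(scenarios) == 0
    )
```

Checking the false-pass count separately matters because a single high-risk false pass fails the experiment even when overall agreement is high.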
## Feature Prioritization
### Must Have
| Feature | Impact | Effort | Risk | Strategic Alignment |
| --- | --- | --- | --- | --- |
| Eval report compatibility fields | High | Low | Low | Required foundation |
| Historical accepted case selection | High | Medium | Medium | Required for behavior evidence |
| Baseline vs candidate replay arms | High | High | High | Core value |
| Tool mode classification | High | Medium | High | Core trust boundary |
| Replay coverage and confidence report | High | Medium | Medium | Reviewer decision support |
| Publish gates for failed/low-confidence replay | High | Low | Medium | Governance promise |
| Preservation check for revise/merge drafts | Medium | Medium | Medium | Prevents silent instruction loss |
| Skills UI report summary | High | Medium | Medium | Adoption requirement |
### Should Have
| Feature | Impact | Effort | Risk | Strategic Alignment |
| --- | --- | --- | --- | --- |
| Reviewer decision summary | High | Medium | Medium | Converts evidence to action |
| Preservation diff view | Medium | Medium | Low | Improves reviewer comprehension |
| Replay readiness score | Medium | Medium | Medium | Prevents false confidence |
| Operational metrics dashboard | Medium | Medium | Low | Needed for maintenance |
| Golden tool policy test suite | High | Medium | Medium | Needed for safety assurance |
### Could Have
| Feature | Impact | Effort | Risk | Strategic Alignment |
| --- | --- | --- | --- | --- |
| Audit export | Medium | Medium | Low | Enterprise sales support |
| Skill quality trend | Medium | Medium | Medium | Useful after version history grows |
| Per-tool admin policy UI | Medium | High | Medium | Enterprise customization |
| Replay fixtures download | Low | Medium | Low | Debugging convenience |
### Not Yet
| Feature | Reason |
| --- | --- |
| Full Docker orchestration per replay case | Too heavy for first release; design explicitly scopes it out |
| Production third-party write replay | Violates trust boundary |
| Removing human review | Replay evidence should support review, not replace it |
| Fully customizable policy UI | Add after policy needs are observed |
Features to cut from v1:
- Per-tool policy UI.
- Audit export.
- Skill quality trend.
- Full Docker-per-case orchestration.
Features likely over-engineered for v1:
- Customer-configurable replay policies before default policy is proven.
- Complex statistical scoring before case quality and surrogate accuracy are validated.
- Automatic publish for high-scoring drafts.
## Metrics Dashboard
### North Star Metric
Trusted Skill Publish Rate:
> Approved skill drafts with usable eval evidence and no post-publish regression reports / total approved skill drafts, measured weekly.
Target for v1 pilot: >=80%.
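As a sketch of how the weekly rate might be computed, assuming each approved draft is summarized by the two hypothetical fields in `ApprovedDraft`:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ApprovedDraft:
    """Hypothetical weekly record for one approved skill draft."""
    has_usable_eval_evidence: bool
    post_publish_regression_reports: int


def trusted_publish_rate(drafts: List[ApprovedDraft]) -> float:
    """Trusted approved drafts divided by all approved drafts; 0.0 when none were approved."""
    if not drafts:
        return 0.0
    trusted = sum(
        1 for d in drafts
        if d.has_usable_eval_evidence and d.post_publish_regression_reports == 0
    )
    return trusted / len(drafts)


week = [
    ApprovedDraft(True, 0),
    ApprovedDraft(True, 0),
    ApprovedDraft(True, 0),
    ApprovedDraft(True, 0),
    ApprovedDraft(False, 0),  # approved without usable evidence
]
print(trusted_publish_rate(week))  # 0.8, exactly at the pilot target
```

Regression reports arrive after publication, so a week's rate can drop retroactively and may need to be recomputed.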
### Input Metrics
| Metric | Definition | Data Source | Visualization | Target | Alert Threshold |
| --- | --- | --- | --- | --- | --- |
| Replay Evidence Coverage | Draft eval reports with mode `replay` or explicit skipped-provider status / all eval reports | Skill eval store | Weekly line | >=80% | <60% for 2 weeks |
| Executed Tool Coverage | Executed tool calls / all replay tool calls | Case reports | Stacked bar | >=50% for safe-tool skills | <25% for safe-tool skills |
| Surrogate Coverage | Surrogate tool calls / all replay tool calls | Case reports | Stacked bar | Transparent, not necessarily low | Sudden +30% week over week |
| Blocked Coverage | Blocked tool calls / all replay tool calls | Case reports | Stacked bar | <10% | >=25% or any `blocked_coverage = 1.0` |
| Reviewer Decision Time | Time from eval report created to approve/reject/revise | Review events | Median and p95 | Median <10 min | p95 >30 min |
| Replay Regression Rate | Reports with `regression_count > 0` / replay reports | Eval store | Weekly line | No fixed target; investigate each regression | >20% |
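The three coverage ratios share one denominator, so they can be derived from a flat list of per-call modes. The sketch assumes each replay tool call is recorded as one of the strings `"executed"`, `"surrogate"`, or `"blocked"`; the function names are illustrative.

```python
from collections import Counter
from typing import Dict, List

MODES = ("executed", "surrogate", "blocked")


def coverage(call_modes: List[str]) -> Dict[str, float]:
    """Return each mode's share of all replay tool calls."""
    total = len(call_modes)
    if total == 0:
        return {mode: 0.0 for mode in MODES}
    counts = Counter(call_modes)
    return {mode: counts[mode] / total for mode in MODES}


def blocked_alert(cov: Dict[str, float]) -> bool:
    """Alert threshold from the table: blocked coverage of 25% or more."""
    return cov["blocked"] >= 0.25


calls = ["executed"] * 5 + ["surrogate"] * 3 + ["blocked"] * 2
cov = coverage(calls)
print(cov)
print(blocked_alert(cov))
```

A fully blocked report (`blocked` equal to 1.0) is already covered by the 25% threshold.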
### Leading Indicators
- Number of accepted runs eligible for replay per skill.
- Percentage of candidates with at least 3 replay cases.
- Provider unavailable skip rate.
- Replay error or partial status rate.
- Preservation failures per revised skill draft.
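The second indicator above can be computed from a per-candidate count of eligible replay cases. The input shape, a mapping from candidate id to case count, is an assumption for illustration.

```python
from typing import Dict


def share_with_min_cases(cases_per_candidate: Dict[str, int], min_cases: int = 3) -> float:
    """Fraction of candidates that have at least `min_cases` replay cases."""
    if not cases_per_candidate:
        return 0.0
    enough = sum(1 for count in cases_per_candidate.values() if count >= min_cases)
    return enough / len(cases_per_candidate)


counts = {"weekly-report": 5, "support-triage": 3, "file-cleanup": 1, "release-notes": 0}
print(share_with_min_cases(counts))  # 0.5
```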
### Guardrail Metrics
| Guardrail | Definition | Alert |
| --- | --- | --- |
| Production Side Effect Incidents | Any replay-caused write to production workspace, user files, credentials, or external systems | Immediate P0 |
| False Pass Incidents | Published draft later confirmed to regress an accepted workflow despite passing replay | Weekly review; P1 if repeated |
| False Block Incidents | Useful draft blocked due to bad policy or low-confidence bug | Weekly review |
| Replay Latency | p95 replay completion time per draft | Alert if p95 >15 minutes in pilot |
| Report Comprehension | Reviewers correctly explain coverage/confidence in usability tests | Rework UI copy if <80% |
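The latency guardrail depends on a p95 over per-draft replay durations. The sketch below uses the nearest-rank percentile definition, which is one of several common conventions and is assumed here; durations are in minutes.

```python
import math
from typing import List


def percentile_nearest_rank(values: List[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of data at or below it."""
    if not values:
        raise ValueError("at least one value is required")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_alert(durations_min: List[float], threshold_min: float = 15.0) -> bool:
    """Pilot guardrail from the table: alert when p95 exceeds 15 minutes."""
    return percentile_nearest_rank(durations_min, 95) > threshold_min


durations = [float(m) for m in range(1, 21)]  # 1 to 20 minutes
print(percentile_nearest_rank(durations, 50))  # 10.0
print(percentile_nearest_rank(durations, 95))  # 19.0
print(latency_alert(durations))                # True
```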
### Review Cadence
- Daily during pilot: replay errors, side-effect alerts, provider skips.
- Weekly: publish outcomes, regression rate, reviewer decision time, blocked/surrogate coverage.
- Monthly: threshold calibration and customer feedback.
- Quarterly: policy model, scoring model, and roadmap review.
## Customer Research Plan
No customer interviews or support tickets were provided. Run research before treating demand and usability assumptions as validated.
### Research Participants
- 3-5 internal skill reviewers or admins.
- 3 workflow owners who want accepted tasks converted into reusable skills.
- 2 enterprise/security stakeholders who review AI governance.
- 2 engineers/operators responsible for deployment and incident response.
### Research Questions
- What evidence do reviewers need before approving a reusable skill?
- Which replay report fields are meaningful, and which are noise?
- Do users understand executed vs surrogate vs blocked coverage?
- What level of uncertainty is acceptable for publishing?
- What customer-facing proof is needed for enterprise pilots?
- Which tool categories must never execute during replay?
### Recommended Actions
- Run a moderated reviewer test with the current Skills page report.
- Create 5 seeded draft cases: clear improvement, clear regression, unsafe external write, preservation drop, provider unavailable.
- Ask participants to approve/revise/reject each case and explain why.
- Compare their decisions with current publish gate behavior.
## Interview Guide
### Objectives
- Validate whether replay evidence changes approval behavior.
- Identify confusing report language.
- Understand risk tolerance for surrogate and blocked calls.
- Learn what artifacts enterprise buyers need for adoption.
### Warm-Up
- Tell me about the last time you reviewed or approved reusable AI guidance, prompts, tools, or workflows.
- What made the approval easy or hard?
- What happened after it was approved?
### JTBD Questions
- Walk me through the last time an AI workflow worked well enough that you wanted to reuse it.
- What evidence did you have that it would work again?
- What would make you hesitate to publish it for others?
- What does "safe to publish" mean in your environment?
### Behavioral Questions
- Show me how you would decide whether this draft should be approved.
- Which part of this report would you read first?
- What would you ignore?
- What would you ask an engineer to explain?
### Risk Validation Questions
- If a replay report says 70% executed and 30% surrogate, what decision would you make?
- If all important external writes were surrogate-evaluated, is that enough for review?
- Which tools should always be blocked in your environment?
- What kind of failure would make you disable replay eval?
### Note Template
```text
Participant:
Role:
Date:
Last relevant review:
Decision evidence needed:
Confusing report fields:
Risk tolerance:
Must-block tool categories:
Minimum publish evidence:
Unexpected insight:
Follow-up:
```
## Recommended Next 30 Days
1. Validate replay isolation with a golden tool policy suite.
2. Run the current backend unit tests around skill learning replay and publish gates.
3. Add a small reviewer decision summary above raw replay details.
4. Run 5-8 reviewer usability sessions using seeded draft cases.
5. Audit accepted run coverage for top skills and identify gaps.
6. Decide v1 gate thresholds for blocked coverage, confidence, and preservation failure.
7. Add operational logging and metrics for replay status, latency, and provider skips.
## Recommended Next 90 Days
1. Replace or augment deterministic surrogate scoring with structured LLM judgment and human-labeled calibration cases.
2. Add replay readiness scoring before eval starts.
3. Improve preservation from section presence to diff-based critical instruction detection.
4. Add a customer-facing, exportable audit summary for enterprise pilot conversations.
5. Build a replay operations dashboard.
6. Introduce deployment-level policy profiles only after default policies produce stable data.
7. Track skill quality across versions and post-publish regression reports.
## Biggest Risks
| Risk | Severity | Mitigation |
| --- | --- | --- |
| Replay accidentally mutates production state | Critical | Golden policy tests, isolated namespaces, external writes handled as surrogate by default, P0 alert |
| Surrogate scoring gives false confidence | High | Human-labeled calibration set, show low confidence clearly, no automatic publish |
| Reviewers ignore reports because they are too complex | High | Decision summary, comprehension testing, action-oriented UI copy |
| Accepted run data is too sparse | High | Readiness score, fallback to explicit skipped/low-evidence state, collect more accepted cases |
| Publish gates block too many useful skills | Medium | Shadow mode calibration and override with explicit review rationale |
| Evaluation costs or latency grow quickly | Medium | Cap cases, cache web/search, track p95 latency, async background eval |
## Recommended Immediate Actions
1. Treat Skill Replay Eval as a v1 trust gate, not a complete benchmark.
2. Keep human review mandatory for publish.
3. Do not execute production third-party writes during automatic replay.
4. Add reviewer-facing explanations before adding more raw report data.
5. Validate isolation and surrogate quality before broad rollout.
6. Use the first pilot to learn threshold calibration, not to claim perfect quality measurement.