Skill Replay Eval Launch And Maintenance Runbook

Date: 2026-06-09

Purpose: define how to validate, launch, operate, and maintain Skill Replay Eval safely.

1. Launch Principle

Ship Skill Replay Eval as a guarded trust feature.

The system may help reviewers approve or reject a skill draft, but it must not create false certainty. When evidence is weak, the product should say so clearly. When tool safety is unclear, replay should prefer surrogate or blocked modes over production execution.

2. Ownership

| Area | Owner | Responsibility |
| --- | --- | --- |
| Product quality | Product owner | Metrics, pilot feedback, publish threshold decisions |
| Replay pipeline | Backend engineer | Case selection, replay runner, scoring, report persistence |
| Tool safety policy | Backend + security reviewer | Tool classification, blocked/surrogate rules, side-effect tests |
| Skills UI | Frontend/design owner | Report summary, reviewer decision flow, report readability |
| Operations | Deployment owner | Logs, alerts, provider availability, incident response |
| Customer pilot | Pilot lead | Participant selection, feedback, rollout communication |

3. Pre-Launch Readiness

Required Code Checks

Run backend tests from app-instance/backend:

pytest tests/unit/test_skill_learning_eval_report_model.py -v
pytest tests/unit/test_skill_learning_case_selection.py -v
pytest tests/unit/test_skill_learning_preservation.py -v
pytest tests/unit/test_skill_learning_replay.py -v
pytest tests/unit/test_skill_learning_replay_runner.py -v
pytest tests/unit/test_agent_loop_replay_executor.py -v
pytest tests/unit/test_skill_learning_surrogate.py -v
pytest tests/unit/test_skill_learning_eval.py -v
pytest tests/unit/test_skill_learning_pipeline.py -v

Run frontend verification from app-instance/frontend:

npm run lint
npm run test -- --runInBand

If frontend tests are not configured, perform manual Skills page verification with seeded report payloads.

Golden Safety Cases

Before pilot launch, create or manually verify a golden set with these cases:

| Case | Expected Result |
| --- | --- |
| Safe filesystem read | executed |
| Safe filesystem write to replay workspace | executed, no production write |
| User-file write in replay namespace | executed only if isolated, otherwise surrogate |
| Web/search read | executed or cached read |
| Email send | surrogate |
| Calendar invite | surrogate |
| Connector publish/post/reply | surrogate |
| Delete/remove/destroy | blocked |
| Permission/credential/payment action | blocked |

Launch blocker:

  • Any replay case mutates production workspace, user files, credentials, external accounts, permissions, or payment state.
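The golden cases above can be read as a small classification policy. The following is a minimal sketch, assuming tools can be reduced to a coarse action label; the labels, the `classify` function, and the `isolated` flag are hypothetical and not taken from the Beaver codebase. Unknown actions fall back to surrogate, which is one of the two conservative options the launch principle allows.

```python
from enum import Enum


class ReplayMode(Enum):
    EXECUTED = "executed"
    SURROGATE = "surrogate"
    BLOCKED = "blocked"


# Hypothetical action labels; a real policy may key on tool metadata instead.
BLOCKED_ACTIONS = {"delete", "remove", "destroy", "permission", "credential", "payment"}
SURROGATE_ACTIONS = {
    "email_send",
    "calendar_invite",
    "connector_publish",
    "connector_post",
    "connector_reply",
}
EXECUTED_ACTIONS = {"filesystem_read", "replay_workspace_write", "web_read", "search_read"}


def classify(action: str, isolated: bool = False) -> ReplayMode:
    """Map an action label to the replay mode expected by the golden cases."""
    if action in BLOCKED_ACTIONS:
        return ReplayMode.BLOCKED
    if action in SURROGATE_ACTIONS:
        return ReplayMode.SURROGATE
    if action == "user_file_write":
        # Executed only when the replay namespace is isolated.
        return ReplayMode.EXECUTED if isolated else ReplayMode.SURROGATE
    if action in EXECUTED_ACTIONS:
        return ReplayMode.EXECUTED
    # Assumption: unclear safety defaults to surrogate, never to execution.
    return ReplayMode.SURROGATE
```

A golden-set test could then assert one expected mode per row of the table, so that any policy change that loosens a row fails before launch.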

Report Readiness Checks

Each replay report must show:

  • Eval status.
  • Baseline average.
  • Candidate average.
  • Score delta.
  • Improved/regressed/unchanged counts.
  • Execution coverage.
  • Surrogate coverage.
  • Blocked coverage.
  • Confidence.
  • Replay cases.
  • Case reports.
  • Preservation report when applicable.
  • Raw report for debugging.
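A readiness check for the list above could be automated along these lines. This is a sketch only: the field names are hypothetical stand-ins for whatever keys the real report payload uses, and `preservation_applicable` is an assumed flag for drafts where a preservation report is expected.

```python
from typing import List

# Hypothetical key names, one per item in the readiness list.
REQUIRED_FIELDS = (
    "eval_status",
    "baseline_average",
    "candidate_average",
    "score_delta",
    "improved_count",
    "regressed_count",
    "unchanged_count",
    "execution_coverage",
    "surrogate_coverage",
    "blocked_coverage",
    "confidence",
    "replay_cases",
    "case_reports",
    "raw_report",
)


def missing_report_fields(report: dict, preservation_applicable: bool = False) -> List[str]:
    """Return the required fields that are absent from a report payload."""
    required = list(REQUIRED_FIELDS)
    if preservation_applicable:
        required.append("preservation_report")
    return [name for name in required if name not in report]
```

An empty return value means the payload carries every field a reviewer needs; anything else names what is missing.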

Publish Gate Checks

Publish must fail when:

  • No approved review exists.
  • Safety report is missing or failed.
  • Eval report failed. The explicit skipped_provider_unavailable status does not count as a failure.
  • Replay confidence is low.
  • Replay blocked coverage is 1.0.
  • Preservation report failed.

Publish may proceed with explicit human review when:

  • Provider is unavailable and eval status is skipped_provider_unavailable.
  • Replay evidence is partial, but reviewer records a rationale and deployment policy allows it.
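The gate rules above can be expressed as a function that returns the list of reasons publish must fail. This is a minimal sketch under stated assumptions: `PublishInputs`, its field names, and the status strings other than `skipped_provider_unavailable` are hypothetical. The rationale requirement for skipped-provider publish follows the Phase 1 rollout rule in this runbook.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PublishInputs:
    review_approved: bool
    safety_status: Optional[str]   # "passed", "failed", or None when missing
    eval_status: str               # e.g. "passed", "failed", "partial"
    confidence: Optional[str]      # "low", "medium", "high"; None if no replay ran
    blocked_coverage: float = 0.0
    preservation_status: Optional[str] = None  # None when not applicable
    reviewer_rationale: str = ""
    policy_allows_partial: bool = False


def publish_blockers(p: PublishInputs) -> List[str]:
    """Return every reason publish must fail; an empty list means it may proceed."""
    blockers = []
    if not p.review_approved:
        blockers.append("no approved review")
    if p.safety_status != "passed":
        blockers.append("safety report missing or failed")
    if p.eval_status == "failed":
        blockers.append("eval report failed")
    if p.confidence == "low":
        blockers.append("replay confidence is low")
    if p.blocked_coverage >= 1.0:
        blockers.append("blocked coverage is 1.0")
    if p.preservation_status == "failed":
        blockers.append("preservation report failed")
    has_rationale = bool(p.reviewer_rationale.strip())
    if p.eval_status == "skipped_provider_unavailable" and not has_rationale:
        blockers.append("skipped-provider publish needs reviewer rationale")
    if p.eval_status == "partial" and not (has_rationale and p.policy_allows_partial):
        blockers.append("partial evidence needs rationale and policy allowance")
    return blockers
```

Returning all blockers rather than stopping at the first keeps gate failures actionable, since the reviewer sees everything that must be fixed in one pass.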

4. Rollout Plan

Phase 0: Shadow Mode

Audience: internal team only.

Duration: 1 week or 10 draft evaluations, whichever comes first.

Behavior:

  • Generate replay reports.
  • Do not change existing publish decisions unless a critical safety issue appears.
  • Compare replay recommendation with human reviewer decision.

Exit criteria:

  • No production side effects.
  • No unexplained replay crashes on common drafts.
  • Reviewers can explain report meaning.
  • Product owner reviews gate threshold data.
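One way to compare replay recommendations with reviewer decisions during shadow mode is a simple agreement rate plus a list of the disagreeing evaluations. The sketch below assumes each evaluation is recorded as a `(replay_recommendation, human_decision)` pair of strings; that record shape and the function name are hypothetical.

```python
from typing import List, Tuple


def compare_decisions(pairs: List[Tuple[str, str]]) -> Tuple[float, List[int]]:
    """Return the agreement rate and the indexes of disagreeing evaluations."""
    if not pairs:
        raise ValueError("no evaluations to compare")
    disagreements = [
        index for index, (replay, human) in enumerate(pairs) if replay != human
    ]
    rate = 1 - len(disagreements) / len(pairs)
    return rate, disagreements


shadow_pairs = [
    ("approve", "approve"),
    ("reject", "approve"),
    ("approve", "approve"),
    ("reject", "reject"),
]
rate, disagreements = compare_decisions(shadow_pairs)
print(rate, disagreements)  # 0.75 [1]
```

The disagreeing indexes are the evaluations worth reading in full when the product owner reviews gate threshold data.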

Phase 1: Strict Internal Gate

Audience: internal maintainers and trusted reviewers.

Behavior:

  • Enforce low-confidence, blocked coverage, failed preservation, failed eval, and failed safety gates.
  • Require manual rationale for skipped-provider publish.

Exit criteria:

  • 0 P0 incidents.
  • Publish blockers are actionable and not noisy.
  • Reviewer median decision time under 10 minutes for common drafts.

Phase 2: Pilot Customer Gate

Audience: selected pilot customer or internal department.

Behavior:

  • Keep human review mandatory.
  • Provide customer-facing explanation of replay evidence.
  • Track skipped-provider and low-confidence cases closely.

Exit criteria:

  • Pilot admin accepts report as useful governance evidence.
  • No side-effect incidents.
  • Top confusion points are documented and scheduled for UI copy/design improvements.

Phase 3: General Availability Candidate

Audience: all enabled deployments.

Behavior:

  • Replay Eval enabled by default where provider and case data are available.
  • Skipped-provider state remains explicit.
  • Tool policy remains conservative.

Exit criteria:

  • Operational dashboard exists.
  • Incident response is rehearsed.
  • Policy change process is documented.

5. Monitoring

Product Metrics

| Metric | Owner | Cadence | Alert |
| --- | --- | --- | --- |
| Trusted Skill Publish Rate | Product | Weekly | <60% for 2 weeks |
| Reviewer Decision Time | Product/design | Weekly | p95 >30 minutes |
| Replay Regression Rate | Product/engineering | Weekly | >20% of replay reports |
| Report Comprehension | Product/design | Per research round | <80% explain coverage/confidence correctly |

Operational Metrics

| Metric | Owner | Cadence | Alert |
| --- | --- | --- | --- |
| Replay status counts | Engineering | Daily during pilot | Any spike in replay_error or partial |
| Provider unavailable skip rate | Operations | Daily | >25% of evals in pilot |
| Replay latency p50/p95 | Engineering | Daily | p95 >15 minutes |
| Blocked coverage | Security/engineering | Weekly | Any report with blocked_coverage=1.0 |
| Production side-effect incidents | Security/operations | Immediate | Any nonzero event |
| Failed preservation reports | Product/engineering | Weekly | Spike after synthesizer change |
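The operational alerts with numeric thresholds could be checked with a function like the one below. This is a sketch, not the dashboard implementation: the metric key names are hypothetical, and the "spike" alerts are left out because this runbook does not define a numeric threshold for them.

```python
from typing import List


def operational_alerts(metrics: dict) -> List[str]:
    """Evaluate the numeric operational alert thresholds from the table."""
    alerts = []
    if metrics["provider_skip_rate"] > 0.25:
        alerts.append("provider skip rate above 25%")
    if metrics["replay_latency_p95_minutes"] > 15:
        alerts.append("replay latency p95 above 15 minutes")
    if metrics["max_blocked_coverage"] >= 1.0:
        alerts.append("report with blocked_coverage=1.0")
    if metrics["side_effect_incidents"] > 0:
        alerts.append("production side-effect incident")
    return alerts


daily = {
    "provider_skip_rate": 0.30,
    "replay_latency_p95_minutes": 12,
    "max_blocked_coverage": 1.0,
    "side_effect_incidents": 0,
}
print(operational_alerts(daily))
# ['provider skip rate above 25%', 'report with blocked_coverage=1.0']
```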

Logs To Inspect

  • Skill learning candidate events.
  • Draft creation and safety report events.
  • Eval report generation events.
  • Replay arm run IDs and the source value skill_replay_eval.
  • Tool traces and classification reasons.
  • Publish gate errors.
  • Provider unavailable errors.

6. Incident Response

P0: Production Side Effect During Replay

Examples:

  • Email sent.
  • Calendar invite created.
  • External connector publish/post/reply happened.
  • Production file or credential changed.
  • Permission/payment action executed.

Immediate actions:

  1. Disable replay eval generation.
  2. Disable skill publish if policy risk is unclear.
  3. Preserve logs, replay traces, eval reports, and affected tool metadata.
  4. Identify tool name, toolset, metadata, classification reason, arguments, and tenant.
  5. Patch the policy so the affected tool class is blocked or routed to surrogate.
  6. Add a regression test to golden safety cases.
  7. Notify pilot/customer owner if customer data or systems were affected.

Restart criteria:

  • Root cause documented.
  • Regression test passes.
  • Security owner approves restart.

P1: False Pass

Definition: draft passed replay and was published, then confirmed to regress a real accepted workflow.

Actions:

  1. Unpublish or revert skill version if impact is active.
  2. Add the failed task as a replay case.
  3. Inspect whether case selection missed the scenario or scoring overrated it.
  4. Adjust gate threshold, surrogate scoring, or preservation check.
  5. Record postmortem in skill quality log.

P1: False Block

Definition: useful draft blocked due to bad replay policy, low-confidence bug, or report construction issue.

Actions:

  1. Do not bypass silently; record reviewer rationale.
  2. Identify blocking rule and trace.
  3. Add regression test if policy bug.
  4. Decide whether threshold should change or case should remain blocked.

P2: Provider Unavailable Spike

Actions:

  1. Check provider configuration and model availability.
  2. Confirm fallback status is explicit.
  3. Track how many publish decisions rely on skipped-provider.
  4. Pause broad rollout if skipped-provider exceeds pilot threshold.

7. Maintenance Cadence

Daily During Pilot

  • Check replay errors and provider skips.
  • Check blocked_coverage=1.0 reports.
  • Confirm no side-effect incidents.
  • Review new publish gate failures.

Weekly

  • Review metrics dashboard.
  • Calibrate publish gate thresholds.
  • Review 3-5 replay reports for readability.
  • Inspect false pass/false block candidates.
  • Update tool policy based on new tools or connectors.

Monthly

  • Review customer/pilot feedback.
  • Refresh golden safety cases.
  • Sample preservation reports for missed instruction drops.
  • Review storage growth from replay case reports and traces.
  • Decide whether to promote features from Should Have to Must Have.

Quarterly

  • Revisit risk model and tool policy profiles.
  • Review whether LLM surrogate calibration meets quality target.
  • Decide whether to add audit export or per-deployment policy UI.
  • Retire stale replay cases or update case selection logic.

8. Data Retention And Privacy

Replay reports may contain task text, tool arguments, schemas, final answers, and side-effect descriptions. Treat them as sensitive operational data.

Recommended policy:

  • Store summarized report for normal review.
  • Limit raw case report retention or restrict access to admins.
  • Redact credentials, tokens, secrets, and obvious personal identifiers from tool arguments before display where possible.
  • Reports should never contain results of production external writes, because replay must not execute them.
  • Define tenant-specific retention before enterprise rollout.
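Redaction of tool arguments before display might look like the sketch below. It assumes tool arguments are JSON-like nested dicts and lists; the key patterns, placeholder strings, and `redact` function are hypothetical, and email addresses stand in for "obvious personal identifiers". A pattern-based pass like this is best-effort and does not replace access restrictions on raw reports.

```python
import re

# Hypothetical patterns; extend them to match the deployment's own tools.
SENSITIVE_KEY = re.compile(
    r"(password|passwd|secret|token|api[_-]?key|credential|authorization)",
    re.IGNORECASE,
)
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(value, key: str = ""):
    """Return a copy of value with sensitive keys and email addresses masked."""
    if key and SENSITIVE_KEY.search(key):
        # Mask the whole value, including nested structures, under a sensitive key.
        return "[REDACTED]"
    if isinstance(value, dict):
        return {k: redact(v, str(k)) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(item, key) for item in value]
    if isinstance(value, str):
        return EMAIL.sub("[REDACTED_EMAIL]", value)
    return value


args = {"to": "a@b.com", "api_key": "xyz", "nested": {"Token": "t", "n": 3}}
print(redact(args))
# {'to': '[REDACTED_EMAIL]', 'api_key': '[REDACTED]', 'nested': {'Token': '[REDACTED]', 'n': 3}}
```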

9. Release Communication

Internal Message

Skill Replay Eval adds evidence to skill publishing. Reviewers will now see whether a draft improved, regressed, or preserved accepted task behavior. Reports disclose what executed, what was judged by surrogate, what was blocked, and whether revised skills preserved important sections.

Customer / Pilot Message

Beaver can now evaluate reusable skill drafts against prior accepted work before publication. The report shows both confidence and uncertainty. Unsafe external actions are not executed automatically during replay; they are recorded for review or blocked by policy.

Known Limitations To Disclose

  • Replay quality depends on available accepted historical runs.
  • Surrogate evaluation is not the same as real execution.
  • Low-confidence reports require more human review.
  • Human approval is still required.
  • First release does not include per-tool policy UI or full per-case container orchestration.

10. Rollback Plan

Rollback options:

  1. Disable replay runner injection and fall back to heuristic eval.
  2. Keep report fields but set mode to heuristic.
  3. Keep publish gate requiring safety and human review.
  4. Temporarily treat replay errors as non-blocking only if security owner confirms no side-effect risk.
  5. Preserve failed replay reports for debugging.

Rollback triggers:

  • Any P0 side-effect incident.
  • Repeated replay errors that block normal skill review.
  • Provider unavailable spike that makes most reports skipped.
  • Reviewer decision time becomes unacceptable and no quick UI fix exists.

11. Launch Checklist

  • Backend replay tests pass.
  • Frontend report rendering verified.
  • Golden tool safety cases pass.
  • No production side-effect path found.
  • Publish gates tested manually.
  • Skipped-provider copy is clear.
  • Reviewer decision summary exists or is tracked as a launch follow-up.
  • Pilot participants selected.
  • Metrics dashboard owner assigned.
  • Incident owner and escalation path assigned.
  • Rollback path verified.