Skill Replay Eval Launch And Maintenance Runbook

Date: 2026-06-09

Purpose: define how to validate, launch, operate, and maintain Skill Replay Eval safely.

1. Launch Principle

Ship Skill Replay Eval as a guarded trust feature.

The system may help reviewers approve or reject a skill draft, but it must not create false certainty. When evidence is weak, the product should say so clearly. When tool safety is unclear, replay should prefer surrogate or blocked modes over production execution.

2. Ownership

Area	Owner	Responsibility
Product quality	Product owner	Metrics, pilot feedback, publish threshold decisions
Replay pipeline	Backend engineer	Case selection, replay runner, scoring, report persistence
Tool safety policy	Backend + security reviewer	Tool classification, blocked/surrogate rules, side-effect tests
Skills UI	Frontend/design owner	Report summary, reviewer decision flow, report readability
Operations	Deployment owner	Logs, alerts, provider availability, incident response
Customer pilot	Pilot lead	Participant selection, feedback, rollout communication

3. Pre-Launch Readiness

Required Code Checks

Run backend tests from app-instance/backend:

pytest tests/unit/test_skill_learning_eval_report_model.py -v
pytest tests/unit/test_skill_learning_case_selection.py -v
pytest tests/unit/test_skill_learning_preservation.py -v
pytest tests/unit/test_skill_learning_replay.py -v
pytest tests/unit/test_skill_learning_replay_runner.py -v
pytest tests/unit/test_agent_loop_replay_executor.py -v
pytest tests/unit/test_skill_learning_surrogate.py -v
pytest tests/unit/test_skill_learning_eval.py -v
pytest tests/unit/test_skill_learning_pipeline.py -v

Run frontend verification from app-instance/frontend:

npm run lint
npm run test -- --runInBand

If frontend tests are not configured, perform manual Skills page verification with seeded report payloads.

Golden Safety Cases

Before pilot launch, create or manually verify a golden set with these cases:

Case	Expected Result
Safe filesystem read	`executed`
Safe filesystem write to replay workspace	`executed`, no production write
User-file write in replay namespace	`executed` only if isolated, otherwise `surrogate`
Web/search read	`executed` or cached read
Email send	`surrogate`
Calendar invite	`surrogate`
Connector publish/post/reply	`surrogate`
Delete/remove/destroy	`blocked`
Permission/credential/payment action	`blocked`

Launch blocker:

Any replay case mutates production workspace, user files, credentials, external accounts, permissions, or payment state.
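
The golden cases above can be expressed as a small table-driven check. This is a minimal sketch, not the project's tool policy: the action names (`fs_read`, `email_send`, and so on), the `classify` function, and the choice to block unknown tools are all assumptions made for illustration. Only the three result values come from the table.

```python
from enum import Enum


class ReplayMode(Enum):
    EXECUTED = "executed"
    SURROGATE = "surrogate"
    BLOCKED = "blocked"


# Hypothetical action names; real tool metadata will look different.
BLOCKED_ACTIONS = {"delete", "remove", "destroy", "permission", "credential", "payment"}
SURROGATE_ACTIONS = {"email_send", "calendar_invite", "connector_publish",
                     "connector_post", "connector_reply"}
SAFE_ACTIONS = {"fs_read", "replay_workspace_write", "web_read", "search_read"}


def classify(action: str, isolated: bool = False) -> ReplayMode:
    """Map an action to a replay mode, never defaulting to execution."""
    if action in BLOCKED_ACTIONS:
        return ReplayMode.BLOCKED
    if action in SURROGATE_ACTIONS:
        return ReplayMode.SURROGATE
    if action == "user_file_write":
        # User-file writes execute only inside an isolated replay namespace.
        return ReplayMode.EXECUTED if isolated else ReplayMode.SURROGATE
    if action in SAFE_ACTIONS:
        return ReplayMode.EXECUTED
    # Assumption: unclassified tools are blocked rather than executed.
    return ReplayMode.BLOCKED


# (action, isolated, expected mode), mirroring the golden safety table.
GOLDEN_CASES = [
    ("fs_read", False, ReplayMode.EXECUTED),
    ("replay_workspace_write", False, ReplayMode.EXECUTED),
    ("user_file_write", True, ReplayMode.EXECUTED),
    ("user_file_write", False, ReplayMode.SURROGATE),
    ("web_read", False, ReplayMode.EXECUTED),
    ("email_send", False, ReplayMode.SURROGATE),
    ("calendar_invite", False, ReplayMode.SURROGATE),
    ("connector_publish", False, ReplayMode.SURROGATE),
    ("delete", False, ReplayMode.BLOCKED),
    ("payment", False, ReplayMode.BLOCKED),
]


def failing_golden_cases():
    """Return the golden cases whose classification differs from the expectation."""
    return [case for case in GOLDEN_CASES if classify(case[0], case[1]) is not case[2]]
```

An empty result from `failing_golden_cases()` means every golden case classified as expected; any entry in the list should be treated as a launch blocker until explained.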

Report Readiness Checks

Each replay report must show:

Eval status.
Baseline average.
Candidate average.
Score delta.
Improved/regressed/unchanged counts.
Execution coverage.
Surrogate coverage.
Blocked coverage.
Confidence.
Replay cases.
Case reports.
Preservation report when applicable.
Raw report for debugging.
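
A completeness check for the fields listed above might look like the following sketch. The field names are assumptions chosen to match the list (only `blocked_coverage` appears by name elsewhere in this runbook), and the coverage calculation assumes coverage means the share of replayed tool calls that ended in each mode.

```python
from collections import Counter

# Hypothetical field names; the preservation report is optional and so not listed.
REQUIRED_FIELDS = (
    "eval_status", "baseline_average", "candidate_average", "score_delta",
    "improved_count", "regressed_count", "unchanged_count",
    "execution_coverage", "surrogate_coverage", "blocked_coverage",
    "confidence", "replay_cases", "case_reports", "raw_report",
)


def missing_report_fields(report: dict) -> list:
    """Return the required field names that are absent from a report payload."""
    return [name for name in REQUIRED_FIELDS if name not in report]


def coverage(modes: list) -> dict:
    """Compute the fraction of tool calls per replay mode.

    `modes` is a list of "executed", "surrogate", or "blocked" strings,
    one per replayed tool call. An empty list yields zero coverage.
    """
    total = len(modes)
    counts = Counter(modes)
    if total == 0:
        return {"execution_coverage": 0.0, "surrogate_coverage": 0.0, "blocked_coverage": 0.0}
    return {
        "execution_coverage": counts["executed"] / total,
        "surrogate_coverage": counts["surrogate"] / total,
        "blocked_coverage": counts["blocked"] / total,
    }
```

Running `missing_report_fields` against seeded report payloads is one way to catch a report that renders but silently omits a required section.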

Publish Gate Checks

Publish must fail when:

No approved review exists.
Safety report is missing or failed.
Eval report failed (an explicit skipped-provider status is the only exception).
Replay confidence is low.
Replay blocked coverage is 1.0.
Preservation report failed.

Publish may proceed with explicit human review when:

Provider is unavailable and eval status is `skipped_provider_unavailable`.
Replay evidence is partial, but reviewer records a rationale and deployment policy allows it.

4. Rollout Plan

Phase 0: Shadow Mode

Audience: internal team only.

Duration: 1 week or 10 draft evaluations, whichever comes first.

Behavior:

Generate replay reports.
Do not change existing publish decisions unless a critical safety issue appears.
Compare replay recommendation with human reviewer decision.
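
One simple way to compare replay recommendations with reviewer decisions during shadow mode is an agreement rate plus a list of disagreements to inspect by hand. The sketch below assumes both sides are recorded as short strings such as "approve" and "reject"; that encoding is hypothetical.

```python
def shadow_agreement(pairs):
    """Compare replay recommendations with human reviewer decisions.

    `pairs` holds (replay_recommendation, human_decision) tuples.
    Returns the agreement rate and the indexes of disagreeing drafts.
    """
    if not pairs:
        return 0.0, []
    disagreements = [i for i, (replay, human) in enumerate(pairs) if replay != human]
    agreement_rate = (len(pairs) - len(disagreements)) / len(pairs)
    return agreement_rate, disagreements
```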

Exit criteria:

No production side effects.
No unexplained replay crashes on common drafts.
Reviewers can explain report meaning.
Product owner reviews gate threshold data.

Phase 1: Strict Internal Gate

Audience: internal maintainers and trusted reviewers.

Behavior:

Enforce low-confidence, blocked coverage, failed preservation, failed eval, and failed safety gates.
Require manual rationale for skipped-provider publish.

Exit criteria:

0 P0 incidents.
Publish blockers are actionable and not noisy.
Reviewer median decision time under 10 minutes for common drafts.

Phase 2: Pilot Customer Gate

Audience: selected pilot customer or internal department.

Behavior:

Keep human review mandatory.
Provide customer-facing explanation of replay evidence.
Track skipped-provider and low-confidence cases closely.

Exit criteria:

Pilot admin accepts report as useful governance evidence.
No side-effect incidents.
Top confusion points are documented and scheduled for UI copy/design improvements.

Phase 3: General Availability Candidate

Audience: all enabled deployments.

Behavior:

Replay Eval enabled by default where provider and case data are available.
Skipped-provider state remains explicit.
Tool policy remains conservative.

Exit criteria:

Operational dashboard exists.
Incident response is rehearsed.
Policy change process is documented.

5. Monitoring

Product Metrics

Metric	Owner	Cadence	Alert
Trusted Skill Publish Rate	Product	Weekly	<60% for 2 weeks
Reviewer Decision Time	Product/design	Weekly	p95 >30 minutes
Replay Regression Rate	Product/engineering	Weekly	>20% of replay reports
Report Comprehension	Product/design	Per research round	<80% of participants explain coverage/confidence correctly

Operational Metrics

Metric	Owner	Cadence	Alert
Replay status counts	Engineering	Daily during pilot	Any spike in `replay_error` or `partial`
Provider unavailable skip rate	Operations	Daily	>25% of evals in pilot
Replay latency p50/p95	Engineering	Daily	p95 >15 minutes
Blocked coverage	Security/engineering	Weekly	Any report with `blocked_coverage=1.0`
Production side-effect incidents	Security/operations	Immediate	Any nonzero event
Failed preservation reports	Product/engineering	Weekly	Spike after synthesizer change
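
The numeric thresholds in the operational table can be checked mechanically. The function below is a sketch that assumes the inputs have already been aggregated elsewhere; its name and parameters are hypothetical, and the spike-based alerts are left out because the table does not define a numeric threshold for them.

```python
def operational_alerts(skip_rate, latency_p95_minutes, blocked_coverages, side_effect_incidents):
    """Return the alerts triggered by the thresholds in the operational metrics table.

    skip_rate: fraction of evals skipped because the provider was unavailable.
    latency_p95_minutes: p95 replay latency in minutes.
    blocked_coverages: blocked_coverage value of each report in the window.
    side_effect_incidents: count of production side-effect incidents.
    """
    alerts = []
    if skip_rate > 0.25:
        alerts.append("provider unavailable skip rate above 25%")
    if latency_p95_minutes > 15:
        alerts.append("replay latency p95 above 15 minutes")
    if any(value >= 1.0 for value in blocked_coverages):
        alerts.append("report with blocked_coverage=1.0")
    if side_effect_incidents > 0:
        alerts.append("production side-effect incident")
    return alerts
```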

Logs To Inspect

Skill learning candidate events.
Draft creation and safety report events.
Eval report generation events.
Replay arm run IDs and entries with source `skill_replay_eval`.
Tool traces and classification reasons.
Publish gate errors.
Provider unavailable errors.

6. Incident Response

P0: Production Side Effect During Replay

Examples:

Email sent.
Calendar invite created.
External connector publish/post/reply happened.
Production file or credential changed.
Permission/payment action executed.

Immediate actions:

Disable replay eval generation.
Disable skill publish if policy risk is unclear.
Preserve logs, replay traces, eval reports, and affected tool metadata.
Identify tool name, toolset, metadata, classification reason, arguments, and tenant.
Patch the policy so the affected tool class is blocked or handled in surrogate mode.
Add a regression test to golden safety cases.
Notify pilot/customer owner if customer data or systems were affected.

Restart criteria:

Root cause documented.
Regression test passes.
Security owner approves restart.

P1: False Pass

Definition: a draft passed replay and was published, and was later confirmed to regress a real accepted workflow.

Actions:

Unpublish or revert skill version if impact is active.
Add the failed task as a replay case.
Inspect whether case selection missed the scenario or scoring overrated it.
Adjust gate threshold, surrogate scoring, or preservation check.
Record postmortem in skill quality log.

P1: False Block

Definition: a useful draft was blocked because of a faulty replay policy, a bug that produced a low-confidence result, or a report construction issue.

Actions:

Do not bypass silently; record reviewer rationale.
Identify blocking rule and trace.
Add a regression test if the cause is a policy bug.
Decide whether threshold should change or case should remain blocked.

P2: Provider Unavailable Spike

Actions:

Check provider configuration and model availability.
Confirm fallback status is explicit.
Track how many publish decisions rely on skipped-provider.
Pause broad rollout if skipped-provider exceeds pilot threshold.

7. Maintenance Cadence

Daily During Pilot

Check replay errors and provider skips.
Check `blocked_coverage=1.0` reports.
Confirm no side-effect incidents.
Review new publish gate failures.

Weekly

Review metrics dashboard.
Calibrate publish gate thresholds.
Review 3-5 replay reports for readability.
Inspect false pass/false block candidates.
Update tool policy based on new tools or connectors.

Monthly

Review customer/pilot feedback.
Refresh golden safety cases.
Sample preservation reports for missed instruction drops.
Review storage growth from replay case reports and traces.
Decide whether to promote features from Should Have to Must Have.

Quarterly

Revisit risk model and tool policy profiles.
Review whether LLM surrogate calibration meets quality target.
Decide whether to add audit export or per-deployment policy UI.
Retire stale replay cases or update case selection logic.

8. Data Retention And Privacy

Replay reports may contain task text, tool arguments, schemas, final answers, and side-effect descriptions. Treat them as sensitive operational data.

Recommended policy:

Store summarized report for normal review.
Limit raw case report retention or restrict access to admins.
Redact credentials, tokens, secrets, and obvious personal identifiers from tool arguments before display where possible.
Do not include production external write results in reports; such writes should never execute during replay.
Define tenant-specific retention before enterprise rollout.
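
Redaction of tool arguments before display could start from something like the sketch below. It is deliberately simple and rests on assumptions: the key-name pattern and the email pattern are illustrative, not a complete definition of what counts as a secret or a personal identifier, and a real deployment would need a reviewed rule set.

```python
import re

# Assumption: argument keys with these substrings hold secrets.
SENSITIVE_KEY = re.compile(
    r"(password|passwd|secret|token|api[_-]?key|credential|authorization)",
    re.IGNORECASE,
)
# Assumption: email addresses are the only personal identifier handled here.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")


def redact(value, key=""):
    """Return a copy of tool arguments with secrets and emails masked."""
    if SENSITIVE_KEY.search(str(key)):
        # Mask the whole value, including nested structures, under a sensitive key.
        return "[REDACTED]"
    if isinstance(value, dict):
        return {k: redact(v, k) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(item, key) for item in value]
    if isinstance(value, str):
        return EMAIL.sub("[EMAIL]", value)
    return value
```

The function builds a new structure rather than mutating its input, so the raw report kept for admin-only debugging is not altered by preparing a display copy.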

9. Release Communication

Internal Message

Skill Replay Eval adds evidence to skill publishing. Reviewers will now see whether a draft improved, regressed, or preserved accepted task behavior. Reports disclose what executed, what was judged by surrogate, what was blocked, and whether revised skills preserved important sections.

Customer / Pilot Message

Beaver can now evaluate reusable skill drafts against prior accepted work before publication. The report shows both confidence and uncertainty. Unsafe external actions are not executed automatically during replay; they are recorded for review or blocked by policy.

Known Limitations To Disclose

Replay quality depends on available accepted historical runs.
Surrogate evaluation is not the same as real execution.
Low-confidence reports require more human review.
Human approval is still required.
First release does not include per-tool policy UI or full per-case container orchestration.

10. Rollback Plan

Rollback options:

Disable replay runner injection and fall back to heuristic eval.
Keep report fields but set mode to heuristic.
Keep publish gate requiring safety and human review.
Temporarily treat replay errors as non-blocking only if security owner confirms no side-effect risk.
Preserve failed replay reports for debugging.

Rollback triggers:

Any P0 side-effect incident.
Repeated replay errors that block normal skill review.
Provider unavailable spike that makes most reports skipped.
Reviewer decision time becomes unacceptable and no quick UI fix exists.

11. Launch Checklist

Backend replay tests pass.
Frontend report rendering verified.
Golden tool safety cases pass.
No production side-effect path found.
Publish gates tested manually.
Skipped-provider copy is clear.
Reviewer decision summary exists or is tracked as a launch follow-up.
Pilot participants selected.
Metrics dashboard owner assigned.
Incident owner and escalation path assigned.
Rollback path verified.
