# Skill Replay Eval Launch And Maintenance Runbook

Date: 2026-06-09

Purpose: define how to validate, launch, operate, and maintain Skill Replay Eval safely.

## 1. Launch Principle

Ship Skill Replay Eval as a guarded trust feature.

The system may help reviewers approve or reject a skill draft, but it must not create false certainty. When evidence is weak, the product should say so clearly. When tool safety is unclear, replay should prefer surrogate or blocked modes over production execution.

## 2. Ownership

| Area | Owner | Responsibility |
| --- | --- | --- |
| Product quality | Product owner | Metrics, pilot feedback, publish threshold decisions |
| Replay pipeline | Backend engineer | Case selection, replay runner, scoring, report persistence |
| Tool safety policy | Backend + security reviewer | Tool classification, blocked/surrogate rules, side-effect tests |
| Skills UI | Frontend/design owner | Report summary, reviewer decision flow, report readability |
| Operations | Deployment owner | Logs, alerts, provider availability, incident response |
| Customer pilot | Pilot lead | Participant selection, feedback, rollout communication |

## 3. Pre-Launch Readiness

### Required Code Checks

Run backend tests from `app-instance/backend`:

```bash
pytest tests/unit/test_skill_learning_eval_report_model.py -v
pytest tests/unit/test_skill_learning_case_selection.py -v
pytest tests/unit/test_skill_learning_preservation.py -v
pytest tests/unit/test_skill_learning_replay.py -v
pytest tests/unit/test_skill_learning_replay_runner.py -v
pytest tests/unit/test_agent_loop_replay_executor.py -v
pytest tests/unit/test_skill_learning_surrogate.py -v
pytest tests/unit/test_skill_learning_eval.py -v
pytest tests/unit/test_skill_learning_pipeline.py -v
```

Run frontend verification from `app-instance/frontend`:

```bash
npm run lint
npm run test -- --runInBand
```

If frontend tests are not configured, perform manual Skills page verification with seeded report payloads.

### Golden Safety Cases

Before pilot launch, create or manually verify a golden set with these cases:

| Case | Expected Result |
| --- | --- |
| Safe filesystem read | `executed` |
| Safe filesystem write to replay workspace | `executed`, no production write |
| User-file write in replay namespace | `executed` only if isolated, otherwise `surrogate` |
| Web/search read | `executed` or cached read |
| Email send | `surrogate` |
| Calendar invite | `surrogate` |
| Connector publish/post/reply | `surrogate` |
| Delete/remove/destroy | `blocked` |
| Permission/credential/payment action | `blocked` |

Launch blocker:

- Any replay case mutates production workspace, user files, credentials, external accounts, permissions, or payment state.
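
The golden table above can be pinned down as a small regression check. The sketch below is illustrative only: the action class names, the `classify` function, and the default-to-`blocked` rule for unknown classes are assumptions, not the actual policy module.

```python
# Hypothetical action classes mirroring the golden safety table. The real
# classifier and its tool metadata are not shown in this runbook.
EXECUTED_CLASSES = {"filesystem_read", "replay_workspace_write", "web_search_read"}
SURROGATE_CLASSES = {"email_send", "calendar_invite", "connector_publish"}
BLOCKED_CLASSES = {"delete", "permission_change", "credential_change", "payment"}


def classify(action_class: str, isolated: bool = False) -> str:
    """Return the replay mode for a tool action class."""
    if action_class == "user_file_write":
        # User-file writes execute only inside an isolated replay namespace.
        return "executed" if isolated else "surrogate"
    if action_class in EXECUTED_CLASSES:
        return "executed"
    if action_class in SURROGATE_CLASSES:
        return "surrogate"
    if action_class in BLOCKED_CLASSES:
        return "blocked"
    # Assumption: unknown classes fall back to "blocked", the more
    # conservative of the two options the launch principle allows.
    return "blocked"


# (action class, isolated?, expected mode), one row per golden case.
GOLDEN_CASES = [
    ("filesystem_read", False, "executed"),
    ("replay_workspace_write", False, "executed"),
    ("user_file_write", True, "executed"),
    ("user_file_write", False, "surrogate"),
    ("web_search_read", False, "executed"),
    ("email_send", False, "surrogate"),
    ("calendar_invite", False, "surrogate"),
    ("connector_publish", False, "surrogate"),
    ("delete", False, "blocked"),
    ("payment", False, "blocked"),
]


def failing_golden_cases() -> list[tuple[str, bool, str]]:
    """Return every golden case whose classification differs from the table."""
    return [c for c in GOLDEN_CASES if classify(c[0], c[1]) != c[2]]
```

An empty result from `failing_golden_cases()` means the sketched policy matches the table. Any non-empty result should be treated as a launch blocker until explained.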

### Report Readiness Checks

Each replay report must show:

- Eval status.
- Baseline average.
- Candidate average.
- Score delta.
- Improved/regressed/unchanged counts.
- Execution coverage.
- Surrogate coverage.
- Blocked coverage.
- Confidence.
- Replay cases.
- Case reports.
- Preservation report when applicable.
- Raw report for debugging.
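
For readers checking report numbers by hand, the summary fields can be derived roughly as follows. This is a sketch under stated assumptions: it treats each coverage value as the fraction of replayed tool calls in that mode and the score delta as candidate average minus baseline average. The runbook does not define either formula, and the function names are hypothetical.

```python
from collections import Counter
from statistics import mean

MODES = ("executed", "surrogate", "blocked")


def coverage_by_mode(call_modes: list[str]) -> dict[str, float]:
    """Fraction of tool calls per replay mode (assumed definition of coverage)."""
    total = len(call_modes)
    if total == 0:
        # No tool calls replayed: report zero coverage for every mode.
        return {mode: 0.0 for mode in MODES}
    counts = Counter(call_modes)
    return {mode: counts[mode] / total for mode in MODES}


def score_delta(baseline_scores: list[float], candidate_scores: list[float]) -> float:
    """Candidate average minus baseline average (assumed definition of delta)."""
    return mean(candidate_scores) - mean(baseline_scores)


coverage = coverage_by_mode(["executed", "executed", "surrogate", "blocked"])
delta = score_delta([0.5, 0.5], [0.75, 0.75])
```

With these inputs, half of the calls executed, a quarter went to surrogate, a quarter were blocked, and the candidate scored 0.25 higher on average.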

### Publish Gate Checks

Publish must fail when:

- No approved review exists.
- Safety report is missing or failed.
- Eval report failed. The only exception is an explicit `skipped_provider_unavailable` status, which follows the human-review path below.
- Replay confidence is low.
- Replay blocked coverage is `1.0`.
- Preservation report failed.

Publish may proceed with explicit human review when:

- Provider is unavailable and eval status is `skipped_provider_unavailable`.
- Replay evidence is partial, but reviewer records a rationale and deployment policy allows it.
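
The gate rules above can be expressed as a function that returns every blocker, not just the first one. This is a minimal sketch: the `DraftEvidence` fields and the `"passed"`/`"failed"`/`"low"` values are assumed names. Only `skipped_provider_unavailable` and the `1.0` blocked-coverage rule come from this runbook.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DraftEvidence:
    # All field names here are hypothetical, chosen to mirror the gate list.
    review_approved: bool
    safety_status: Optional[str]        # "passed", "failed", or None if missing
    eval_status: str                    # "passed", "failed", "skipped_provider_unavailable"
    confidence: str                     # "low", "medium", "high"
    blocked_coverage: float
    preservation_status: Optional[str]  # None when no preservation report applies
    reviewer_rationale: str = ""


def publish_blockers(e: DraftEvidence) -> list[str]:
    """Return all reasons publish must fail; an empty list means it may proceed."""
    blockers = []
    if not e.review_approved:
        blockers.append("no approved review")
    if e.safety_status != "passed":
        blockers.append("safety report missing or failed")
    if e.eval_status == "failed":
        blockers.append("eval report failed")
    if e.eval_status == "skipped_provider_unavailable" and not e.reviewer_rationale.strip():
        # Skipped-provider publish is allowed only with a recorded rationale.
        blockers.append("skipped-provider publish needs reviewer rationale")
    if e.confidence == "low":
        blockers.append("replay confidence is low")
    if e.blocked_coverage >= 1.0:
        blockers.append("replay blocked coverage is 1.0")
    if e.preservation_status == "failed":
        blockers.append("preservation report failed")
    return blockers
```

Returning the full list rather than raising on the first failure keeps publish blockers actionable, since a reviewer sees everything that must be fixed in one pass.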

## 4. Rollout Plan

### Phase 0: Shadow Mode

Audience: internal team only.

Duration: 1 week or 10 draft evaluations, whichever comes first.

Behavior:

- Generate replay reports.
- Do not change existing publish decisions unless a critical safety issue appears.
- Compare replay recommendation with human reviewer decision.
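
One way to make the shadow-mode comparison concrete is an agreement rate between replay recommendations and human decisions. The sketch below assumes both are recorded as simple labels such as `"approve"` and `"reject"`. The record shape and function names are hypothetical.

```python
def agreement_rate(pairs: list[tuple[str, str]]) -> float:
    """Share of drafts where the replay recommendation matched the reviewer."""
    if not pairs:
        raise ValueError("need at least one (replay, human) pair")
    matches = sum(1 for replay, human in pairs if replay == human)
    return matches / len(pairs)


def disagreements(pairs: list[tuple[str, str]]) -> list[int]:
    """Indexes of drafts worth a manual look because the two decisions differ."""
    return [i for i, (replay, human) in enumerate(pairs) if replay != human]


shadow_pairs = [
    ("approve", "approve"),
    ("reject", "approve"),
    ("approve", "approve"),
    ("reject", "reject"),
]
rate = agreement_rate(shadow_pairs)
to_review = disagreements(shadow_pairs)
```

The disagreement list is usually more useful than the rate itself during shadow mode, because each entry is a candidate false pass or false block to inspect before gates are enforced.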

Exit criteria:

- No production side effects.
- No unexplained replay crashes on common drafts.
- Reviewers can explain report meaning.
- Product owner reviews gate threshold data.

### Phase 1: Strict Internal Gate

Audience: internal maintainers and trusted reviewers.

Behavior:

- Enforce low-confidence, blocked coverage, failed preservation, failed eval, and failed safety gates.
- Require manual rationale for skipped-provider publish.

Exit criteria:

- 0 P0 incidents.
- Publish blockers are actionable and not noisy.
- Reviewer median decision time under 10 minutes for common drafts.

### Phase 2: Pilot Customer Gate

Audience: selected pilot customer or internal department.

Behavior:

- Keep human review mandatory.
- Provide customer-facing explanation of replay evidence.
- Track skipped-provider and low-confidence cases closely.

Exit criteria:

- Pilot admin accepts report as useful governance evidence.
- No side-effect incidents.
- Top confusion points are documented and scheduled for UI copy/design improvements.

### Phase 3: General Availability Candidate

Audience: all enabled deployments.

Behavior:

- Replay Eval enabled by default where provider and case data are available.
- Skipped-provider state remains explicit.
- Tool policy remains conservative.

Exit criteria:

- Operational dashboard exists.
- Incident response is rehearsed.
- Policy change process is documented.

## 5. Monitoring

### Product Metrics

| Metric | Owner | Cadence | Alert |
| --- | --- | --- | --- |
| Trusted Skill Publish Rate | Product | Weekly | <60% for 2 weeks |
| Reviewer Decision Time | Product/design | Weekly | p95 >30 minutes |
| Replay Regression Rate | Product/engineering | Weekly | >20% of replay reports |
| Report Comprehension | Product/design | Per research round | <80% explain coverage/confidence correctly |
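
Two of the alert conditions above are easy to misread, so here is a sketch of how they might be evaluated. It assumes "for 2 weeks" means the two most recent consecutive weekly values and uses a nearest-rank p95. Both are interpretations, and the function names are hypothetical.

```python
def publish_rate_alert(weekly_rates: list[float], threshold: float = 0.60, weeks: int = 2) -> bool:
    """True when the last `weeks` weekly publish rates are all below threshold."""
    if len(weekly_rates) < weeks:
        return False  # not enough history to alert yet
    return all(rate < threshold for rate in weekly_rates[-weeks:])


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    if not values:
        raise ValueError("p95 needs at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def decision_time_alert(minutes: list[float], limit_minutes: float = 30.0) -> bool:
    """True when p95 reviewer decision time exceeds the limit."""
    return p95(minutes) > limit_minutes
```

For example, weekly rates of 0.70, 0.55, 0.58 would raise the publish-rate alert, while 0.55 followed by 0.70 would not.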

### Operational Metrics

| Metric | Owner | Cadence | Alert |
| --- | --- | --- | --- |
| Replay status counts | Engineering | Daily during pilot | Any spike in `replay_error` or `partial` |
| Provider unavailable skip rate | Operations | Daily | >25% of evals in pilot |
| Replay latency p50/p95 | Engineering | Daily | p95 >15 minutes |
| Blocked coverage | Security/engineering | Weekly | Any report with `blocked_coverage=1.0` |
| Production side-effect incidents | Security/operations | Immediate | Any nonzero event |
| Failed preservation reports | Product/engineering | Weekly | Spike after synthesizer change |
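
The numeric thresholds in this table could be checked against a daily snapshot along these lines. This is a sketch only: the `DailySnapshot` shape and its field names are assumptions. The "spike" alerts are left out because the runbook does not define a numeric threshold for them.

```python
from dataclasses import dataclass, field


@dataclass
class DailySnapshot:
    # Hypothetical aggregate of one day's replay activity.
    evals_total: int
    evals_skipped_provider: int
    latency_p95_minutes: float
    blocked_coverages: list[float] = field(default_factory=list)
    side_effect_incidents: int = 0


def operational_alerts(s: DailySnapshot) -> list[str]:
    """Return the names of every threshold from the table that is breached."""
    alerts = []
    if s.evals_total > 0 and s.evals_skipped_provider / s.evals_total > 0.25:
        alerts.append("provider_skip_rate")
    if s.latency_p95_minutes > 15:
        alerts.append("replay_latency_p95")
    if any(c >= 1.0 for c in s.blocked_coverages):
        alerts.append("blocked_coverage_full")
    if s.side_effect_incidents > 0:
        alerts.append("production_side_effect")
    return alerts
```

A `production_side_effect` alert maps to the P0 procedure in the incident response section. The others are review triggers, not automatic rollbacks.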

### Logs To Inspect

- Skill learning candidate events.
- Draft creation and safety report events.
- Eval report generation events.
- Replay arm run IDs, filtered by source `skill_replay_eval`.
- Tool traces and classification reasons.
- Publish gate errors.
- Provider unavailable errors.

## 6. Incident Response

### P0: Production Side Effect During Replay

Examples:

- Email sent.
- Calendar invite created.
- External connector publish/post/reply happened.
- Production file or credential changed.
- Permission/payment action executed.

Immediate actions:

1. Disable replay eval generation.
2. Disable skill publish if policy risk is unclear.
3. Preserve logs, replay traces, eval reports, and affected tool metadata.
4. Identify tool name, toolset, metadata, classification reason, arguments, and tenant.
5. Patch the tool policy so the affected tool class is either blocked or routed to surrogate.
6. Add a regression test to golden safety cases.
7. Notify pilot/customer owner if customer data or systems were affected.

Restart criteria:

- Root cause documented.
- Regression test passes.
- Security owner approves restart.

### P1: False Pass

Definition: a draft passed replay and was published, and was later confirmed to regress a real accepted workflow.

Actions:

1. Unpublish or revert skill version if impact is active.
2. Add the failed task as a replay case.
3. Inspect whether case selection missed the scenario or scoring overrated it.
4. Adjust gate threshold, surrogate scoring, or preservation check.
5. Record postmortem in skill quality log.

### P1: False Block

Definition: a useful draft was blocked because of an incorrect replay policy, a bug that wrongly reported low confidence, or a report construction issue.

Actions:

1. Do not bypass silently; record reviewer rationale.
2. Identify blocking rule and trace.
3. Add a regression test if the cause is a policy bug.
4. Decide whether threshold should change or case should remain blocked.

### P2: Provider Unavailable Spike

Actions:

1. Check provider configuration and model availability.
2. Confirm fallback status is explicit.
3. Track how many publish decisions rely on skipped-provider.
4. Pause broad rollout if skipped-provider exceeds pilot threshold.

## 7. Maintenance Cadence

### Daily During Pilot

- Check replay errors and provider skips.
- Check reports with `blocked_coverage=1.0`.
- Confirm no side-effect incidents.
- Review new publish gate failures.

### Weekly

- Review metrics dashboard.
- Calibrate publish gate thresholds.
- Review 3-5 replay reports for readability.
- Inspect false pass/false block candidates.
- Update tool policy based on new tools or connectors.

### Monthly

- Review customer/pilot feedback.
- Refresh golden safety cases.
- Sample preservation reports for missed instruction drops.
- Review storage growth from replay case reports and traces.
- Decide whether to promote features from Should Have to Must Have.

### Quarterly

- Revisit risk model and tool policy profiles.
- Review whether LLM surrogate calibration meets quality target.
- Decide whether to add audit export or per-deployment policy UI.
- Retire stale replay cases or update case selection logic.

## 8. Data Retention And Privacy

Replay reports may contain task text, tool arguments, schemas, final answers, and side-effect descriptions. Treat them as sensitive operational data.

Recommended policy:

- Store summarized report for normal review.
- Limit raw case report retention or restrict access to admins.
- Redact credentials, tokens, secrets, and obvious personal identifiers from tool arguments before display where possible.
- Do not store results of production external writes. Replay must never execute such writes, so no such results should exist.
- Define tenant-specific retention before enterprise rollout.
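
The redaction bullet above could be approached with a recursive pass over tool arguments before display. The sketch below is a best-effort illustration: the key patterns, the email regex, and the placeholder strings are assumptions and will miss secrets stored under unusual key names.

```python
import re
from typing import Any

# Assumed patterns; a real deployment would tune these to its own tools.
SENSITIVE_KEY = re.compile(
    r"(password|passwd|secret|token|api[_-]?key|credential|authorization)",
    re.IGNORECASE,
)
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(value: Any, key: str = "") -> Any:
    """Return a copy of tool arguments with secrets and emails masked."""
    if SENSITIVE_KEY.search(key):
        # Mask the whole value, whatever its type, when the key looks sensitive.
        return "[REDACTED]"
    if isinstance(value, dict):
        return {k: redact(v, str(k)) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(item, key) for item in value]
    if isinstance(value, str):
        return EMAIL.sub("[REDACTED_EMAIL]", value)
    return value


tool_args = {
    "to": "alice@example.com",
    "api_key": "abc123",
    "nested": {"token": "x", "n": 3},
    "items": ["bob@example.org"],
}
safe_args = redact(tool_args)
```

Because the function builds new containers instead of mutating its input, the raw arguments stay available for admin-only debugging while reviewers see only `safe_args`.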

## 9. Release Communication

### Internal Message

Skill Replay Eval adds evidence to skill publishing. Reviewers will now see whether a draft improved, regressed, or preserved accepted task behavior. Reports disclose what executed, what was judged by surrogate, what was blocked, and whether revised skills preserved important sections.

### Customer / Pilot Message

Beaver can now evaluate reusable skill drafts against prior accepted work before publication. The report shows both confidence and uncertainty. Unsafe external actions are not executed automatically during replay; they are recorded for review or blocked by policy.

### Known Limitations To Disclose

- Replay quality depends on available accepted historical runs.
- Surrogate evaluation is not the same as real execution.
- Low-confidence reports require more human review.
- Human approval is still required.
- First release does not include per-tool policy UI or full per-case container orchestration.

## 10. Rollback Plan

Rollback options:

1. Disable replay runner injection and fall back to heuristic eval.
2. Keep report fields but set mode to `heuristic`.
3. Keep publish gate requiring safety and human review.
4. Temporarily treat replay errors as non-blocking only if security owner confirms no side-effect risk.
5. Preserve failed replay reports for debugging.

Rollback triggers:

- Any P0 side-effect incident.
- Repeated replay errors that block normal skill review.
- Provider unavailable spike that makes most reports skipped.
- Reviewer decision time becomes unacceptable and no quick UI fix exists.

## 11. Launch Checklist

- [ ] Backend replay tests pass.
- [ ] Frontend report rendering verified.
- [ ] Golden tool safety cases pass.
- [ ] No production side-effect path found.
- [ ] Publish gates tested manually.
- [ ] Skipped-provider copy is clear.
- [ ] Reviewer decision summary exists or is tracked as a launch follow-up.
- [ ] Pilot participants selected.
- [ ] Metrics dashboard owner assigned.
- [ ] Incident owner and escalation path assigned.
- [ ] Rollback path verified.