# PRD: Skill Replay Eval

Date: 2026-06-09

Status: Product discovery complete; implementation validation required

## 1. Summary

Skill Replay Eval is Beaver's evidence-based quality gate for reusable Agent skills. It replays a skill draft against accepted historical task runs and compares baseline and candidate behavior. It reports how many tool calls were executed, judged by surrogate, or blocked, and it checks preservation for revised and merged skills. Together these help reviewers decide whether a draft can be published.

The goal is not to replace human review. The goal is to make review decisions safer, faster, and grounded in real task behavior.

## 2. Contacts

| Role | Owner | Comment |
| --- | --- | --- |
| Product | TBD | Owns scope, rollout, customer research, metrics |
| Engineering | TBD | Owns replay runner, tool policy, eval report, UI wiring |
| Design | TBD | Owns reviewer decision flow and report comprehension |
| Security / IT reviewer | TBD | Owns replay side-effect policy and launch approval |
| Customer pilot lead | TBD | Owns pilot participant selection and feedback loop |

## 3. Background

Beaver's product promise is that successful AI tasks can become reusable skills. This is valuable only if skill publishing is trustworthy. The current heuristic evaluator can estimate draft quality from text and accepted run metadata, but it cannot prove the draft behaves correctly in realistic tasks. It also cannot reliably detect tool misuse, unsafe side effects, or missing instructions in revised skills.

The new design introduces replay-style evaluation:

- Select accepted historical task cases.
- Run a baseline arm and a candidate arm.
- Execute safe tools in a replay context.
- Record unsafe or unavailable tools for surrogate judgment.
- Block destructive actions.
- Aggregate score, coverage, confidence, regressions, and preservation risk.
- Show the report in the Skills review page.
- Use publish gates to prevent low-confidence or unsafe releases.

Why now:

- Beaver already has task evidence, accepted runs, skill candidates, skill drafts, safety reports, eval reports, review, and publish flow.
- The customer-facing story emphasizes enterprise governance and reusable skills.
- Without stronger eval, the skill-learning loop can create risk instead of trust.

## 4. Objective

### Objective

Make skill publishing evidence-based and safe enough for enterprise pilot use.

### Why It Matters

For customers, Skill Replay Eval turns Beaver from "an Agent that can learn" into "an Agent platform with controlled learning." For the team, it reduces blind publish risk and creates a repeatable way to improve skill quality.

### Key Results

| Key Result | Target |
| --- | --- |
| Trusted Skill Publish Rate | >=80% of approved drafts have replay evidence or explicit skipped-provider evidence during pilot |
| Replay Side-Effect Safety | 0 production side-effect incidents caused by replay |
| Reviewer Decision Time | Median approve/reject/revise decision under 10 minutes for common drafts |
| Report Comprehension | >=80% of reviewers correctly explain execution, surrogate, blocked, and confidence meanings in usability tests |
| Regression Visibility | 100% of replay reports expose regression count, score delta, and case-level details |
| Preservation Visibility | 100% of revise/merge replay reports with base content expose preservation result |

## 5. Market Segments

### Primary Segment: Enterprise AI Platform Teams

They want private or controlled Agent deployment, reusable workflows, governance, and auditability. They need evidence before reusable skills are distributed.

### Secondary Segment: Internal Workflow Teams

They run repeatable knowledge workflows such as reports, support, project delivery, file processing, or research. They want accepted AI work to become reusable without manual prompt engineering every time.

### Internal Segment: Beaver Operators And Engineers

They need debuggable replay behavior, predictable tool policies, and operational visibility.

### Constraints

- Replay must not execute production external writes by default.
- Replay should reuse the existing stores and skill-learning pipeline where possible.
- Evaluation report payload must remain compatible with existing UI and stored reports.
- First release should cap replay case count to control latency and cost.
- Human review remains mandatory.

## 6. Value Propositions

### For Skill Reviewers

Pain avoided: approving a skill by reading text only.

Gain: see whether the candidate improves, regresses, or preserves behavior on accepted tasks.

### For Enterprise Admins

Pain avoided: uncontrolled AI learning that silently changes team behavior.

Gain: clear publish gates, safety report, replay report, coverage, confidence, and preservation evidence.

### For Workflow Owners

Pain avoided: successful task patterns disappearing into chat history.

Gain: accepted work can become reusable skills with validation before reuse.

### For Engineers

Pain avoided: debugging vague "skill quality" complaints.

Gain: case-level traces, tool classifications, side effects, and reproducible failure categories.

## 7. Solution

### 7.1 UX / User Flow

Primary reviewer flow:

```text
Skill candidate generated
  -> draft created
  -> safety report generated
  -> replay eval report generated
  -> reviewer opens Skills draft page
  -> reviewer reads summary: pass/fail, baseline, candidate, delta, coverage, confidence
  -> reviewer drills into cases, tool calls, side effects, preservation report
  -> reviewer approves, requests revision, or rejects
  -> publish gate enforces safety, eval, confidence, blocked coverage, preservation
```

Required UI behavior:

- Show report status first: passed, failed, skipped provider, replay error, or partial.
- Show baseline average, candidate average, and score delta.
- Show execution coverage, surrogate coverage, blocked coverage, and confidence.
- Show improved, regressed, and unchanged case counts.
- Show replay cases in a compact table.
- Show raw case reports only after the summary.
- Show preservation report for revise/merge drafts.
- Use clear wording for skipped-provider reports: state that no replay was run, so no replay evidence exists.

Recommended UI improvement:

- Add a reviewer decision summary above raw details:
  - "Recommended action: Approve / Revise / Reject / Needs manual review"
  - "Reason: low confidence, preservation failure, regression, or blocked calls"

### 7.2 Key Features

#### Historical Case Selection

Requirements:

- Select up to 10 accepted historical runs.
- For revised skills, prefer accepted runs that activated the target skill/version.
- For new skills, use candidate source runs or similar task themes.
- For merged skills, use accepted runs where related skills co-activated.
- Prefer recent accepted runs and diversify repeated tasks.

Acceptance criteria:

- Case selection returns no more than 10 cases.
- Failed or unaccepted runs are excluded.
- Baseline skill names are populated for revise and merge candidates.
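
The selection rules above can be sketched as a small filter-and-rank step. This is a minimal illustration, not the pipeline's actual selector: `HistoricalRun`, its fields, and `select_replay_cases` are hypothetical names, and "diversify repeated tasks" is approximated here by dropping runs whose normalized task text was already picked.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MAX_REPLAY_CASES = 10  # cap taken from the requirement above


@dataclass(frozen=True)
class HistoricalRun:
    """Hypothetical shape of a stored task run; field names are assumptions."""
    run_id: str
    task_text: str
    accepted: bool
    finished_at: float  # Unix timestamp; larger means more recent
    activated_skills: Tuple[str, ...] = ()


def select_replay_cases(
    runs: List[HistoricalRun],
    target_skill: Optional[str] = None,
    limit: int = MAX_REPLAY_CASES,
) -> List[HistoricalRun]:
    """Pick accepted runs, preferring target-skill activations, then recency."""
    accepted = [r for r in runs if r.accepted]  # failed/unaccepted runs are excluded
    accepted.sort(
        key=lambda r: (
            target_skill is not None and target_skill in r.activated_skills,
            r.finished_at,
        ),
        reverse=True,
    )
    selected: List[HistoricalRun] = []
    seen_tasks = set()
    for run in accepted:
        if len(selected) >= limit:
            break
        task_key = " ".join(run.task_text.lower().split())
        if task_key in seen_tasks:
            continue  # skip a repeat of a task that is already represented
        seen_tasks.add(task_key)
        selected.append(run)
    return selected
```

A real selector would likely use task-theme similarity rather than exact text matching to diversify cases; the exact-match rule only keeps the sketch short.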

#### Baseline And Candidate Replay Arms

Requirements:

- Run the same task text for both arms.
- Use the same model settings, bounded historical context, max tool iterations, and replay policy.
- Baseline arm depends on candidate type: no skill for a new skill, the old skill for a revision, and the related old skills for a merge.
- Candidate arm injects the draft as pinned draft guidance.

Acceptance criteria:

- Both arms produce run id, session id, final answer, finish reason, tool calls, side effects, and artifacts.
- Replay runs are marked with source `skill_replay_eval`.
- Replay does not create user-visible normal task sessions.
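
One way to keep the two arms comparable is to build both from a single immutable spec so that only the skill guidance differs. The sketch below is illustrative: `ReplayArmSpec`, `build_arms`, and the candidate type strings are hypothetical, and it assumes the candidate arm carries only the pinned draft and no old skills, which the PRD does not specify.

```python
from dataclasses import dataclass, replace
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class ReplayArmSpec:
    """Hypothetical input for one replay arm; not the real ReplayRunner API."""
    arm: str  # "baseline" or "candidate"
    task_text: str
    model: str
    max_tool_iterations: int
    skills: Tuple[str, ...] = ()
    pinned_draft: Optional[str] = None
    source: str = "skill_replay_eval"  # marks the run as a replay run


def build_arms(
    task_text: str,
    model: str,
    max_tool_iterations: int,
    candidate_type: str,
    draft_text: str,
    old_skills: Sequence[str] = (),
) -> Tuple[ReplayArmSpec, ReplayArmSpec]:
    """Return (baseline, candidate) specs that share every replay setting."""
    if candidate_type == "new":
        baseline_skills: Tuple[str, ...] = ()
    elif candidate_type in ("revise", "merge"):
        baseline_skills = tuple(old_skills)
    else:
        raise ValueError(f"unknown candidate type: {candidate_type!r}")
    baseline = ReplayArmSpec(
        "baseline", task_text, model, max_tool_iterations, baseline_skills
    )
    # Assumption: the candidate arm swaps the old skills for the pinned draft.
    candidate = replace(baseline, arm="candidate", skills=(), pinned_draft=draft_text)
    return baseline, candidate
```

Deriving the candidate spec from the baseline with `dataclasses.replace` makes it hard for model settings or iteration limits to drift between arms.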

#### Tool Mode Classification

Requirements:

- Classify each tool call as:
  - `executed`: safe to execute in replay context.
  - `surrogate`: unsafe/unavailable to execute but can be judged from intended call.
  - `blocked`: cannot safely execute or judge.
- Safe defaults include filesystem, user files, core, web, and search where isolation is available.
- External writes and connector/MCP write actions default to surrogate.
- Destructive operations default to blocked.

Acceptance criteria:

- Each tool trace includes tool name, arguments, schema, toolset, metadata, mode, classification reason, and result.
- Tool calls that match destructive or sensitive terms such as delete/remove/destroy/revoke/permission/credential/payment/pay are blocked.
- Tool calls that match external write terms such as send/post/publish/create/update/invite/reply/forward are not executed against production systems by default.
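
The classification rules above can be sketched as a keyword check on the tool name. This is a hedged illustration and not the actual `ReplayToolPolicy`: `classify_tool_call`, `ToolModeDecision`, the toolset identifiers, and the rule ordering (destructive terms first, then safe isolated toolsets, then external writes) are all assumptions. Terms are matched against whole name tokens so that `pay` does not match `payload`.

```python
import re
from dataclasses import dataclass

# Term lists come from the acceptance criteria above.
DESTRUCTIVE_TERMS = {
    "delete", "remove", "destroy", "revoke",
    "permission", "credential", "payment", "pay",
}
EXTERNAL_WRITE_TERMS = {
    "send", "post", "publish", "create", "update", "invite", "reply", "forward",
}
# Toolset identifiers are hypothetical spellings of the safe defaults.
SAFE_TOOLSETS = {"filesystem", "user_files", "core", "web", "search"}


@dataclass(frozen=True)
class ToolModeDecision:
    mode: str    # "executed", "surrogate", or "blocked"
    reason: str  # classification reason recorded in the tool trace


def _name_tokens(tool_name: str) -> set:
    """Split snake_case or camelCase names into lowercase tokens."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", tool_name)
    tokens = {t for t in re.split(r"[^a-z0-9]+", spaced.lower()) if t}
    # Also match simple plurals such as "credentials".
    return tokens | {t[:-1] for t in tokens if t.endswith("s")}


def classify_tool_call(tool_name: str, toolset: str, isolated: bool = True) -> ToolModeDecision:
    tokens = _name_tokens(tool_name)
    hits = sorted(tokens & DESTRUCTIVE_TERMS)
    if hits:
        return ToolModeDecision("blocked", f"destructive term: {hits[0]}")
    if toolset in SAFE_TOOLSETS and isolated:
        return ToolModeDecision("executed", f"safe toolset in isolation: {toolset}")
    hits = sorted(tokens & EXTERNAL_WRITE_TERMS)
    if hits:
        return ToolModeDecision("surrogate", f"external write term: {hits[0]}")
    return ToolModeDecision("surrogate", "toolset not known to be safe in replay")
```

Name-based matching is only a first line of defense. A production policy would presumably also inspect tool schemas and metadata, which the trace requirements above already capture.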

#### Surrogate Evaluation

Requirements:

- Score baseline and candidate intended tool use when tools are surrogate or blocked.
- Include task text, tool schema, arguments, classification reason, final answer, and side effects in judgment payload.
- Lower confidence when surrogate or blocked coverage is high.

Acceptance criteria:

- Reports include baseline score, candidate score, delta, confidence, and validator notes.
- Blocked calls reduce score and confidence.
- Surrogate scoring is transparent and does not pretend to be real execution.
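
The link between coverage and confidence can be sketched as follows. The function names and the threshold values are placeholders for illustration only, since the mapping from coverage to confidence level is still to be calibrated.

```python
from collections import Counter
from typing import Dict, Sequence

MODES = ("executed", "surrogate", "blocked")


def tool_mode_coverage(modes: Sequence[str]) -> Dict[str, float]:
    """Return the share of tool calls in each mode (all zeros if no calls)."""
    counts = Counter(modes)
    total = len(modes)
    return {mode: (counts[mode] / total if total else 0.0) for mode in MODES}


def confidence_level(
    coverage: Dict[str, float],
    non_executed_limit: float = 0.5,  # placeholder threshold, not a product decision
    blocked_limit: float = 0.2,       # placeholder threshold, not a product decision
) -> str:
    """Lower confidence as surrogate and blocked coverage grow."""
    non_executed = coverage["surrogate"] + coverage["blocked"]
    if coverage["blocked"] >= blocked_limit or non_executed >= non_executed_limit:
        return "low"
    if non_executed > 0:
        return "medium"
    return "high"
```

For example, a case with two executed calls, one surrogate call, and one blocked call has blocked coverage of 0.25, which these placeholder thresholds map to `low`.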

#### Preservation Check

Requirements:

- For revise and merge drafts, compare base skill content against proposed draft content.
- Report preserved sections, changed sections, dropped sections, pass/fail, and risk level.
- Failed preservation blocks publish.

Acceptance criteria:

- Revision drafts with dropped important sections fail preservation.
- Reports are visible in the Skills UI.
- Publish gate blocks failed preservation.
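
A minimal sketch of the preservation comparison, assuming skill content is Markdown split by headings. `split_sections`, `check_preservation`, the report keys, and the risk-level rule are hypothetical. By default every base section is treated as important, because the PRD does not define which sections count as important.

```python
from typing import Dict, Iterable, Optional


def split_sections(markdown_text: str) -> Dict[str, str]:
    """Map each heading to its body text. Simplified: duplicate headings
    overwrite each other and '#' lines inside code fences are not handled."""
    sections: Dict[str, str] = {}
    current = "(preamble)"
    body = []
    for line in markdown_text.splitlines():
        if line.startswith("#"):
            sections[current] = "\n".join(body).strip()
            current = line.lstrip("#").strip()
            body = []
        else:
            body.append(line)
    sections[current] = "\n".join(body).strip()
    # Drop an empty preamble so it is not reported as a section.
    return {k: v for k, v in sections.items() if v or k != "(preamble)"}


def check_preservation(
    base: str, draft: str, important_sections: Optional[Iterable[str]] = None
) -> dict:
    base_s, draft_s = split_sections(base), split_sections(draft)
    important = set(base_s) if important_sections is None else set(important_sections)
    preserved = sorted(n for n in base_s if n in draft_s and draft_s[n] == base_s[n])
    changed = sorted(n for n in base_s if n in draft_s and draft_s[n] != base_s[n])
    dropped = sorted(n for n in base_s if n not in draft_s)
    passed = not any(n in important for n in dropped)
    # Placeholder risk rule: any failure is high, any other difference is medium.
    risk = "high" if not passed else ("medium" if changed or dropped else "low")
    return {
        "preserved_sections": preserved,
        "changed_sections": changed,
        "dropped_sections": dropped,
        "passed": passed,
        "risk_level": risk,
    }
```

Heading-level comparison catches dropped sections but not an instruction removed from inside a section that still exists, so a real check would need a finer-grained diff for changed sections.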

#### Eval Report Model

Requirements:

- Extend existing `SkillDraftEvalReport` without breaking legacy reports.
- Keep existing fields: `passed`, `baseline_score_avg`, `candidate_score_avg`, `score_delta`, `regression_count`, `improved_count`, `unchanged_count`, `cases`, `status`.
- Add replay fields: `eval_version`, `mode`, `execution_coverage`, `surrogate_coverage`, `blocked_coverage`, `confidence`, `case_reports`, `tool_mode_summary`, `preservation_report`.

Acceptance criteria:

- Legacy reports deserialize with default replay fields.
- New reports serialize all replay fields.
- Frontend type definitions include replay fields.
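
Backward compatibility can be sketched by giving every replay field a default so that legacy payloads still load. The class below is an illustrative stand-in for `SkillDraftEvalReport`, not its actual definition: the field types, the default values, and the `from_dict` helper are assumptions.

```python
from dataclasses import asdict, dataclass, field, fields
from typing import Any, Dict, List, Optional


@dataclass
class SkillDraftEvalReport:
    # Existing fields (types are assumed).
    passed: bool = False
    baseline_score_avg: float = 0.0
    candidate_score_avg: float = 0.0
    score_delta: float = 0.0
    regression_count: int = 0
    improved_count: int = 0
    unchanged_count: int = 0
    cases: List[Dict[str, Any]] = field(default_factory=list)
    status: str = ""
    # Replay fields; the defaults below are assumptions for legacy reports.
    eval_version: str = ""
    mode: str = "heuristic"
    execution_coverage: float = 0.0
    surrogate_coverage: float = 0.0
    blocked_coverage: float = 0.0
    confidence: str = "low"
    case_reports: List[Dict[str, Any]] = field(default_factory=list)
    tool_mode_summary: Dict[str, int] = field(default_factory=dict)
    preservation_report: Optional[Dict[str, Any]] = None

    @classmethod
    def from_dict(cls, payload: Dict[str, Any]) -> "SkillDraftEvalReport":
        """Load a stored report, ignoring unknown keys and defaulting missing ones."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in payload.items() if k in known})


# A legacy payload without replay fields still deserializes.
legacy_payload = {"passed": True, "score_delta": 0.1, "status": "passed"}
legacy_report = SkillDraftEvalReport.from_dict(legacy_payload)
serialized = asdict(legacy_report)  # now includes every replay field
```

Ignoring unknown keys in `from_dict` also lets older code read reports written by a newer eval version, at the cost of silently dropping fields it does not know.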

#### Publish Gates

Requirements:

- Draft must still have approved review and passing safety report.
- A failed eval report blocks publish unless its status is explicitly skipped-provider.
- A replay report with low confidence blocks publish.
- A replay report with blocked coverage >=1.0 (every replayed tool call was blocked) blocks publish.
- Failed preservation blocks publish.

Acceptance criteria:

- Publish attempts fail with clear errors for each gate condition.
- Skipped-provider status is visible and is never presented as a passed replay.
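
The gate conditions above can be sketched as one function that returns a clear error per failed gate. `publish_gate_errors`, the dict-shaped report, and the `skipped_provider` status string are assumptions for illustration; the real gate lives in the skill learning pipeline.

```python
from typing import Any, Dict, List


def publish_gate_errors(
    review_approved: bool, safety_passed: bool, report: Dict[str, Any]
) -> List[str]:
    """Return one message per failed gate; an empty list means publish may proceed."""
    errors: List[str] = []
    if not review_approved:
        errors.append("review is not approved")
    if not safety_passed:
        errors.append("safety report did not pass")
    # Assumption: skipped-provider reports carry this status string.
    skipped = report.get("status") == "skipped_provider"
    if not report.get("passed", False) and not skipped:
        errors.append("eval report failed")
    if report.get("mode") == "replay":
        if report.get("confidence") == "low":
            errors.append("replay confidence is low")
        if report.get("blocked_coverage", 0.0) >= 1.0:
            errors.append("all replay tool calls were blocked")
    preservation = report.get("preservation_report")
    if preservation is not None and not preservation.get("passed", False):
        errors.append("preservation check failed")
    return errors
```

Collecting every failed gate, rather than stopping at the first, lets the reviewer see the full set of reasons in one publish attempt.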

### 7.3 Technology

Backend:

- Python dataclasses.
- Existing file-backed memory stores.
- `SkillLearningPipelineService.evaluate_draft()`.
- `SkillDraftEvaluator`.
- `ReplayRunner`, `ReplayToolExecutor`, `ReplayToolPolicy`.
- `SurrogateToolEvaluator`.
- `SkillDraftEvalReport`.
- FastAPI endpoint wiring through existing Skills APIs.

Frontend:

- Next.js / TypeScript Skills page.
- Existing design system and report card patterns.
- Typed replay report fields in `types/index.ts`.

Testing:

- Unit tests for eval report compatibility.
- Case selection tests.
- Preservation tests.
- Replay executor and replay runner tests.
- Agent loop replay executor override tests.
- Surrogate scoring tests.
- Pipeline publish gate tests.
- Frontend smoke/manual review for report rendering.

### 7.4 Data Model

Eval report fields:

| Field | Type | Purpose |
| --- | --- | --- |
| `eval_version` | string | Version of eval model, e.g. `replay-v1` |
| `mode` | string | `heuristic` or `replay` |
| `execution_coverage` | number | Share of replay tool calls actually executed (0.0 to 1.0) |
| `surrogate_coverage` | number | Share of replay tool calls judged through surrogate (0.0 to 1.0) |
| `blocked_coverage` | number | Share of replay tool calls blocked (0.0 to 1.0) |
| `confidence` | string | `low`, `medium`, or `high` |
| `case_reports` | array | Detailed baseline/candidate case reports |
| `tool_mode_summary` | object | Aggregate tool mode counts |
| `preservation_report` | object/null | Preservation result for revise/merge |

### 7.5 Assumptions

- Accepted historical runs exist and are useful.
- Replay can be isolated enough for safe tool execution.
- Reviewers understand and trust the report after UI iteration.
- Surrogate scoring can be improved over time without blocking v1.
- Publish gates can be calibrated during pilot.

### 7.6 Non-Goals

- No production third-party writes during automatic replay.
- No automatic publishing based only on replay score.
- No full Docker orchestration per replay case in v1.
- No customer-configurable per-tool policy UI in v1.
- No replacement of human review.
- No claim that replay is a complete benchmark of all future tasks.

## 8. Release

### V0: Internal Validation

Scope:

- Current replay report fields.
- Current case selection.
- Current replay runner integration.
- Current tool policy.
- Current Skills UI report display.
- Current publish gates.

Exit criteria:

- Unit tests pass for skill learning replay surface.
- Golden tool policy tests prove no production side effects.
- Reviewer can make decisions from 5 seeded cases.
- Known limitations are documented.

### V1: Pilot Release

Scope:

- Reviewer decision summary.
- Replay readiness indicator.
- Better preservation diff.
- Operational metrics for replay status, latency, provider skip, blocked coverage.
- Customer-facing explanation for replay evidence and confidence.

Exit criteria:

- 0 replay side-effect incidents.
- >=80% reviewer comprehension in usability test.
- Median reviewer decision time under 10 minutes.
- Pilot admins accept report as sufficient review support.

### V2: Enterprise Hardening

Scope:

- LLM surrogate evaluator with human-labeled calibration.
- Policy profiles by deployment risk tier.
- Audit export.
- Skill quality trend across versions.
- Replay operations dashboard.

Exit criteria:

- Human vs surrogate agreement >=80% on unsafe tool golden set.
- Clear process for policy changes and incident review.
- Enterprise pilot customers can use audit export in governance review.

## Open Questions

- What minimum replay case count should be required before a report is considered useful?
- Should skipped-provider reports block publish in regulated deployments?
- What exact confidence levels should map to publish gate behavior?
- Which toolsets are safe in each deployment mode?
- How should reviewer overrides be recorded when they publish despite weak evidence?
- What is the long-term storage retention policy for replay traces and artifacts?

## Success Review Checklist

- Product: Does the report answer "should this skill be published?"
- Design: Can reviewers understand the summary without reading raw JSON?
- Engineering: Can replay failures be reproduced and diagnosed?
- Security: Can replay prove no production side effects by default?
- Customer: Does this strengthen Beaver's enterprise trust story?