steven_li a28254c6b8 docs: design skill replay eval

2026-06-08 10:29:39 +08:00

Skill Replay Eval Design

Goal

Improve skill draft evaluation so it measures real task behavior instead of relying on heuristic draft scoring. The new evaluation must cover every tool involved in a skill, while separating tools that can be executed safely from tools that require LLM surrogate judgment.

This design also fixes revision draft generation dropping important content from the original skill by making base skill preservation an explicit contract.

Current State

SkillDraftEvaluator currently builds a lightweight report from candidate.source_run_ids. It scores each historical run from validation_result.score, falling back to the run's success result when no score is present, then estimates the candidate score from the draft text alone. It does not replay the task, execute tools, or compare old skill behavior with draft skill behavior.

SkillDraftSynthesizer currently receives the candidate reason, related skill names, tool names, task summaries, and session excerpts. For revision and merge drafts, it does not receive the full base skill frontmatter and body, so generated drafts can accidentally omit important original instructions.

Design Principles

All tools are part of evaluation coverage.
Safe tools execute in an isolated replay environment.
Unsafe or unavailable tools are not ignored; they are evaluated through an LLM surrogate using intended tool calls, schema, arguments, historical evidence, and expected effects.
Evaluation reports must disclose execution coverage and surrogate coverage separately.
Revision drafts must preserve original skill content unless a change is explicitly justified.
Replay runs must not write to production workspace, user files, memory, third-party accounts, or external systems by default.

Evaluation Model

Each draft eval selects 3 to 5 historical cases.

For revise_skill, select accepted historical runs that activated the target skill/version. Prefer recent accepted runs, then diversify by task and session.

For new_skill, select candidate source runs and accepted runs with similar task themes.

For merge_skills, select accepted runs where the related skills co-activated.
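
The revise_skill selection rule above can be sketched as follows. This is a minimal illustration, not the Beaver implementation: the HistoricalRun record, its field names, and select_revision_cases are hypothetical, and the sketch caps the selection at 5 cases without enforcing the lower bound of 3.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class HistoricalRun:
    # Hypothetical record shape; field names are assumptions for illustration.
    run_id: str
    task_id: str
    session_id: str
    accepted: bool
    finished_at: datetime


def select_revision_cases(runs: List[HistoricalRun], max_cases: int = 5) -> List[HistoricalRun]:
    """Pick accepted runs, newest first, diversified by task and session."""
    accepted = sorted(
        (r for r in runs if r.accepted),
        key=lambda r: r.finished_at,
        reverse=True,
    )
    picked, seen_tasks, seen_sessions = [], set(), set()
    # First pass: only take runs that add both a new task and a new session.
    for run in accepted:
        if len(picked) == max_cases:
            break
        if run.task_id in seen_tasks or run.session_id in seen_sessions:
            continue
        picked.append(run)
        seen_tasks.add(run.task_id)
        seen_sessions.add(run.session_id)
    # Second pass: fill any remaining slots with the most recent leftovers.
    for run in accepted:
        if len(picked) == max_cases:
            break
        if run not in picked:
            picked.append(run)
    return picked
```

A caller would still need to decide what to do when fewer than 3 accepted runs exist, for example by reporting reduced confidence.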

Each case runs two arms:

Baseline arm: no skill for new_skill, old skill for revise_skill, or old related skills for merge_skills.
Candidate arm: draft skill injected as pinned draft guidance.

Both arms use the same task text, same bounded historical context, same model settings, same max tool iterations, and same replay policy.
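
One way to guarantee that the two arms differ only in the injected skill is to derive the candidate arm from the baseline arm. The sketch below assumes a hypothetical ArmConfig record and build_arms helper; the default model name, iteration limit, and policy label are placeholders, not values from this design.

```python
from dataclasses import dataclass, replace
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ArmConfig:
    # Hypothetical arm description; field names are assumptions.
    arm: str                      # "baseline" or "candidate"
    task_text: str
    context: Tuple[str, ...]      # bounded historical context
    model: str
    max_tool_iterations: int
    replay_policy: str
    skills: Tuple[str, ...]       # skill bodies injected as guidance


def build_arms(
    candidate_type: str,
    task_text: str,
    context: Sequence[str],
    draft_skill: str,
    old_skills: Sequence[str] = (),
    model: str = "example-model",
    max_tool_iterations: int = 8,
    replay_policy: str = "isolated",
) -> Tuple[ArmConfig, ArmConfig]:
    if candidate_type == "new_skill":
        baseline_skills: Tuple[str, ...] = ()          # no skill at all
    elif candidate_type in ("revise_skill", "merge_skills"):
        baseline_skills = tuple(old_skills)            # old skill(s)
    else:
        raise ValueError(f"unknown candidate type: {candidate_type}")
    baseline = ArmConfig(
        "baseline", task_text, tuple(context), model,
        max_tool_iterations, replay_policy, baseline_skills,
    )
    # The candidate arm copies every setting and swaps only the skill guidance.
    candidate = replace(baseline, arm="candidate", skills=(draft_skill,))
    return baseline, candidate
```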

Tool Execution Modes

Each tool call in replay resolves to one of these modes:

executed: Tool was safely executed in the replay environment.
surrogate: Tool was not executed, but the intended call and expected effect were evaluated by an LLM.
blocked: Tool could not be executed or judged reliably.

The goal is not to exclude third-party tools. It is to include them with the strongest safe evaluation method available.

Examples:

Filesystem reads and writes run against a temporary workspace clone.
User file writes run against a temporary user-file namespace when available.
Web/search reads can execute and cache outputs.
Email/calendar/message sending to production systems does not execute by default. The replay records the intended call and evaluates it through surrogate judgment unless a sandbox/test connector is configured.
Destructive actions such as delete, payment, permission changes, or irreversible external writes default to surrogate or blocked.
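
The examples above suggest a small classification step that runs before each tool call. The following is a sketch under stated assumptions: the tool category labels ("filesystem", "web_read", "user_file_write", "external_send", "destructive") and the classify_tool_call helper are hypothetical, and treating a user file write without a temporary namespace as surrogate is an assumption this design does not state.

```python
from enum import Enum
from typing import Tuple


class ToolMode(str, Enum):
    EXECUTED = "executed"
    SURROGATE = "surrogate"
    BLOCKED = "blocked"


# Hypothetical category labels; a real tool registry would supply these.
ALWAYS_SAFE = {"filesystem", "web_read"}
SAFE_WITH_SANDBOX = {"user_file_write", "external_send"}


def classify_tool_call(
    kind: str,
    sandbox_available: bool = False,
    can_judge: bool = True,
) -> Tuple[ToolMode, str]:
    """Return the replay mode for a tool call plus a human-readable reason."""
    if kind in ALWAYS_SAFE:
        return ToolMode.EXECUTED, "runs against the temporary workspace or a cached read"
    if kind in SAFE_WITH_SANDBOX and sandbox_available:
        return ToolMode.EXECUTED, "sandbox connector or temporary namespace configured"
    # Destructive and unknown kinds never execute, even with a sandbox.
    if can_judge:
        return ToolMode.SURROGATE, "not executed; intended call judged by LLM surrogate"
    return ToolMode.BLOCKED, "cannot be executed or judged reliably"
```

Returning the reason alongside the mode keeps the report able to explain why a call was not executed.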

Replay Environment

The replay runner creates isolated state per case and arm:

Temporary session id.
Temporary workspace root.
Temporary task id or replay id.
Tool call trace.
Output artifacts.
Side-effect journal.
Captured final answer.

This follows the OfficeBench MCP pattern: run tools in an isolated testbed where possible, pull artifacts and state after execution, then evaluate outside the runner. Beaver should reuse this shape without depending on OfficeBench's fixed benchmark functions.
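
A minimal sketch of per-case, per-arm isolation using only the Python standard library is shown below. ReplayState and isolated_replay are hypothetical names; the sketch covers the temporary ids, workspace root, trace, artifacts, journal, and final answer, and leaves tool wiring out.

```python
import tempfile
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, Iterator, List, Optional


@dataclass
class ReplayState:
    # Hypothetical container for the isolated state listed above.
    replay_id: str
    session_id: str
    workspace_root: Path
    tool_trace: List[Dict[str, Any]] = field(default_factory=list)
    artifacts: List[str] = field(default_factory=list)
    side_effects: List[Dict[str, Any]] = field(default_factory=list)
    final_answer: Optional[str] = None


@contextmanager
def isolated_replay(case_id: str, arm: str) -> Iterator[ReplayState]:
    """Create throwaway state for one case and arm; the workspace is deleted on exit."""
    with tempfile.TemporaryDirectory(prefix=f"replay-{arm}-") as tmp:
        yield ReplayState(
            replay_id=f"{case_id}-{arm}-{uuid.uuid4().hex[:8]}",
            session_id=f"replay-session-{uuid.uuid4().hex}",
            workspace_root=Path(tmp),
        )
```

Because the workspace disappears when the context exits, artifacts that the evaluator needs must be copied out before leaving the block, which matches the pull-then-evaluate shape described above.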

Surrogate Evaluation

When a tool cannot be safely executed, the agent should still be allowed to plan or attempt the tool call. The replay layer records:

Tool name.
Tool schema.
Arguments.
Tool classification reason.
Historical accepted evidence.
Expected side effect inferred from the task.
Any assistant rationale around the call.

The surrogate evaluator compares baseline and candidate intended effects. It scores whether the intended tool use would satisfy the task, whether arguments are complete and correct, and whether the call is risky, missing, duplicated, or unnecessary.

Surrogate scoring contributes to the final candidate score, but lowers confidence compared with real execution.
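
The recorded fields and the comparison criteria can be bundled into one payload for the surrogate evaluator. The sketch below is illustrative only: IntendedCall, RUBRIC, and build_surrogate_payload are hypothetical names, and the actual prompt format sent to the LLM is not specified here.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class IntendedCall:
    # Mirrors the fields the replay layer records for a non-executed call.
    tool_name: str
    tool_schema: Dict[str, Any]
    arguments: Dict[str, Any]
    classification_reason: str
    expected_side_effect: str
    historical_evidence: List[str] = field(default_factory=list)
    assistant_rationale: str = ""


# Criteria taken from the paragraph above; the key names are assumptions.
RUBRIC = [
    "satisfies_task",
    "arguments_complete_and_correct",
    "risky",
    "missing",
    "duplicated",
    "unnecessary",
]


def build_surrogate_payload(
    task_text: str,
    baseline_calls: List[IntendedCall],
    candidate_calls: List[IntendedCall],
) -> Dict[str, Any]:
    """Assemble a JSON-serializable comparison payload for the surrogate judge."""
    return {
        "task": task_text,
        "baseline_calls": [asdict(c) for c in baseline_calls],
        "candidate_calls": [asdict(c) for c in candidate_calls],
        "rubric": list(RUBRIC),
    }
```

Keeping the payload plain JSON makes it easy to store next to the case report and to unit test without calling a model.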

Scoring

Each case produces:

baseline_score
candidate_score
delta
execution_coverage
surrogate_coverage
blocked_tool_count
confidence
tool_calls
artifacts
side_effects
validator_notes

The draft report aggregates:

Baseline mean.
Candidate mean.
Score delta.
Improved count.
Regression count.
Unchanged count.
Execution coverage.
Surrogate coverage.
Blocked coverage.
Confidence.

Publish gates should consider both score and confidence. A passing score with low confidence should require stronger human review, not automatic trust.
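
The aggregation and the score-plus-confidence gate can be sketched as below. This design does not define a confidence formula, so the weighting used here (executed calls count fully, surrogate calls count half, blocked calls count zero) and the 0.7 threshold are assumptions for illustration. CaseResult, aggregate_cases, and review_level are hypothetical names.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class CaseResult:
    baseline_score: float
    candidate_score: float
    executed: int     # tool calls run in the replay environment
    surrogate: int    # tool calls judged by the LLM surrogate
    blocked: int      # tool calls neither executed nor judged


def aggregate_cases(cases: List[CaseResult], epsilon: float = 1e-9) -> Dict[str, float]:
    """Aggregate per-case results; expects at least one case."""
    total = sum(c.executed + c.surrogate + c.blocked for c in cases)

    def share(count: int) -> float:
        return count / total if total else 0.0

    deltas = [c.candidate_score - c.baseline_score for c in cases]
    execution = share(sum(c.executed for c in cases))
    surrogate = share(sum(c.surrogate for c in cases))
    baseline_mean = mean(c.baseline_score for c in cases)
    candidate_mean = mean(c.candidate_score for c in cases)
    return {
        "baseline_mean": baseline_mean,
        "candidate_mean": candidate_mean,
        "score_delta": candidate_mean - baseline_mean,
        "improved_count": sum(d > epsilon for d in deltas),
        "regression_count": sum(d < -epsilon for d in deltas),
        "unchanged_count": sum(abs(d) <= epsilon for d in deltas),
        "execution_coverage": execution,
        "surrogate_coverage": surrogate,
        "blocked_coverage": share(sum(c.blocked for c in cases)),
        # Assumed weighting, not defined by this design.
        "confidence": execution + 0.5 * surrogate,
    }


def review_level(report: Dict[str, float], min_confidence: float = 0.7) -> str:
    """A passing score with low confidence escalates review instead of passing outright."""
    if report["score_delta"] <= 0:
        return "fail"
    return "standard_review" if report["confidence"] >= min_confidence else "strong_review"
```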

Draft Preservation

Revision and merge synthesis must include base skill snapshots:

Base skill name.
Base version.
Full base frontmatter.
Full base content.
Tool hints.
Current published summary.

The synthesis prompt must require the model to preserve existing instructions unless it explicitly changes them. The output remains a full proposed skill body, but it should also include:

preserved_sections
changed_sections
dropped_sections
change_reason

After generation, a preservation checker compares base content and draft content. If critical sections disappear without explanation, the draft eval should mark preservation risk and require revision before approval.
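
A simple form of the preservation checker compares sections by heading. The sketch below assumes skill bodies use Markdown "#" headings and that every base section is critical unless the caller says otherwise; both are assumptions, and split_sections and check_preservation are hypothetical names.

```python
import re
from typing import Dict, Iterable, Optional

HEADING = re.compile(r"^#{1,6}\s+(.*\S)\s*$")


def split_sections(markdown_body: str) -> Dict[str, str]:
    """Map each heading title to the text below it; leading text goes under '_preamble'."""
    sections: Dict[str, str] = {}
    title, lines = "_preamble", []

    def flush() -> None:
        text = "\n".join(lines).strip()
        if title != "_preamble" or text:
            sections[title] = text

    for line in markdown_body.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, lines = match.group(1), []
        else:
            lines.append(line)
    flush()
    return sections


def check_preservation(
    base_body: str,
    draft_body: str,
    declared_dropped: Iterable[str] = (),
    critical: Optional[Iterable[str]] = None,
) -> Dict[str, object]:
    base, draft = split_sections(base_body), split_sections(draft_body)
    declared = set(declared_dropped)
    critical_titles = set(base) if critical is None else set(critical)
    dropped = [t for t in base if t not in draft]
    unexplained = [t for t in dropped if t not in declared]
    return {
        "preserved_sections": [t for t in base if t in draft and base[t] == draft[t]],
        "changed_sections": [t for t in base if t in draft and base[t] != draft[t]],
        "dropped_sections": dropped,
        "unexplained_drops": unexplained,
        # Risk is raised only when a critical section vanished without explanation.
        "preservation_risk": any(t in critical_titles for t in unexplained),
    }
```

Exact text comparison is deliberately strict; a production checker might normalize whitespace or use a semantic comparison instead.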

API And Storage

The existing SkillDraftEvalReport should be extended rather than replaced.

Add fields for:

eval_version
mode, with values such as heuristic or replay
execution_coverage
surrogate_coverage
blocked_coverage
confidence
case_reports
tool_mode_summary
preservation_report

The existing simple fields remain for UI compatibility: passed, baseline_score_avg, candidate_score_avg, score_delta, regression_count, improved_count, unchanged_count, cases, and status.
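
Extending rather than replacing the report can be done by giving every new field a default, so payloads written before the change still load. The sketch below assumes the report can be modeled as a dataclass; the class name SkillDraftEvalReportSketch, the field types, and the defaults are assumptions, since the real SkillDraftEvalReport definition is not shown here.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, List, Optional


@dataclass
class SkillDraftEvalReportSketch:
    # Existing simple fields, kept for UI compatibility (types are assumed).
    passed: bool
    baseline_score_avg: float
    candidate_score_avg: float
    score_delta: float
    regression_count: int
    improved_count: int
    unchanged_count: int
    cases: List[Dict[str, Any]]
    status: str
    # New fields, all defaulted so legacy payloads remain valid.
    eval_version: int = 1
    mode: str = "heuristic"
    execution_coverage: float = 0.0
    surrogate_coverage: float = 0.0
    blocked_coverage: float = 0.0
    confidence: float = 0.0
    case_reports: List[Dict[str, Any]] = field(default_factory=list)
    tool_mode_summary: Dict[str, int] = field(default_factory=dict)
    preservation_report: Optional[Dict[str, Any]] = None

    @classmethod
    def from_payload(cls, payload: Dict[str, Any]) -> "SkillDraftEvalReportSketch":
        """Load a stored payload, ignoring keys this version does not know."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in payload.items() if k in known})
```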

UI

The Skills draft review page should continue to show a concise summary first:

Passed or failed.
Baseline mean.
Candidate mean.
Delta.
Execution coverage.
Surrogate coverage.
Confidence.

Detailed sections show:

Replay cases.
Tool calls by mode.
Blocked or surrogate reasons.
Artifacts and side effects.
Preservation report for revision drafts.
Raw eval payload.

The user should not need to configure per-tool policies for normal use. The report should explain coverage and uncertainty after the fact.

Error Handling

If replay infrastructure fails before any case runs, eval status is replay_error and the draft cannot rely on a replay pass.

If some cases fail but others complete, eval status is partial and confidence is reduced.

If a provider is unavailable, keep the current skipped-provider behavior but mark the report as having no replay evidence.

If all important tool calls become blocked, the draft should not pass automatically even if surrogate scoring is high.
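
The four rules above can be combined into one status resolver. In this sketch, replay_error and partial come from the text, while the "skipped_provider" and "completed" labels, the may_pass flag, and the resolve_eval_status helper are hypothetical. It also approximates "fails before any case runs" as zero completed cases.

```python
from typing import Dict, Union


def resolve_eval_status(
    provider_available: bool,
    total_cases: int,
    completed_cases: int,
    important_calls: int,
    blocked_important_calls: int,
    score_passed: bool,
) -> Dict[str, Union[str, bool]]:
    """Map replay outcomes to a status and whether the draft may pass on replay evidence."""
    if not provider_available:
        # Existing skipped-provider behavior, flagged as carrying no replay evidence.
        return {"status": "skipped_provider", "replay_evidence": False, "may_pass": False}
    if completed_cases == 0:
        return {"status": "replay_error", "replay_evidence": False, "may_pass": False}
    status = "partial" if completed_cases < total_cases else "completed"
    # A high surrogate score must not pass when every important call was blocked.
    all_blocked = important_calls > 0 and blocked_important_calls >= important_calls
    return {
        "status": status,
        "replay_evidence": True,
        "may_pass": score_passed and not all_blocked,
    }
```

Confidence reduction for partial runs is left to the aggregation step and is not modeled here.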

Testing

Unit tests should cover:

Historical case selection for new_skill, revise_skill, and merge_skills candidates.
Baseline and candidate arm construction.
Tool mode classification and aggregation.
Surrogate scoring payload construction.
Preservation checker behavior.
Publish gate behavior for low-confidence or blocked reports.

Integration-style tests should use stub tools:

A safe filesystem write tool that writes to a temporary workspace.
An external write tool that is intercepted into surrogate mode.
A mixed case where candidate improves one real artifact and one surrogate side effect.
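
The stub tools listed above might look like the following. This is a sketch for test scaffolding only: StubReplayTools, write_file, and send_email are hypothetical names, and the email stub records the intended call instead of contacting any system.

```python
from pathlib import Path
from typing import Any, Dict, List


class StubReplayTools:
    """Stub tools: one safe filesystem write, one intercepted external write."""

    def __init__(self, workspace_root: str) -> None:
        self.workspace_root = Path(workspace_root).resolve()
        self.trace: List[Dict[str, Any]] = []
        self.side_effect_journal: List[Dict[str, Any]] = []

    def write_file(self, relative_path: str, text: str) -> str:
        target = (self.workspace_root / relative_path).resolve()
        # Refuse any path that escapes the temporary workspace.
        if self.workspace_root not in target.parents:
            raise ValueError(f"path escapes replay workspace: {relative_path}")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")
        self.trace.append({"tool": "write_file", "mode": "executed"})
        return str(target)

    def send_email(self, to: str, subject: str, body: str) -> Dict[str, str]:
        # Intercepted: the intended side effect is journaled, nothing is sent.
        self.side_effect_journal.append(
            {"tool": "send_email", "to": to, "subject": subject, "body": body}
        )
        self.trace.append({"tool": "send_email", "mode": "surrogate"})
        return {"status": "recorded_not_sent"}
```

An integration-style test can then assert on both the real artifact in the workspace and the journaled surrogate side effect for the same case.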

Out Of Scope

Real production third-party writes during automatic replay.
Full Docker orchestration for all Beaver replay cases in the first implementation.
Per-tool user policy UI.
Replacing human review. Replay improves evidence but does not remove review gates.
