71 KiB
Skill Replay Eval Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (
- [ ]) syntax for tracking.
Goal: Replace heuristic-only skill draft evaluation with replay-style reports that cover all tools through either safe execution or LLM surrogate judgment, while preserving base skill content during draft synthesis.
Architecture: Extend the existing skill learning pipeline instead of replacing it. Add replay report fields to SkillDraftEvalReport, introduce focused helper modules under beaver/skills/learning/, then wire the enhanced evaluator through the existing SkillLearningPipelineService.evaluate_draft() path and Skills UI.
Tech Stack: Python dataclasses, existing file-backed stores, pytest, FastAPI response payloads, Next.js/TypeScript Skills page components.
File Structure
Create focused backend modules:
app-instance/backend/beaver/skills/learning/case_selection.pyselects up to 10 historical replay cases.app-instance/backend/beaver/skills/learning/preservation.pybuilds base skill snapshots and checks draft preservation.app-instance/backend/beaver/skills/learning/replay.pydefines replay tool policy, trace executor, arm reports, and the AgentLoop-backed replay runner.app-instance/backend/beaver/skills/learning/surrogate.pyscores non-executed tool calls with an LLM-backed or deterministic fallback surrogate.
Modify existing backend modules:
app-instance/backend/beaver/memory/skills/models.pyextendsSkillDraftEvalReportwith replay fields while preserving old payload compatibility.app-instance/backend/beaver/skills/learning/eval.pyorchestrates replay eval and keeps heuristic fallback.app-instance/backend/beaver/skills/learning/synthesizer.pyincludes base skill snapshots for revision and merge prompts.app-instance/backend/beaver/skills/learning/service.pypassesSkillSpecStorecontext into synthesis helpers where needed.app-instance/backend/beaver/skills/learning/pipeline.pyupdates publish gate confidence checks.app-instance/backend/beaver/engine/loop.pyaccepts a replay executor override for isolated tool execution.app-instance/backend/beaver/interfaces/web/app.pyinjectsReplayRunner(agent_loop=loop)for draft evaluation requests.
Modify frontend:
app-instance/frontend/types/index.tsadds replay report fields.app-instance/frontend/app/(app)/skills/page.tsxshows execution coverage, surrogate coverage, confidence, replay cases, and preservation details.
Add focused tests:
app-instance/backend/tests/unit/test_skill_learning_eval_report_model.pyapp-instance/backend/tests/unit/test_skill_learning_case_selection.pyapp-instance/backend/tests/unit/test_skill_learning_preservation.pyapp-instance/backend/tests/unit/test_skill_learning_replay.pyapp-instance/backend/tests/unit/test_skill_learning_replay_runner.pyapp-instance/backend/tests/unit/test_agent_loop_replay_executor.pyapp-instance/backend/tests/unit/test_skill_learning_surrogate.py- Extend
app-instance/backend/tests/unit/test_skill_learning_eval.py - Extend
app-instance/backend/tests/unit/test_skill_learning_pipeline.py
Task 1: Extend Eval Report Model Compatibly
Files:
-
Modify:
app-instance/backend/beaver/memory/skills/models.py -
Test:
app-instance/backend/tests/unit/test_skill_learning_eval_report_model.py -
Step 1: Write failing model compatibility tests
Create app-instance/backend/tests/unit/test_skill_learning_eval_report_model.py:
from __future__ import annotations
from beaver.memory.skills import SkillDraftEvalReport
def test_eval_report_defaults_preserve_legacy_payload_shape() -> None:
report = SkillDraftEvalReport(
report_id="eval-1",
skill_name="debug",
draft_id="draft-1",
candidate_id="candidate-1",
passed=True,
baseline_score_avg=0.5,
candidate_score_avg=0.8,
score_delta=0.3,
regression_count=0,
improved_count=2,
unchanged_count=0,
cases=[{"run_id": "run-1"}],
status="completed",
created_at="now",
)
payload = report.to_dict()
assert payload["eval_version"] == "heuristic-v1"
assert payload["mode"] == "heuristic"
assert payload["execution_coverage"] == 0.0
assert payload["surrogate_coverage"] == 0.0
assert payload["blocked_coverage"] == 0.0
assert payload["confidence"] == "low"
assert payload["case_reports"] == []
assert payload["tool_mode_summary"] == {}
assert payload["preservation_report"] is None
assert payload["cases"] == [{"run_id": "run-1"}]
def test_eval_report_reads_legacy_payload_without_replay_fields() -> None:
report = SkillDraftEvalReport.from_dict(
{
"report_id": "eval-legacy",
"skill_name": "debug",
"draft_id": "draft-1",
"candidate_id": "candidate-1",
"passed": True,
"baseline_score_avg": 0.4,
"candidate_score_avg": 0.8,
"score_delta": 0.4,
"regression_count": 0,
"improved_count": 1,
"unchanged_count": 0,
"cases": [{"run_id": "run-1"}],
"status": "completed",
"created_at": "now",
}
)
assert report.eval_version == "heuristic-v1"
assert report.mode == "heuristic"
assert report.confidence == "low"
assert report.case_reports == []
- Step 2: Run the new tests to verify failure
Run:
cd app-instance/backend
pytest tests/unit/test_skill_learning_eval_report_model.py -v
Expected: FAIL because SkillDraftEvalReport has no replay fields.
- Step 3: Extend the dataclass
In app-instance/backend/beaver/memory/skills/models.py, add fields to SkillDraftEvalReport after created_at:
eval_version: str = "heuristic-v1"
mode: str = "heuristic"
execution_coverage: float = 0.0
surrogate_coverage: float = 0.0
blocked_coverage: float = 0.0
confidence: str = "low"
case_reports: list[dict[str, Any]] = field(default_factory=list)
tool_mode_summary: dict[str, Any] = field(default_factory=dict)
preservation_report: dict[str, Any] | None = None
Update to_dict() to include these fields:
"eval_version": self.eval_version,
"mode": self.mode,
"execution_coverage": self.execution_coverage,
"surrogate_coverage": self.surrogate_coverage,
"blocked_coverage": self.blocked_coverage,
"confidence": self.confidence,
"case_reports": [dict(item) for item in self.case_reports],
"tool_mode_summary": dict(self.tool_mode_summary),
"preservation_report": dict(self.preservation_report) if self.preservation_report is not None else None,
Update from_dict() with bounded numeric parsing:
eval_version=str(payload.get("eval_version") or "heuristic-v1"),
mode=str(payload.get("mode") or "heuristic"),
execution_coverage=_bounded_float(payload.get("execution_coverage"), default=0.0),
surrogate_coverage=_bounded_float(payload.get("surrogate_coverage"), default=0.0),
blocked_coverage=_bounded_float(payload.get("blocked_coverage"), default=0.0),
confidence=str(payload.get("confidence") or "low"),
case_reports=[dict(item) for item in payload.get("case_reports") or [] if isinstance(item, dict)],
tool_mode_summary=dict(payload.get("tool_mode_summary") or {}),
preservation_report=(
dict(payload["preservation_report"])
if isinstance(payload.get("preservation_report"), dict)
else None
),
Add helper near _optional_float:
def _bounded_float(value: Any, *, default: float = 0.0) -> float:
if value in (None, ""):
return default
try:
return max(0.0, min(1.0, float(value)))
except (TypeError, ValueError):
return default
- Step 4: Run model tests
Run:
cd app-instance/backend
pytest tests/unit/test_skill_learning_eval_report_model.py -v
Expected: PASS.
- Step 5: Run existing eval store test
Run:
cd app-instance/backend
pytest tests/unit/test_skill_learning_candidate_state.py::test_reports_are_stored_and_retrieved -v
Expected: PASS.
- Step 6: Commit
git add app-instance/backend/beaver/memory/skills/models.py app-instance/backend/tests/unit/test_skill_learning_eval_report_model.py
git commit -m "feat(skill-learning): extend eval report payload"
Task 2: Add Base Skill Preservation Helpers
Files:
-
Create:
app-instance/backend/beaver/skills/learning/preservation.py -
Modify:
app-instance/backend/beaver/skills/learning/__init__.py -
Test:
app-instance/backend/tests/unit/test_skill_learning_preservation.py -
Step 1: Write failing preservation tests
Create app-instance/backend/tests/unit/test_skill_learning_preservation.py:
from __future__ import annotations
from beaver.skills.learning.preservation import check_preservation
def test_preservation_passes_when_base_sections_remain() -> None:
base = "# Skill\n\n## Workflow\n\n- Read first.\n\n## Safety\n\n- Do not delete files.\n"
draft = "# Skill\n\n## Workflow\n\n- Read first.\n- Then write.\n\n## Safety\n\n- Do not delete files.\n"
report = check_preservation(base_content=base, draft_content=draft)
assert report["passed"] is True
assert report["risk_level"] == "low"
assert "Workflow" in report["preserved_sections"]
assert "Safety" in report["preserved_sections"]
assert report["dropped_sections"] == []
def test_preservation_flags_dropped_section() -> None:
base = "# Skill\n\n## Workflow\n\n- Read first.\n\n## Safety\n\n- Do not delete files.\n"
draft = "# Skill\n\n## Workflow\n\n- Read first.\n"
report = check_preservation(base_content=base, draft_content=draft)
assert report["passed"] is False
assert report["risk_level"] == "high"
assert "Safety" in report["dropped_sections"]
- Step 2: Run tests to verify failure
Run:
cd app-instance/backend
pytest tests/unit/test_skill_learning_preservation.py -v
Expected: FAIL because preservation.py does not exist.
- Step 3: Implement preservation helpers
Create app-instance/backend/beaver/skills/learning/preservation.py:
"""Preservation checks for skill revision drafts."""
from __future__ import annotations
import re
from typing import Any
def check_preservation(*, base_content: str, draft_content: str) -> dict[str, Any]:
base_sections = _sections(base_content)
draft_sections = _sections(draft_content)
preserved: list[str] = []
changed: list[str] = []
dropped: list[str] = []
for heading, body in base_sections.items():
draft_body = draft_sections.get(heading)
if draft_body is None:
dropped.append(heading)
continue
if _normalize(body) == _normalize(draft_body):
preserved.append(heading)
else:
changed.append(heading)
risk_level = "high" if dropped else ("medium" if changed else "low")
return {
"passed": not dropped,
"risk_level": risk_level,
"preserved_sections": preserved,
"changed_sections": changed,
"dropped_sections": dropped,
}
def _sections(content: str) -> dict[str, str]:
current = "body"
sections: dict[str, list[str]] = {current: []}
for line in (content or "").splitlines():
match = re.match(r"^#{1,6}\s+(.+?)\s*$", line)
if match:
current = match.group(1).strip()
sections.setdefault(current, [])
continue
sections.setdefault(current, []).append(line)
return {heading: "\n".join(lines).strip() for heading, lines in sections.items() if "\n".join(lines).strip()}
def _normalize(value: str) -> str:
return re.sub(r"\s+", " ", value or "").strip().lower()
Export helpers from app-instance/backend/beaver/skills/learning/__init__.py:
from .preservation import check_preservation
Add "check_preservation" to __all__.
- Step 4: Run preservation tests
Run:
cd app-instance/backend
pytest tests/unit/test_skill_learning_preservation.py -v
Expected: PASS.
- Step 5: Commit
git add app-instance/backend/beaver/skills/learning/preservation.py app-instance/backend/beaver/skills/learning/__init__.py app-instance/backend/tests/unit/test_skill_learning_preservation.py
git commit -m "feat(skill-learning): add draft preservation checks"
Task 3: Include Base Skill Snapshots During Synthesis
Files:
-
Modify:
app-instance/backend/beaver/skills/learning/synthesizer.py -
Modify:
app-instance/backend/beaver/skills/learning/service.py -
Test:
app-instance/backend/tests/unit/test_skill_learning_synthesizer_preservation.py -
Step 1: Write failing prompt test
Create app-instance/backend/tests/unit/test_skill_learning_synthesizer_preservation.py:
from __future__ import annotations
from beaver.memory.skills import SkillLearningCandidate
from beaver.skills.learning.evidence import EvidencePacket
from beaver.skills.learning.synthesizer import SkillDraftSynthesizer
def test_revision_prompt_includes_base_skill_snapshot() -> None:
candidate = SkillLearningCandidate(
candidate_id="candidate-1",
kind="revise_skill",
source_run_ids=["run-1"],
source_session_ids=["session-1"],
related_skill_names=["debug-skill"],
reason="Improve debugging flow.",
)
packet = EvidencePacket(
run_ids=["run-1"],
session_ids=["session-1"],
task_summaries=["debug a failing test"],
session_excerpts=["assistant: fixed it"],
)
prompt = SkillDraftSynthesizer._build_prompt(
candidate,
packet,
"revise",
base_skill={
"skill_name": "debug-skill",
"version": "v0001",
"frontmatter": {"description": "Debug tests", "tools": ["read_file"]},
"content": "# Debug Skill\n\n## Safety\n\nDo not delete files.",
"summary": "Debug tests safely.",
"tool_hints": ["read_file"],
},
)
assert "Base skill snapshot" in prompt
assert "# Debug Skill" in prompt
assert "Do not delete files." in prompt
assert "preserved_sections" in prompt
assert "dropped_sections" in prompt
- Step 2: Run test to verify failure
Run:
cd app-instance/backend
pytest tests/unit/test_skill_learning_synthesizer_preservation.py -v
Expected: FAIL because _build_prompt() does not accept base_skill.
- Step 3: Update synthesizer signatures
In app-instance/backend/beaver/skills/learning/synthesizer.py, change these signatures:
async def synthesize_revision(
self,
candidate: SkillLearningCandidate,
evidence_packet: EvidencePacket,
provider: LLMProvider,
model: str,
base_skill: dict[str, Any] | None = None,
) -> dict[str, Any]:
return await self._synthesize(candidate, evidence_packet, provider, model, "revise", base_skill=base_skill)
Apply the same optional base_skill parameter to synthesize_merge() and _synthesize(). Keep synthesize_new_skill() unchanged except for calling _synthesize(..., base_skill=None).
- Step 4: Update prompt builder
Update _build_prompt() signature:
def _build_prompt(
candidate: SkillLearningCandidate,
evidence_packet: EvidencePacket,
action: str,
base_skill: dict[str, Any] | None = None,
) -> str:
Before the return, build a base section:
base_section = ""
if base_skill:
base_section = (
"\n\nBase skill snapshot:\n"
f"- skill_name: {base_skill.get('skill_name')}\n"
f"- version: {base_skill.get('version')}\n"
f"- frontmatter: {json.dumps(base_skill.get('frontmatter') or {}, ensure_ascii=False, sort_keys=True)}\n"
f"- tool_hints: {base_skill.get('tool_hints') or []}\n"
f"- summary: {base_skill.get('summary') or ''}\n"
"Base skill content:\n"
f"{base_skill.get('content') or ''}\n"
"Preserve existing instructions unless the evidence requires a change. "
"If any section is changed or dropped, explain it in changed_sections or dropped_sections."
)
Append base_section before the JSON instructions. Extend the JSON instruction:
+ "\nThe JSON may include preserved_sections, changed_sections, and dropped_sections arrays."
- Step 5: Preserve parsed metadata from model output
In _parse_payload(), include optional arrays:
"preserved_sections": _coerce_string_list(payload.get("preserved_sections")),
"changed_sections": _coerce_string_list(payload.get("changed_sections")),
"dropped_sections": _coerce_string_list(payload.get("dropped_sections")),
In _normalize_payload(), return those fields as well.
- Step 6: Add base skill snapshot lookup in service
In app-instance/backend/beaver/skills/learning/service.py, import SkillSpecStore only if needed by type checking is not enough. The service already has draft_service.store; use it to read published skills.
Add helper:
def _base_skill_snapshot(self, skill_name: str, version: str | None) -> dict[str, Any] | None:
loaded = self.draft_service.store.read_published_skill(skill_name, version)
if loaded is None:
return None
return {
"skill_name": loaded.version.skill_name,
"version": loaded.version.version,
"frontmatter": dict(loaded.version.frontmatter),
"content": loaded.content,
"summary": loaded.version.summary,
"tool_hints": list(loaded.version.tool_hints),
}
Pass this snapshot to synthesize_revision() and synthesize_merge().
- Step 7: Run synthesizer test
Run:
cd app-instance/backend
pytest tests/unit/test_skill_learning_synthesizer_preservation.py -v
Expected: PASS.
- Step 8: Run existing learning tests
Run:
cd app-instance/backend
pytest tests/unit/test_skill_learning_eval.py tests/unit/test_skill_learning_worker.py -v
Expected: PASS.
- Step 9: Commit
git add app-instance/backend/beaver/skills/learning/synthesizer.py app-instance/backend/beaver/skills/learning/service.py app-instance/backend/tests/unit/test_skill_learning_synthesizer_preservation.py
git commit -m "feat(skill-learning): preserve base skill during synthesis"
Task 4: Add Historical Replay Case Selection
Files:
-
Create:
app-instance/backend/beaver/skills/learning/case_selection.py -
Modify:
app-instance/backend/beaver/skills/learning/__init__.py -
Test:
app-instance/backend/tests/unit/test_skill_learning_case_selection.py -
Step 1: Write failing case selection tests
Create app-instance/backend/tests/unit/test_skill_learning_case_selection.py:
from __future__ import annotations
from beaver.memory.runs import RunRecord
from beaver.memory.skills import SkillLearningCandidate
from beaver.skills.learning.case_selection import select_replay_cases
from beaver.skills.specs import SkillActivationReceipt
def _run(
run_id: str,
*,
task_id: str = "task",
session_id: str = "session",
task_text: str = "debug task",
skill_name: str | None = None,
skill_version: str = "v0001",
) -> RunRecord:
receipts = []
if skill_name:
receipts.append(
SkillActivationReceipt(
run_id=run_id,
session_id=session_id,
skill_name=skill_name,
skill_version=skill_version,
content_hash="hash",
activated_at="now",
activation_reason="selected",
)
)
return RunRecord(
run_id=run_id,
session_id=session_id,
task_id=task_id,
attempt_index=1,
task_text=task_text,
started_at=f"2026-06-08T00:00:{run_id[-2:]}+00:00",
ended_at="end",
success=True,
finish_reason="stop",
feedback={"acceptance_type": "accept"},
activated_skills=receipts,
)
def test_select_revise_cases_caps_at_ten_and_prefers_related_skill() -> None:
runs = [_run(f"run-{index:02d}", task_id=f"task-{index}", skill_name="debug", skill_version="v0001") for index in range(12)]
candidate = SkillLearningCandidate(
candidate_id="candidate-1",
kind="revise_skill",
source_run_ids=[],
source_session_ids=[],
related_skill_names=["debug"],
reason="revise",
evidence={"skill_version": "v0001"},
)
cases = select_replay_cases(candidate, runs)
assert len(cases) == 10
assert all(case["baseline_skill_names"] == ["debug"] for case in cases)
assert cases[0]["run_id"] == "run-11"
def test_select_new_skill_uses_all_available_source_runs_under_ten() -> None:
runs = [_run(f"run-{index:02d}", task_id=f"task-{index}") for index in range(3)]
candidate = SkillLearningCandidate(
candidate_id="candidate-1",
kind="new_skill",
source_run_ids=["run-00", "run-01", "run-02"],
source_session_ids=["session"],
related_skill_names=[],
reason="new",
)
cases = select_replay_cases(candidate, runs)
assert [case["run_id"] for case in cases] == ["run-02", "run-01", "run-00"]
assert all(case["baseline_skill_names"] == [] for case in cases)
- Step 2: Run tests to verify failure
Run:
cd app-instance/backend
pytest tests/unit/test_skill_learning_case_selection.py -v
Expected: FAIL because case_selection.py does not exist.
- Step 3: Implement selector
Create app-instance/backend/beaver/skills/learning/case_selection.py:
"""Historical replay case selection for skill draft evaluation."""
from __future__ import annotations
from typing import Any
from beaver.memory.runs import RunRecord
from beaver.memory.skills import SkillLearningCandidate
MAX_REPLAY_CASES = 10
def select_replay_cases(candidate: SkillLearningCandidate, runs: list[RunRecord]) -> list[dict[str, Any]]:
accepted = [record for record in runs if _is_accepted(record)]
if candidate.kind == "revise_skill":
selected = _select_revise(candidate, accepted)
elif candidate.kind == "merge_skills":
selected = _select_merge(candidate, accepted)
else:
selected = _select_new(candidate, accepted)
return [_case_payload(candidate, record) for record in selected[:MAX_REPLAY_CASES]]
def _select_revise(candidate: SkillLearningCandidate, runs: list[RunRecord]) -> list[RunRecord]:
target = candidate.related_skill_names[0] if candidate.related_skill_names else ""
version = str(candidate.evidence.get("skill_version") or "")
matches = [
record for record in runs
if any(receipt.skill_name == target and (not version or receipt.skill_version == version) for receipt in record.activated_skills)
]
return _recent_diverse(matches)
def _select_merge(candidate: SkillLearningCandidate, runs: list[RunRecord]) -> list[RunRecord]:
targets = set(candidate.related_skill_names)
matches = [
record for record in runs
if targets and targets.issubset({receipt.skill_name for receipt in record.activated_skills})
]
return _recent_diverse(matches)
def _select_new(candidate: SkillLearningCandidate, runs: list[RunRecord]) -> list[RunRecord]:
source_ids = set(candidate.source_run_ids)
if source_ids:
matches = [record for record in runs if record.run_id in source_ids]
else:
theme = str(candidate.evidence.get("theme") or "").lower().strip()
matches = [record for record in runs if theme and theme in record.task_text.lower()]
return _recent_diverse(matches)
def _case_payload(candidate: SkillLearningCandidate, record: RunRecord) -> dict[str, Any]:
baseline_skill_names = []
if candidate.kind == "revise_skill":
baseline_skill_names = list(candidate.related_skill_names[:1])
elif candidate.kind == "merge_skills":
baseline_skill_names = list(candidate.related_skill_names)
return {
"run_id": record.run_id,
"task_id": record.task_id,
"session_id": record.session_id,
"task_text": record.task_text,
"baseline_skill_names": baseline_skill_names,
"candidate_skill_name": candidate.draft_skill_name,
"accepted_score": _score(record),
}
def _recent_diverse(runs: list[RunRecord]) -> list[RunRecord]:
sorted_runs = sorted(runs, key=lambda item: (item.started_at, item.run_id), reverse=True)
result: list[RunRecord] = []
seen_tasks: set[str] = set()
for record in sorted_runs:
task_key = record.task_id or record.task_text
if task_key in seen_tasks and len(sorted_runs) > MAX_REPLAY_CASES:
continue
seen_tasks.add(task_key)
result.append(record)
if len(result) >= MAX_REPLAY_CASES:
break
if len(result) < min(len(sorted_runs), MAX_REPLAY_CASES):
seen_run_ids = {record.run_id for record in result}
result.extend(record for record in sorted_runs if record.run_id not in seen_run_ids)
return result[:MAX_REPLAY_CASES]
def _is_accepted(record: RunRecord) -> bool:
feedback = record.feedback or {}
acceptance = feedback.get("acceptance_type")
if acceptance is None and feedback.get("feedback_type") == "satisfied":
acceptance = "accept"
return bool(record.success) and acceptance == "accept"
def _score(record: RunRecord) -> float:
validation = record.validation_result or {}
value = validation.get("score") if isinstance(validation, dict) else None
if value is not None:
try:
return max(0.0, min(1.0, float(value)))
except (TypeError, ValueError):
pass
return 0.8 if record.success else 0.4
Export select_replay_cases from app-instance/backend/beaver/skills/learning/__init__.py.
- Step 4: Run selector tests
Run:
cd app-instance/backend
pytest tests/unit/test_skill_learning_case_selection.py -v
Expected: PASS.
- Step 5: Commit
git add app-instance/backend/beaver/skills/learning/case_selection.py app-instance/backend/beaver/skills/learning/__init__.py app-instance/backend/tests/unit/test_skill_learning_case_selection.py
git commit -m "feat(skill-learning): select replay eval cases"
Task 5: Add Replay Tool Policy And Trace Executor
Files:
-
Create:
app-instance/backend/beaver/skills/learning/replay.py -
Modify:
app-instance/backend/beaver/skills/learning/__init__.py -
Test:
app-instance/backend/tests/unit/test_skill_learning_replay.py -
Step 1: Write failing replay policy tests
Create app-instance/backend/tests/unit/test_skill_learning_replay.py:
from __future__ import annotations
import asyncio
from beaver.tools.base import BaseTool, ToolContext, ToolResult, ToolSpec
from beaver.tools.registry.tool_registry import ToolRegistry
from beaver.tools.runtime.executor import ToolExecutor
from beaver.skills.learning.replay import ReplayToolExecutor, ReplayToolPolicy, classify_tool_mode
class FakeTool(BaseTool):
def __init__(self, name: str, *, toolset: str = "filesystem", metadata: dict | None = None) -> None:
self._spec = ToolSpec(
name=name,
description=f"{name} tool",
input_schema={"type": "object", "properties": {"path": {"type": "string"}}},
toolset=toolset,
metadata=metadata or {},
)
@property
def spec(self) -> ToolSpec:
return self._spec
async def invoke(self, arguments: dict, context: ToolContext) -> ToolResult:
return ToolResult(success=True, content=f"executed:{arguments}", tool_name=self.spec.name)
def _executor(*tools: FakeTool) -> ReplayToolExecutor:
registry = ToolRegistry()
for tool in tools:
registry.register(tool)
return ReplayToolExecutor(ToolExecutor(registry), registry=registry, policy=ReplayToolPolicy())
def test_classify_tool_modes_from_spec() -> None:
assert classify_tool_mode(FakeTool("read_file").spec) == "executed"
assert classify_tool_mode(FakeTool("write_file").spec) == "executed"
assert classify_tool_mode(FakeTool("mcp_outlook_send_email", toolset="mcp", metadata={"transport": "mcp"}).spec) == "surrogate"
assert classify_tool_mode(FakeTool("delete_account", toolset="mcp", metadata={"transport": "mcp"}).spec) == "blocked"
def test_replay_executor_executes_safe_tool_and_records_trace() -> None:
executor = _executor(FakeTool("write_file"))
result = asyncio.run(executor.execute("write_file", {"path": "a.txt"}, context=ToolContext(workspace="/tmp/replay")))
assert result.success is True
assert result.content.startswith("executed:")
assert executor.traces[0]["mode"] == "executed"
assert executor.traces[0]["tool_name"] == "write_file"
def test_replay_executor_surrogates_external_write_and_blocks_destructive() -> None:
executor = _executor(
FakeTool("mcp_outlook_send_email", toolset="mcp", metadata={"transport": "mcp"}),
FakeTool("delete_account", toolset="mcp", metadata={"transport": "mcp"}),
)
send = asyncio.run(executor.execute("mcp_outlook_send_email", {"to": "ada@example.com"}, context=ToolContext()))
delete = asyncio.run(executor.execute("delete_account", {"id": "1"}, context=ToolContext()))
assert send.success is True
assert send.error == "replay_surrogate"
assert delete.success is False
assert delete.error == "replay_blocked"
assert [trace["mode"] for trace in executor.traces] == ["surrogate", "blocked"]
- Step 2: Run tests to verify failure
Run:
cd app-instance/backend
pytest tests/unit/test_skill_learning_replay.py -v
Expected: FAIL because replay.py does not exist.
- Step 3: Implement replay policy and executor
Create app-instance/backend/beaver/skills/learning/replay.py:
"""Replay execution helpers for skill draft evaluation."""
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Any, Literal
from uuid import uuid4
from beaver.tools.base import ToolContext, ToolResult, ToolSpec
from beaver.tools.registry.tool_registry import ToolRegistry
from beaver.tools.runtime.executor import ToolExecutor
ToolExecutionMode = Literal["executed", "surrogate", "blocked"]
@dataclass(slots=True)
class ReplayToolPolicy:
safe_toolsets: set[str] = field(default_factory=lambda: {"filesystem", "user_files", "core", "web", "search"})
surrogate_transports: set[str] = field(default_factory=lambda: {"mcp", "connector"})
destructive_terms: tuple[str, ...] = (
"delete",
"remove",
"destroy",
"revoke",
"permission",
"credential",
"payment",
"pay",
)
external_write_terms: tuple[str, ...] = (
"send",
"post",
"publish",
"create",
"update",
"invite",
"reply",
"forward",
)
class ReplayToolExecutor:
def __init__(
self,
inner: ToolExecutor,
*,
registry: ToolRegistry,
policy: ReplayToolPolicy | None = None,
) -> None:
self.inner = inner
self.registry = registry
self.policy = policy or ReplayToolPolicy()
self.traces: list[dict[str, Any]] = []
async def execute(
self,
tool_name: str,
arguments: dict[str, Any] | None,
*,
context: ToolContext | None = None,
) -> ToolResult:
tool = self.registry.get(tool_name)
spec = tool.spec if tool is not None else ToolSpec(
name=tool_name,
description="unregistered tool",
input_schema={"type": "object", "properties": {}},
toolset="unknown",
)
mode = classify_tool_mode(spec, self.policy)
trace = {
"trace_id": uuid4().hex,
"tool_name": tool_name,
"mode": mode,
"arguments": dict(arguments or {}),
"schema": dict(spec.input_schema),
"toolset": spec.toolset,
"metadata": dict(spec.metadata),
"classification_reason": _classification_reason(spec, mode),
}
if mode == "executed":
result = await self.inner.execute(tool_name, arguments or {}, context=context)
trace["result"] = {"success": result.success, "error": result.error, "content": result.content[:2000]}
self.traces.append(trace)
return result
if mode == "surrogate":
trace["result"] = {"success": True, "error": "replay_surrogate", "content": "Tool call recorded for surrogate evaluation."}
self.traces.append(trace)
return ToolResult(
success=True,
content="Tool call recorded for surrogate evaluation.",
tool_name=tool_name,
error="replay_surrogate",
raw_output=trace,
)
trace["result"] = {"success": False, "error": "replay_blocked", "content": "Tool call blocked by replay policy."}
self.traces.append(trace)
return ToolResult(
success=False,
content="Tool call blocked by replay policy.",
tool_name=tool_name,
error="replay_blocked",
raw_output=trace,
)
async def execute_tool_call(self, tool_call: Any, *, context: ToolContext | None = None) -> ToolResult:
tool_name, arguments = ToolExecutor._normalize_tool_call(tool_call)
return await self.execute(tool_name, arguments, context=context)
def classify_tool_mode(spec: ToolSpec, policy: ReplayToolPolicy | None = None) -> ToolExecutionMode:
policy = policy or ReplayToolPolicy()
name = spec.name.lower()
toolset = spec.toolset.lower()
metadata = {str(key).lower(): str(value).lower() for key, value in spec.metadata.items()}
if any(term in name for term in policy.destructive_terms):
return "blocked"
if toolset in policy.safe_toolsets:
return "executed"
if metadata.get("transport") in policy.surrogate_transports or toolset in {"mcp", "connector", "external"}:
if any(term in name for term in policy.external_write_terms):
return "surrogate"
return "executed"
return "surrogate"
def _classification_reason(spec: ToolSpec, mode: ToolExecutionMode) -> str:
return f"{spec.name} classified as {mode} from toolset={spec.toolset} metadata={spec.metadata}"
Export ReplayToolExecutor, ReplayToolPolicy, and classify_tool_mode from app-instance/backend/beaver/skills/learning/__init__.py.
- Step 4: Run replay policy tests
Run:
cd app-instance/backend
pytest tests/unit/test_skill_learning_replay.py -v
Expected: PASS.
- Step 5: Commit
git add app-instance/backend/beaver/skills/learning/replay.py app-instance/backend/beaver/skills/learning/__init__.py app-instance/backend/tests/unit/test_skill_learning_replay.py
git commit -m "feat(skill-learning): add replay tool policy"
Task 6: Add AgentLoop Replay Executor Injection
Files:
-
Modify:
app-instance/backend/beaver/engine/loop.py -
Test:
app-instance/backend/tests/unit/test_agent_loop_replay_executor.py -
Step 1: Write failing replay injection test
Create app-instance/backend/tests/unit/test_agent_loop_replay_executor.py:
from __future__ import annotations
from pathlib import Path
from types import SimpleNamespace
import pytest
from beaver.engine.loop import AgentLoop
from beaver.engine.providers.base import LLMProvider, LLMResponse, ToolCallRequest
from beaver.engine.providers.factory import ProviderBundle
from beaver.skills.learning.replay import ReplayToolExecutor, ReplayToolPolicy
class ToolCallingProvider(LLMProvider):
def __init__(self) -> None:
self.calls = 0
async def chat(self, messages: list[dict], tools: list[dict] | None = None, model: str | None = None, max_tokens: int = 4096, temperature: float = 0.7) -> LLMResponse:
self.calls += 1
if self.calls == 1:
return LLMResponse(
content="",
tool_calls=[
ToolCallRequest(
id="call-1",
name="read_file",
arguments={"path": "README.md"},
)
],
)
return LLMResponse(content="done")
def get_default_model(self) -> str:
return "stub"
@pytest.mark.asyncio
async def test_process_direct_uses_replay_tool_executor(tmp_path: Path) -> None:
loop = AgentLoop(workspace=tmp_path)
loaded = loop.boot()
provider = ToolCallingProvider()
runtime = SimpleNamespace(model="stub", provider_name="stub")
replay_executor = ReplayToolExecutor(
loaded.tool_executor,
registry=loaded.tool_registry,
policy=ReplayToolPolicy(),
)
result = await loop.process_direct(
"Read the README.",
provider_bundle=ProviderBundle(main_runtime=runtime, main_provider=provider), # type: ignore[arg-type]
include_skill_assembly=False,
pinned_skill_names=[],
tool_executor_override=replay_executor,
max_tool_iterations=2,
source="skill_replay_eval",
)
assert result.output_text == "done"
assert replay_executor.traces
assert replay_executor.traces[0]["tool_name"] == "read_file"
- Step 2: Run test to verify failure
Run:
cd app-instance/backend
pytest tests/unit/test_agent_loop_replay_executor.py -v
Expected: FAIL because process_direct() does not accept tool_executor_override.
- Step 3: Add optional executor override to AgentLoop
In app-instance/backend/beaver/engine/loop.py, add tool_executor_override: Any = None to process_direct() and _process_direct_impl() keyword parameters. Pass the argument from process_direct() into _process_direct_impl().
After tool_executor = self._require_loaded("tool_executor"), add:
effective_tool_executor = tool_executor_override or tool_executor
Replace this line inside the tool loop:
result = await tool_executor.execute_tool_call(tool_call, context=tool_context)
with:
result = await effective_tool_executor.execute_tool_call(tool_call, context=tool_context)
- Step 4: Run replay injection test
Run:
cd app-instance/backend
pytest tests/unit/test_agent_loop_replay_executor.py -v
Expected: PASS.
- Step 5: Run a direct-loop regression test
Run:
cd app-instance/backend
pytest tests/unit/test_skill_learning_eval.py::test_eval_provider_unavailable_is_skipped_not_failed -v
Expected: PASS.
- Step 6: Commit
git add app-instance/backend/beaver/engine/loop.py app-instance/backend/tests/unit/test_agent_loop_replay_executor.py
git commit -m "feat(engine): allow replay tool executor injection"
Task 7: Add Surrogate Evaluator
Files:
-
Create:
app-instance/backend/beaver/skills/learning/surrogate.py -
Modify:
app-instance/backend/beaver/skills/learning/__init__.py -
Test:
app-instance/backend/tests/unit/test_skill_learning_surrogate.py -
Step 1: Write failing surrogate tests
Create app-instance/backend/tests/unit/test_skill_learning_surrogate.py:
from __future__ import annotations
import asyncio
from beaver.skills.learning.surrogate import SurrogateToolEvaluator
def test_surrogate_scores_complete_candidate_higher_than_missing_baseline() -> None:
evaluator = SurrogateToolEvaluator()
baseline = {
"arm": "baseline",
"tool_calls": [
{"tool_name": "mcp_outlook_send_email", "mode": "surrogate", "arguments": {"to": "", "subject": ""}},
],
}
candidate = {
"arm": "candidate",
"tool_calls": [
{
"tool_name": "mcp_outlook_send_email",
"mode": "surrogate",
"arguments": {"to": "ada@example.com", "subject": "Status", "body": "Done"},
},
],
}
result = asyncio.run(evaluator.evaluate(task_text="Send a status email to Ada.", baseline=baseline, candidate=candidate))
assert result["candidate_score"] > result["baseline_score"]
assert result["surrogate_tool_count"] == 2
assert result["confidence"] in {"low", "medium"}
- Step 2: Run tests to verify failure
Run:
cd app-instance/backend
pytest tests/unit/test_skill_learning_surrogate.py -v
Expected: FAIL because surrogate.py does not exist.
- Step 3: Implement deterministic surrogate v1
Create app-instance/backend/beaver/skills/learning/surrogate.py:
"""Surrogate evaluation for replay tool calls that cannot execute safely."""
from __future__ import annotations
from typing import Any
class SurrogateToolEvaluator:
async def evaluate(self, *, task_text: str, baseline: dict[str, Any], candidate: dict[str, Any]) -> dict[str, Any]:
baseline_score = _score_arm(task_text, baseline)
candidate_score = _score_arm(task_text, candidate)
surrogate_count = _mode_count(baseline, "surrogate") + _mode_count(candidate, "surrogate")
blocked_count = _mode_count(baseline, "blocked") + _mode_count(candidate, "blocked")
confidence = "low" if blocked_count else ("medium" if surrogate_count <= 2 else "low")
return {
"baseline_score": baseline_score,
"candidate_score": candidate_score,
"delta": round(candidate_score - baseline_score, 4),
"surrogate_tool_count": surrogate_count,
"blocked_tool_count": blocked_count,
"confidence": confidence,
"notes": [
"Surrogate score is based on intended tool calls, schemas, arguments, and task relevance.",
],
}
def _score_arm(task_text: str, arm: dict[str, Any]) -> float:
calls = [item for item in arm.get("tool_calls") or [] if isinstance(item, dict)]
if not calls:
return 0.5
scores = [_score_call(task_text, call) for call in calls]
return round(sum(scores) / len(scores), 4)
def _score_call(task_text: str, call: dict[str, Any]) -> float:
if call.get("mode") == "blocked":
return 0.2
if call.get("mode") == "executed":
result = call.get("result") if isinstance(call.get("result"), dict) else {}
return 0.85 if result.get("success") is not False else 0.35
arguments = dict(call.get("arguments") or {})
if not arguments:
return 0.45
non_empty = sum(1 for value in arguments.values() if str(value).strip())
completeness = non_empty / max(1, len(arguments))
argument_text = " ".join(str(value).lower() for value in arguments.values())
relevance = 0.15 if any(token and token in argument_text for token in task_text.lower().split()[:16]) else 0.0
return round(min(0.9, 0.5 + 0.3 * completeness + relevance), 4)
def _mode_count(arm: dict[str, Any], mode: str) -> int:
return sum(1 for item in arm.get("tool_calls") or [] if isinstance(item, dict) and item.get("mode") == mode)
Export SurrogateToolEvaluator from __init__.py.
- Step 4: Run surrogate tests
Run:
cd app-instance/backend
pytest tests/unit/test_skill_learning_surrogate.py -v
Expected: PASS.
- Step 5: Commit
git add app-instance/backend/beaver/skills/learning/surrogate.py app-instance/backend/beaver/skills/learning/__init__.py app-instance/backend/tests/unit/test_skill_learning_surrogate.py
git commit -m "feat(skill-learning): add surrogate tool evaluator"
Task 8: Add Replay Arm Runner
Files:
-
Modify:
app-instance/backend/beaver/skills/learning/replay.py -
Test:
app-instance/backend/tests/unit/test_skill_learning_replay_runner.py -
Step 1: Write failing arm runner test
Create app-instance/backend/tests/unit/test_skill_learning_replay_runner.py:
from __future__ import annotations
import asyncio
from types import SimpleNamespace
from beaver.skills.learning.replay import ReplayArmRequest, ReplayRunner
class FakeAgentLoop:
def boot(self):
return SimpleNamespace(tool_executor=SimpleNamespace(), tool_registry=SimpleNamespace(get=lambda name: None))
async def process_direct(self, task: str, **kwargs):
executor = kwargs["tool_executor_override"]
await executor.execute("mcp_outlook_send_email", {"to": "ada@example.com"})
return SimpleNamespace(session_id="session-replay", run_id="run-replay", output_text="done", finish_reason="stop")
def test_replay_runner_returns_arm_report_with_tool_trace() -> None:
runner = ReplayRunner(agent_loop=FakeAgentLoop())
request = ReplayArmRequest(
case_id="case-1",
arm="candidate",
task_text="Send a status email to Ada.",
pinned_skill_names=[],
pinned_skill_contexts=[],
provider_bundle=object(),
model_settings={"max_tool_iterations": 2},
)
report = asyncio.run(runner.run_arm(request))
assert report["case_id"] == "case-1"
assert report["arm"] == "candidate"
assert report["finish_reason"] == "stop"
assert report["tool_calls"][0]["tool_name"] == "mcp_outlook_send_email"
- Step 2: Run test to verify failure
Run:
cd app-instance/backend
pytest tests/unit/test_skill_learning_replay_runner.py -v
Expected: FAIL because ReplayRunner and ReplayArmRequest are not implemented.
- Step 3: Add arm request and runner
Append to app-instance/backend/beaver/skills/learning/replay.py:
@dataclass(slots=True)
class ReplayArmRequest:
case_id: str
arm: str
task_text: str
pinned_skill_names: list[str] = field(default_factory=list)
pinned_skill_contexts: list[Any] = field(default_factory=list)
provider_bundle: Any | None = None
model_settings: dict[str, Any] = field(default_factory=dict)
class ReplayRunner:
def __init__(self, *, agent_loop: Any, policy: ReplayToolPolicy | None = None) -> None:
self.agent_loop = agent_loop
self.policy = policy or ReplayToolPolicy()
async def run_arm(self, request: ReplayArmRequest) -> dict[str, Any]:
loaded = self.agent_loop.boot()
replay_executor = ReplayToolExecutor(
loaded.tool_executor,
registry=loaded.tool_registry,
policy=self.policy,
)
result = await self.agent_loop.process_direct(
request.task_text,
provider_bundle=request.provider_bundle,
include_skill_assembly=False,
include_tools=True,
pinned_skill_names=request.pinned_skill_names,
pinned_skill_contexts=request.pinned_skill_contexts,
max_tool_iterations=int(request.model_settings.get("max_tool_iterations") or 4),
temperature=float(request.model_settings.get("temperature") or 0.0),
source="skill_replay_eval",
tool_executor_override=replay_executor,
)
return {
"case_id": request.case_id,
"arm": request.arm,
"session_id": result.session_id,
"run_id": result.run_id,
"task_text": request.task_text,
"finish_reason": result.finish_reason,
"final_answer": result.output_text,
"tool_calls": list(replay_executor.traces),
"artifacts": [],
"side_effects": _side_effects_from_traces(replay_executor.traces),
}
def _side_effects_from_traces(traces: list[dict[str, Any]]) -> list[dict[str, Any]]:
effects: list[dict[str, Any]] = []
for trace in traces:
if trace.get("mode") in {"surrogate", "blocked"}:
effects.append(
{
"tool_name": trace.get("tool_name"),
"mode": trace.get("mode"),
"arguments": trace.get("arguments"),
"classification_reason": trace.get("classification_reason"),
}
)
return effects
Export ReplayRunner and ReplayArmRequest from __init__.py.
- Step 4: Run arm runner test
Run:
cd app-instance/backend
pytest tests/unit/test_skill_learning_replay_runner.py -v
Expected: PASS.
- Step 5: Commit
git add app-instance/backend/beaver/skills/learning/replay.py app-instance/backend/beaver/skills/learning/__init__.py app-instance/backend/tests/unit/test_skill_learning_replay_runner.py
git commit -m "feat(skill-learning): run replay arms through agent loop"
Task 9: Orchestrate Replay Eval Reports
Files:
-
Modify:
app-instance/backend/beaver/skills/learning/eval.py -
Modify:
app-instance/backend/beaver/skills/learning/pipeline.py -
Test:
app-instance/backend/tests/unit/test_skill_learning_eval.py -
Step 1: Add failing replay eval test
Append to app-instance/backend/tests/unit/test_skill_learning_eval.py:
class FakeReplayRunner:
async def run_arm(self, request):
return {
"case_id": request.case_id,
"arm": request.arm,
"session_id": "session-replay",
"run_id": f"{request.arm}-run",
"task_text": request.task_text,
"finish_reason": "stop",
"final_answer": "done",
"tool_calls": [
{
"tool_name": "write_file",
"mode": "executed",
"arguments": {"path": "README.md"},
"result": {"success": True, "content": "ok"},
}
],
"artifacts": [],
"side_effects": [],
}
def test_eval_report_includes_replay_case_and_coverage(tmp_path: Path) -> None:
pipeline = _pipeline(tmp_path)
draft = pipeline.draft_service.create_new_skill_draft(
skill_name="release-checklist",
proposed_content="# Release\n\nRun tests.",
proposed_frontmatter={"description": "release", "tools": []},
created_by="test",
reason="test",
)
pipeline.learning_store.update_learning_candidate(
"candidate-1",
draft_skill_name=draft.skill_name,
draft_id=draft.draft_id,
)
report = asyncio.run(
pipeline.evaluate_draft(
"candidate-1",
draft.skill_name,
draft.draft_id,
provider_bundle=_bundle(),
replay_runner=FakeReplayRunner(),
)
)
assert report.mode == "replay"
assert report.eval_version == "replay-v1"
assert report.case_reports
assert 0.0 <= report.execution_coverage <= 1.0
assert 0.0 <= report.surrogate_coverage <= 1.0
assert report.confidence in {"low", "medium", "high"}
- Step 2: Run the new test to verify failure
Run:
cd app-instance/backend
pytest tests/unit/test_skill_learning_eval.py::test_eval_report_includes_replay_case_and_coverage -v
Expected: FAIL because evaluator still returns mode="heuristic".
- Step 3: Update evaluator imports
In app-instance/backend/beaver/skills/learning/eval.py, import:
from beaver.engine.context.builder import SkillContext
from beaver.skills.learning.case_selection import select_replay_cases
from beaver.skills.learning.replay import ReplayArmRequest, ReplayRunner
from beaver.skills.learning.surrogate import SurrogateToolEvaluator
from beaver.skills.learning.preservation import check_preservation
- Step 4: Add optional replay runner dependency
Update SkillDraftEvaluator.__init__():
def __init__(
self,
run_store: RunMemoryStore,
*,
surrogate_evaluator: SurrogateToolEvaluator | None = None,
) -> None:
self.run_store = run_store
self.surrogate_evaluator = surrogate_evaluator or SurrogateToolEvaluator()
Update evaluate() to accept the runner explicitly:
async def evaluate(
self,
candidate: SkillLearningCandidate,
draft: SkillDraft,
*,
provider_bundle: ProviderBundle | None = None,
replay_runner: ReplayRunner | None = None,
) -> SkillDraftEvalReport:
Keep the existing provider unavailable branch. If provider_bundle is absent, return the current skipped-provider report. If replay_runner is absent, use the extracted heuristic evaluator so tests and workers without a loop do not crash.
- Step 5: Pass replay runner through pipeline
In app-instance/backend/beaver/skills/learning/pipeline.py, update evaluate_draft():
async def evaluate_draft(
self,
candidate_id: str,
skill_name: str,
draft_id: str,
*,
provider_bundle: ProviderBundle | None = None,
replay_runner: ReplayRunner | None = None,
) -> SkillDraftEvalReport:
Pass replay_runner=replay_runner to evaluator.evaluate(...).
- Step 6: Build replay cases in evaluate()
Inside evaluate(), replace the heuristic-only case loop with:
runs = self.run_store.list_runs()
replay_cases = select_replay_cases(candidate, runs)
if replay_runner is not None and replay_cases:
return await self._evaluate_replay(
candidate=candidate,
draft=draft,
replay_cases=replay_cases,
provider_bundle=provider_bundle,
replay_runner=replay_runner,
)
Keep the existing heuristic body as _evaluate_heuristic() and call it when no replay cases exist.
- Step 7: Add
_evaluate_replay()
Add method:
async def _evaluate_replay(
self,
*,
candidate: SkillLearningCandidate,
draft: SkillDraft,
replay_cases: list[dict],
provider_bundle: ProviderBundle,
replay_runner: ReplayRunner,
) -> SkillDraftEvalReport:
case_reports: list[dict] = []
legacy_cases: list[dict] = []
for case in replay_cases:
baseline = await replay_runner.run_arm(
ReplayArmRequest(
case_id=f"{case['run_id']}:baseline",
arm="baseline",
task_text=str(case["task_text"]),
pinned_skill_names=list(case.get("baseline_skill_names") or []),
pinned_skill_contexts=[],
provider_bundle=provider_bundle,
model_settings={"max_tool_iterations": 4, "temperature": 0.0},
)
)
candidate_arm = await replay_runner.run_arm(
ReplayArmRequest(
case_id=f"{case['run_id']}:candidate",
arm="candidate",
task_text=str(case["task_text"]),
pinned_skill_names=[],
pinned_skill_contexts=[_draft_skill_context(draft)],
provider_bundle=provider_bundle,
model_settings={"max_tool_iterations": 4, "temperature": 0.0},
)
)
surrogate = await self.surrogate_evaluator.evaluate(
task_text=str(case["task_text"]),
baseline=baseline,
candidate=candidate_arm,
)
baseline_score = surrogate["baseline_score"]
candidate_score = surrogate["candidate_score"]
case_report = {
"run_id": case["run_id"],
"task_id": case.get("task_id"),
"session_id": case.get("session_id"),
"baseline": baseline,
"candidate": candidate_arm,
"baseline_score": baseline_score,
"candidate_score": candidate_score,
"delta": round(candidate_score - baseline_score, 4),
"confidence": surrogate["confidence"],
"validator_notes": list(surrogate.get("notes") or []),
}
case_reports.append(case_report)
legacy_cases.append(
{
"run_id": case["run_id"],
"session_id": case.get("session_id") or "",
"baseline_score": baseline_score,
"candidate_score": candidate_score,
"delta": round(candidate_score - baseline_score, 4),
}
)
preservation_report = _preservation_report(candidate, draft)
return _report_from_case_reports(candidate, draft, case_reports, legacy_cases, preservation_report)
- Step 8: Add helper functions
Add module-level helpers in eval.py:
def _draft_skill_context(draft: SkillDraft) -> SkillContext:
tool_hints = draft.proposed_frontmatter.get("tools")
return SkillContext(
name=f"draft:{draft.skill_name}",
content=draft.proposed_content,
version=draft.draft_id,
content_hash="draft",
activation_reason="skill_replay_eval_candidate",
tool_hints=[str(item) for item in tool_hints if str(item).strip()] if isinstance(tool_hints, list) else [],
)
def _preservation_report(candidate: SkillLearningCandidate, draft: SkillDraft) -> dict | None:
if candidate.kind not in {"revise_skill", "merge_skills"}:
return None
base_content = str(candidate.evidence.get("base_content") or "") if isinstance(candidate.evidence, dict) else ""
if not base_content.strip():
return None
return check_preservation(base_content=base_content, draft_content=draft.proposed_content)
def _report_from_case_reports(
candidate: SkillLearningCandidate,
draft: SkillDraft,
case_reports: list[dict],
legacy_cases: list[dict],
preservation_report: dict | None,
) -> SkillDraftEvalReport:
baseline_avg = sum(item["baseline_score"] for item in legacy_cases) / len(legacy_cases)
candidate_avg = sum(item["candidate_score"] for item in legacy_cases) / len(legacy_cases)
regressions = [item for item in legacy_cases if item["candidate_score"] < item["baseline_score"]]
improved = [item for item in legacy_cases if item["candidate_score"] > item["baseline_score"]]
unchanged = len(legacy_cases) - len(regressions) - len(improved)
execution, surrogate, blocked = _coverage(case_reports)
confidence = _confidence(execution, surrogate, blocked, [item.get("confidence") for item in case_reports])
score_delta = candidate_avg - baseline_avg
passed = candidate_avg >= 0.75 and not (regressions and score_delta <= 0) and blocked < 1.0
return SkillDraftEvalReport(
report_id=uuid4().hex,
skill_name=draft.skill_name,
draft_id=draft.draft_id,
candidate_id=candidate.candidate_id,
passed=passed,
baseline_score_avg=round(baseline_avg, 4),
candidate_score_avg=round(candidate_avg, 4),
score_delta=round(score_delta, 4),
regression_count=len(regressions),
improved_count=len(improved),
unchanged_count=unchanged,
cases=legacy_cases,
status="completed",
created_at=_utc_now(),
eval_version="replay-v1",
mode="replay",
execution_coverage=execution,
surrogate_coverage=surrogate,
blocked_coverage=blocked,
confidence=confidence,
case_reports=case_reports,
tool_mode_summary={"executed": execution, "surrogate": surrogate, "blocked": blocked},
preservation_report=preservation_report,
)
Add _coverage() and _confidence():
def _coverage(case_reports: list[dict]) -> tuple[float, float, float]:
counts = {"executed": 0, "surrogate": 0, "blocked": 0}
for report in case_reports:
for arm_name in ("baseline", "candidate"):
arm = report.get(arm_name) if isinstance(report.get(arm_name), dict) else {}
for call in arm.get("tool_calls") or []:
if isinstance(call, dict) and call.get("mode") in counts:
counts[str(call["mode"])] += 1
total = sum(counts.values())
if total == 0:
return 1.0, 0.0, 0.0
return (
round(counts["executed"] / total, 4),
round(counts["surrogate"] / total, 4),
round(counts["blocked"] / total, 4),
)
def _confidence(execution: float, surrogate: float, blocked: float, case_confidences: list[object]) -> str:
if blocked > 0.0:
return "low"
if execution >= 0.75 and surrogate <= 0.25:
return "high"
if execution >= 0.25 or "medium" in case_confidences:
return "medium"
return "low"
- Step 9: Run eval test
Run:
cd app-instance/backend
pytest tests/unit/test_skill_learning_eval.py::test_eval_report_includes_replay_case_and_coverage -v
Expected: PASS.
- Step 10: Run all skill learning backend tests touched so far
Run:
cd app-instance/backend
pytest tests/unit/test_skill_learning_eval_report_model.py tests/unit/test_skill_learning_case_selection.py tests/unit/test_skill_learning_preservation.py tests/unit/test_skill_learning_replay.py tests/unit/test_skill_learning_replay_runner.py tests/unit/test_skill_learning_surrogate.py tests/unit/test_skill_learning_eval.py -v
Expected: PASS.
- Step 11: Commit
git add app-instance/backend/beaver/skills/learning/eval.py app-instance/backend/beaver/skills/learning/pipeline.py app-instance/backend/tests/unit/test_skill_learning_eval.py
git commit -m "feat(skill-learning): produce replay eval reports"
Task 10: Apply Preservation Report During Eval And Publish Gates
Files:
-
Modify:
app-instance/backend/beaver/skills/learning/eval.py -
Modify:
app-instance/backend/beaver/skills/learning/pipeline.py -
Test:
app-instance/backend/tests/unit/test_skill_learning_pipeline.py -
Step 1: Write failing publish gate test
Append to app-instance/backend/tests/unit/test_skill_learning_pipeline.py:
def test_publish_blocks_low_confidence_replay_report(tmp_path: Path) -> None:
pipeline = _pipeline(tmp_path)
draft = pipeline.draft_service.create_new_skill_draft(
skill_name="low-confidence",
proposed_content="# Low\n\nDo it.",
proposed_frontmatter={"description": "low", "tools": ["mcp_outlook_send_email"]},
created_by="test",
reason="test",
)
pipeline.learning_store.write_eval_report(
SkillDraftEvalReport(
report_id="eval-low",
skill_name=draft.skill_name,
draft_id=draft.draft_id,
candidate_id="candidate-1",
passed=True,
baseline_score_avg=0.7,
candidate_score_avg=0.9,
score_delta=0.2,
regression_count=0,
improved_count=1,
unchanged_count=0,
confidence="low",
mode="replay",
eval_version="replay-v1",
execution_coverage=0.0,
surrogate_coverage=1.0,
blocked_coverage=0.0,
)
)
pipeline.check_safety(draft.skill_name, draft.draft_id)
pipeline.submit_review(draft.skill_name, draft.draft_id, requested_by="tester")
pipeline.approve(draft.skill_name, draft.draft_id, reviewer="tester")
with pytest.raises(ValueError, match="low confidence"):
pipeline.publish(draft.skill_name, draft.draft_id, publisher="tester")
Add import if missing:
from beaver.memory.skills import SkillDraftEvalReport
- Step 2: Run test to verify failure
Run:
cd app-instance/backend
pytest tests/unit/test_skill_learning_pipeline.py::test_publish_blocks_low_confidence_replay_report -v
Expected: FAIL because publish gate does not check confidence.
- Step 3: Update publish gates
In SkillLearningPipelineService._validate_publish_gates(), after the existing eval pass check, add:
if eval_report is not None and eval_report.mode == "replay":
if eval_report.confidence == "low":
raise ValueError("Draft replay eval has low confidence and requires revision before publish")
if eval_report.blocked_coverage >= 1.0:
raise ValueError("Draft replay eval blocked all important tool calls")
preservation = eval_report.preservation_report or {}
if preservation.get("passed") is False:
raise ValueError("Draft preservation check did not pass")
- Step 4: Add preservation gate test
Append to app-instance/backend/tests/unit/test_skill_learning_pipeline.py:
def test_publish_blocks_failed_preservation_report(tmp_path: Path) -> None:
pipeline = _pipeline(tmp_path)
draft = pipeline.draft_service.create_new_skill_draft(
skill_name="dropped-section",
proposed_content="# Skill\n\n## Workflow\n\nDo it.",
proposed_frontmatter={"description": "dropped", "tools": []},
created_by="test",
reason="test",
)
pipeline.learning_store.write_eval_report(
SkillDraftEvalReport(
report_id="eval-preservation",
skill_name=draft.skill_name,
draft_id=draft.draft_id,
candidate_id="candidate-1",
passed=True,
baseline_score_avg=0.7,
candidate_score_avg=0.9,
score_delta=0.2,
regression_count=0,
improved_count=1,
unchanged_count=0,
confidence="medium",
mode="replay",
eval_version="replay-v1",
preservation_report={"passed": False, "risk_level": "high", "dropped_sections": ["Safety"]},
)
)
pipeline.check_safety(draft.skill_name, draft.draft_id)
pipeline.submit_review(draft.skill_name, draft.draft_id, requested_by="tester")
pipeline.approve(draft.skill_name, draft.draft_id, reviewer="tester")
with pytest.raises(ValueError, match="preservation"):
pipeline.publish(draft.skill_name, draft.draft_id, publisher="tester")
- Step 5: Run pipeline tests
Run:
cd app-instance/backend
pytest tests/unit/test_skill_learning_pipeline.py::test_publish_blocks_low_confidence_replay_report tests/unit/test_skill_learning_pipeline.py::test_publish_blocks_failed_preservation_report -v
Expected: PASS.
- Step 6: Run pipeline suite
Run:
cd app-instance/backend
pytest tests/unit/test_skill_learning_pipeline.py tests/unit/test_skill_learning_eval.py -v
Expected: PASS.
- Step 7: Commit
git add app-instance/backend/beaver/skills/learning/eval.py app-instance/backend/beaver/skills/learning/pipeline.py app-instance/backend/tests/unit/test_skill_learning_pipeline.py
git commit -m "feat(skill-learning): gate publish on replay confidence"
Task 11: Update Web API Types And Skills UI
Files:
-
Modify:
app-instance/backend/beaver/interfaces/web/app.py -
Modify:
app-instance/frontend/types/index.ts -
Modify:
app-instance/frontend/app/(app)/skills/page.tsx -
Step 1: Inject replay runner in Web API draft evaluation
In app-instance/backend/beaver/interfaces/web/app.py, import:
from beaver.skills.learning.replay import ReplayRunner
In both synthesize_skill_draft() and regenerate_skill_draft(), keep the loop object instead of only the loaded result:
loop = agent_service.create_loop()
loaded = loop.boot()
Update each evaluate_draft(...) call:
await loaded.skill_learning_pipeline.evaluate_draft( # type: ignore[union-attr]
candidate_id,
draft.skill_name,
draft.draft_id,
provider_bundle=provider_bundle,
replay_runner=ReplayRunner(agent_loop=loop),
)
This makes the normal UI path use real replay arms through the isolated skill_replay_eval source.
- Step 2: Update TypeScript types
In app-instance/frontend/types/index.ts, extend SkillDraftEvalReport:
export interface SkillDraftEvalReport {
report_id: string;
skill_name: string;
draft_id: string;
candidate_id: string;
passed: boolean;
baseline_score_avg: number;
candidate_score_avg: number;
score_delta: number;
regression_count: number;
improved_count: number;
unchanged_count: number;
cases: Array<Record<string, unknown>>;
status: string;
created_at: string;
eval_version?: string;
mode?: 'heuristic' | 'replay' | string;
execution_coverage?: number;
surrogate_coverage?: number;
blocked_coverage?: number;
confidence?: 'low' | 'medium' | 'high' | string;
case_reports?: Array<Record<string, unknown>>;
tool_mode_summary?: Record<string, unknown>;
preservation_report?: Record<string, unknown> | null;
}
- Step 3: Add UI metric tiles
In EvalReportPanel in app-instance/frontend/app/(app)/skills/page.tsx, add metric tiles after Delta:
<MetricTile label={t('执行覆盖', 'Execution')} value={formatPercent(report.execution_coverage)} />
<MetricTile label={t('替代评估', 'Surrogate')} value={formatPercent(report.surrogate_coverage)} />
<MetricTile label={t('置信度', 'Confidence')} value={report.confidence || 'low'} />
Add helper near formatScore:
function formatPercent(value?: number | null): string {
if (typeof value !== 'number' || Number.isNaN(value)) return '0%';
return `${Math.round(value * 100)}%`;
}
- Step 4: Add replay details sections
Inside EvalReportPanel, after existing replay cases table, add:
{Array.isArray(report.case_reports) && report.case_reports.length > 0 ? (
<RawDetails title={t('Replay case reports', 'Replay case reports')} value={report.case_reports} />
) : null}
{report.preservation_report ? (
<RawDetails title={t('Preservation report', 'Preservation report')} value={report.preservation_report} />
) : null}
- Step 5: Run backend web API test
Run:
cd app-instance/backend
pytest tests/unit/test_skill_learning_web_api.py -v
Expected: PASS.
- Step 6: Run frontend type check or lint
Run:
cd app-instance/frontend
npm run lint
Expected: PASS. If this repo does not define lint, run:
cd app-instance/frontend
npm test -- --runInBand
Expected: PASS or existing unrelated failures documented in the final implementation report.
- Step 7: Commit
git add app-instance/backend/beaver/interfaces/web/app.py app-instance/frontend/types/index.ts app-instance/frontend/app/\(app\)/skills/page.tsx
git commit -m "feat(skills-ui): show replay eval coverage"
Task 12: Final Verification
Files:
-
No new source files.
-
Step 1: Run backend skill learning tests
Run:
cd app-instance/backend
pytest tests/unit/test_skill_learning_eval_report_model.py tests/unit/test_skill_learning_case_selection.py tests/unit/test_skill_learning_preservation.py tests/unit/test_skill_learning_replay.py tests/unit/test_skill_learning_surrogate.py tests/unit/test_skill_learning_eval.py tests/unit/test_skill_learning_pipeline.py tests/unit/test_skill_learning_worker.py tests/unit/test_skill_learning_web_api.py -v
Expected: PASS.
- Step 2: Run frontend verification
Run:
cd app-instance/frontend
npm run lint
Expected: PASS. If unavailable, run the closest existing frontend test command and record the exact command.
- Step 3: Inspect git status
Run:
git status --short
Expected: only intentional implementation changes are present; unrelated user changes remain untouched.
- Step 4: Commit any final fixes
If verification required fixes:
git add <changed-files>
git commit -m "fix(skill-learning): stabilize replay eval"
Expected: commit succeeds.