2164 lines
71 KiB
Markdown
2164 lines
71 KiB
Markdown
# Skill Replay Eval Implementation Plan
|
|
|
|
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
|
|
|
**Goal:** Replace heuristic-only skill draft evaluation with replay-style reports that cover all tools through either safe execution or LLM surrogate judgment, while preserving base skill content during draft synthesis.
|
|
|
|
**Architecture:** Extend the existing skill learning pipeline instead of replacing it. Add replay report fields to `SkillDraftEvalReport`, introduce focused helper modules under `beaver/skills/learning/`, then wire the enhanced evaluator through the existing `SkillLearningPipelineService.evaluate_draft()` path and Skills UI.
|
|
|
|
**Tech Stack:** Python dataclasses, existing file-backed stores, pytest, FastAPI response payloads, Next.js/TypeScript Skills page components.
|
|
|
|
---
|
|
|
|
## File Structure
|
|
|
|
Create focused backend modules:
|
|
|
|
- `app-instance/backend/beaver/skills/learning/case_selection.py` selects up to 10 historical replay cases.
|
|
- `app-instance/backend/beaver/skills/learning/preservation.py` builds base skill snapshots and checks draft preservation.
|
|
- `app-instance/backend/beaver/skills/learning/replay.py` defines replay tool policy, trace executor, arm reports, and the AgentLoop-backed replay runner.
|
|
- `app-instance/backend/beaver/skills/learning/surrogate.py` scores non-executed tool calls with an LLM-backed or deterministic fallback surrogate.
|
|
|
|
Modify existing backend modules:
|
|
|
|
- `app-instance/backend/beaver/memory/skills/models.py` extends `SkillDraftEvalReport` with replay fields while preserving old payload compatibility.
|
|
- `app-instance/backend/beaver/skills/learning/eval.py` orchestrates replay eval and keeps heuristic fallback.
|
|
- `app-instance/backend/beaver/skills/learning/synthesizer.py` includes base skill snapshots for revision and merge prompts.
|
|
- `app-instance/backend/beaver/skills/learning/service.py` passes `SkillSpecStore` context into synthesis helpers where needed.
|
|
- `app-instance/backend/beaver/skills/learning/pipeline.py` updates publish gate confidence checks.
|
|
- `app-instance/backend/beaver/engine/loop.py` accepts a replay executor override for isolated tool execution.
|
|
- `app-instance/backend/beaver/interfaces/web/app.py` injects `ReplayRunner(agent_loop=loop)` for draft evaluation requests.
|
|
|
|
Modify frontend:
|
|
|
|
- `app-instance/frontend/types/index.ts` adds replay report fields.
|
|
- `app-instance/frontend/app/(app)/skills/page.tsx` shows execution coverage, surrogate coverage, confidence, replay cases, and preservation details.
|
|
|
|
Add focused tests:
|
|
|
|
- `app-instance/backend/tests/unit/test_skill_learning_eval_report_model.py`
|
|
- `app-instance/backend/tests/unit/test_skill_learning_case_selection.py`
|
|
- `app-instance/backend/tests/unit/test_skill_learning_preservation.py`
|
|
- `app-instance/backend/tests/unit/test_skill_learning_replay.py`
|
|
- `app-instance/backend/tests/unit/test_skill_learning_replay_runner.py`
|
|
- `app-instance/backend/tests/unit/test_agent_loop_replay_executor.py`
|
|
- `app-instance/backend/tests/unit/test_skill_learning_surrogate.py`
|
|
- Extend `app-instance/backend/tests/unit/test_skill_learning_eval.py`
|
|
- Extend `app-instance/backend/tests/unit/test_skill_learning_pipeline.py`
|
|
|
|
---
|
|
|
|
### Task 1: Extend Eval Report Model Compatibly
|
|
|
|
**Files:**
|
|
- Modify: `app-instance/backend/beaver/memory/skills/models.py`
|
|
- Test: `app-instance/backend/tests/unit/test_skill_learning_eval_report_model.py`
|
|
|
|
- [ ] **Step 1: Write failing model compatibility tests**
|
|
|
|
Create `app-instance/backend/tests/unit/test_skill_learning_eval_report_model.py`:
|
|
|
|
```python
|
|
from __future__ import annotations
|
|
|
|
from beaver.memory.skills import SkillDraftEvalReport
|
|
|
|
|
|
def test_eval_report_defaults_preserve_legacy_payload_shape() -> None:
|
|
report = SkillDraftEvalReport(
|
|
report_id="eval-1",
|
|
skill_name="debug",
|
|
draft_id="draft-1",
|
|
candidate_id="candidate-1",
|
|
passed=True,
|
|
baseline_score_avg=0.5,
|
|
candidate_score_avg=0.8,
|
|
score_delta=0.3,
|
|
regression_count=0,
|
|
improved_count=2,
|
|
unchanged_count=0,
|
|
cases=[{"run_id": "run-1"}],
|
|
status="completed",
|
|
created_at="now",
|
|
)
|
|
|
|
payload = report.to_dict()
|
|
|
|
assert payload["eval_version"] == "heuristic-v1"
|
|
assert payload["mode"] == "heuristic"
|
|
assert payload["execution_coverage"] == 0.0
|
|
assert payload["surrogate_coverage"] == 0.0
|
|
assert payload["blocked_coverage"] == 0.0
|
|
assert payload["confidence"] == "low"
|
|
assert payload["case_reports"] == []
|
|
assert payload["tool_mode_summary"] == {}
|
|
assert payload["preservation_report"] is None
|
|
assert payload["cases"] == [{"run_id": "run-1"}]
|
|
|
|
|
|
def test_eval_report_reads_legacy_payload_without_replay_fields() -> None:
|
|
report = SkillDraftEvalReport.from_dict(
|
|
{
|
|
"report_id": "eval-legacy",
|
|
"skill_name": "debug",
|
|
"draft_id": "draft-1",
|
|
"candidate_id": "candidate-1",
|
|
"passed": True,
|
|
"baseline_score_avg": 0.4,
|
|
"candidate_score_avg": 0.8,
|
|
"score_delta": 0.4,
|
|
"regression_count": 0,
|
|
"improved_count": 1,
|
|
"unchanged_count": 0,
|
|
"cases": [{"run_id": "run-1"}],
|
|
"status": "completed",
|
|
"created_at": "now",
|
|
}
|
|
)
|
|
|
|
assert report.eval_version == "heuristic-v1"
|
|
assert report.mode == "heuristic"
|
|
assert report.confidence == "low"
|
|
assert report.case_reports == []
|
|
```
|
|
|
|
- [ ] **Step 2: Run the new tests to verify failure**
|
|
|
|
Run:
|
|
|
|
```bash
|
|
cd app-instance/backend
|
|
pytest tests/unit/test_skill_learning_eval_report_model.py -v
|
|
```
|
|
|
|
Expected: FAIL because `SkillDraftEvalReport` has no replay fields.
|
|
|
|
- [ ] **Step 3: Extend the dataclass**
|
|
|
|
In `app-instance/backend/beaver/memory/skills/models.py`, add fields to `SkillDraftEvalReport` after `created_at`:
|
|
|
|
```python
|
|
eval_version: str = "heuristic-v1"
|
|
mode: str = "heuristic"
|
|
execution_coverage: float = 0.0
|
|
surrogate_coverage: float = 0.0
|
|
blocked_coverage: float = 0.0
|
|
confidence: str = "low"
|
|
case_reports: list[dict[str, Any]] = field(default_factory=list)
|
|
tool_mode_summary: dict[str, Any] = field(default_factory=dict)
|
|
preservation_report: dict[str, Any] | None = None
|
|
```
|
|
|
|
Update `to_dict()` to include these fields:
|
|
|
|
```python
|
|
"eval_version": self.eval_version,
|
|
"mode": self.mode,
|
|
"execution_coverage": self.execution_coverage,
|
|
"surrogate_coverage": self.surrogate_coverage,
|
|
"blocked_coverage": self.blocked_coverage,
|
|
"confidence": self.confidence,
|
|
"case_reports": [dict(item) for item in self.case_reports],
|
|
"tool_mode_summary": dict(self.tool_mode_summary),
|
|
"preservation_report": dict(self.preservation_report) if self.preservation_report is not None else None,
|
|
```
|
|
|
|
Update `from_dict()` with bounded numeric parsing:
|
|
|
|
```python
|
|
eval_version=str(payload.get("eval_version") or "heuristic-v1"),
|
|
mode=str(payload.get("mode") or "heuristic"),
|
|
execution_coverage=_bounded_float(payload.get("execution_coverage"), default=0.0),
|
|
surrogate_coverage=_bounded_float(payload.get("surrogate_coverage"), default=0.0),
|
|
blocked_coverage=_bounded_float(payload.get("blocked_coverage"), default=0.0),
|
|
confidence=str(payload.get("confidence") or "low"),
|
|
case_reports=[dict(item) for item in payload.get("case_reports") or [] if isinstance(item, dict)],
|
|
tool_mode_summary=dict(payload.get("tool_mode_summary") or {}),
|
|
preservation_report=(
|
|
dict(payload["preservation_report"])
|
|
if isinstance(payload.get("preservation_report"), dict)
|
|
else None
|
|
),
|
|
```
|
|
|
|
Add helper near `_optional_float`:
|
|
|
|
```python
|
|
def _bounded_float(value: Any, *, default: float = 0.0) -> float:
|
|
if value in (None, ""):
|
|
return default
|
|
try:
|
|
return max(0.0, min(1.0, float(value)))
|
|
except (TypeError, ValueError):
|
|
return default
|
|
```
|
|
|
|
- [ ] **Step 4: Run model tests**
|
|
|
|
Run:
|
|
|
|
```bash
|
|
cd app-instance/backend
|
|
pytest tests/unit/test_skill_learning_eval_report_model.py -v
|
|
```
|
|
|
|
Expected: PASS.
|
|
|
|
- [ ] **Step 5: Run existing eval store test**
|
|
|
|
Run:
|
|
|
|
```bash
|
|
cd app-instance/backend
|
|
pytest tests/unit/test_skill_learning_candidate_state.py::test_reports_are_stored_and_retrieved -v
|
|
```
|
|
|
|
Expected: PASS.
|
|
|
|
- [ ] **Step 6: Commit**
|
|
|
|
```bash
|
|
git add app-instance/backend/beaver/memory/skills/models.py app-instance/backend/tests/unit/test_skill_learning_eval_report_model.py
|
|
git commit -m "feat(skill-learning): extend eval report payload"
|
|
```
|
|
|
|
---
|
|
|
|
### Task 2: Add Base Skill Preservation Helpers
|
|
|
|
**Files:**
|
|
- Create: `app-instance/backend/beaver/skills/learning/preservation.py`
|
|
- Modify: `app-instance/backend/beaver/skills/learning/__init__.py`
|
|
- Test: `app-instance/backend/tests/unit/test_skill_learning_preservation.py`
|
|
|
|
- [ ] **Step 1: Write failing preservation tests**
|
|
|
|
Create `app-instance/backend/tests/unit/test_skill_learning_preservation.py`:
|
|
|
|
```python
|
|
from __future__ import annotations
|
|
|
|
from beaver.skills.learning.preservation import check_preservation
|
|
|
|
|
|
def test_preservation_passes_when_base_sections_remain() -> None:
|
|
base = "# Skill\n\n## Workflow\n\n- Read first.\n\n## Safety\n\n- Do not delete files.\n"
|
|
draft = "# Skill\n\n## Workflow\n\n- Read first.\n- Then write.\n\n## Safety\n\n- Do not delete files.\n"
|
|
|
|
report = check_preservation(base_content=base, draft_content=draft)
|
|
|
|
assert report["passed"] is True
|
|
assert report["risk_level"] == "low"
|
|
assert "Workflow" in report["preserved_sections"]
|
|
assert "Safety" in report["preserved_sections"]
|
|
assert report["dropped_sections"] == []
|
|
|
|
|
|
def test_preservation_flags_dropped_section() -> None:
|
|
base = "# Skill\n\n## Workflow\n\n- Read first.\n\n## Safety\n\n- Do not delete files.\n"
|
|
draft = "# Skill\n\n## Workflow\n\n- Read first.\n"
|
|
|
|
report = check_preservation(base_content=base, draft_content=draft)
|
|
|
|
assert report["passed"] is False
|
|
assert report["risk_level"] == "high"
|
|
assert "Safety" in report["dropped_sections"]
|
|
```
|
|
|
|
- [ ] **Step 2: Run tests to verify failure**
|
|
|
|
Run:
|
|
|
|
```bash
|
|
cd app-instance/backend
|
|
pytest tests/unit/test_skill_learning_preservation.py -v
|
|
```
|
|
|
|
Expected: FAIL because `preservation.py` does not exist.
|
|
|
|
- [ ] **Step 3: Implement preservation helpers**
|
|
|
|
Create `app-instance/backend/beaver/skills/learning/preservation.py`:
|
|
|
|
```python
|
|
"""Preservation checks for skill revision drafts."""
|
|
|
|
from __future__ import annotations
|
|
|
|
import re
|
|
from typing import Any
|
|
|
|
|
|
def check_preservation(*, base_content: str, draft_content: str) -> dict[str, Any]:
|
|
base_sections = _sections(base_content)
|
|
draft_sections = _sections(draft_content)
|
|
preserved: list[str] = []
|
|
changed: list[str] = []
|
|
dropped: list[str] = []
|
|
|
|
for heading, body in base_sections.items():
|
|
draft_body = draft_sections.get(heading)
|
|
if draft_body is None:
|
|
dropped.append(heading)
|
|
continue
|
|
if _normalize(body) == _normalize(draft_body):
|
|
preserved.append(heading)
|
|
else:
|
|
changed.append(heading)
|
|
|
|
risk_level = "high" if dropped else ("medium" if changed else "low")
|
|
return {
|
|
"passed": not dropped,
|
|
"risk_level": risk_level,
|
|
"preserved_sections": preserved,
|
|
"changed_sections": changed,
|
|
"dropped_sections": dropped,
|
|
}
|
|
|
|
|
|
def _sections(content: str) -> dict[str, str]:
|
|
current = "body"
|
|
sections: dict[str, list[str]] = {current: []}
|
|
for line in (content or "").splitlines():
|
|
match = re.match(r"^#{1,6}\s+(.+?)\s*$", line)
|
|
if match:
|
|
current = match.group(1).strip()
|
|
sections.setdefault(current, [])
|
|
continue
|
|
sections.setdefault(current, []).append(line)
|
|
return {heading: "\n".join(lines).strip() for heading, lines in sections.items() if "\n".join(lines).strip()}
|
|
|
|
|
|
def _normalize(value: str) -> str:
|
|
return re.sub(r"\s+", " ", value or "").strip().lower()
|
|
```
|
|
|
|
Export helpers from `app-instance/backend/beaver/skills/learning/__init__.py`:
|
|
|
|
```python
|
|
from .preservation import check_preservation
|
|
```
|
|
|
|
Add `"check_preservation"` to `__all__`.
|
|
|
|
- [ ] **Step 4: Run preservation tests**
|
|
|
|
Run:
|
|
|
|
```bash
|
|
cd app-instance/backend
|
|
pytest tests/unit/test_skill_learning_preservation.py -v
|
|
```
|
|
|
|
Expected: PASS.
|
|
|
|
- [ ] **Step 5: Commit**
|
|
|
|
```bash
|
|
git add app-instance/backend/beaver/skills/learning/preservation.py app-instance/backend/beaver/skills/learning/__init__.py app-instance/backend/tests/unit/test_skill_learning_preservation.py
|
|
git commit -m "feat(skill-learning): add draft preservation checks"
|
|
```
|
|
|
|
---
|
|
|
|
### Task 3: Include Base Skill Snapshots During Synthesis
|
|
|
|
**Files:**
|
|
- Modify: `app-instance/backend/beaver/skills/learning/synthesizer.py`
|
|
- Modify: `app-instance/backend/beaver/skills/learning/service.py`
|
|
- Test: `app-instance/backend/tests/unit/test_skill_learning_synthesizer_preservation.py`
|
|
|
|
- [ ] **Step 1: Write failing prompt test**
|
|
|
|
Create `app-instance/backend/tests/unit/test_skill_learning_synthesizer_preservation.py`:
|
|
|
|
```python
|
|
from __future__ import annotations
|
|
|
|
from beaver.memory.skills import SkillLearningCandidate
|
|
from beaver.skills.learning.evidence import EvidencePacket
|
|
from beaver.skills.learning.synthesizer import SkillDraftSynthesizer
|
|
|
|
|
|
def test_revision_prompt_includes_base_skill_snapshot() -> None:
|
|
candidate = SkillLearningCandidate(
|
|
candidate_id="candidate-1",
|
|
kind="revise_skill",
|
|
source_run_ids=["run-1"],
|
|
source_session_ids=["session-1"],
|
|
related_skill_names=["debug-skill"],
|
|
reason="Improve debugging flow.",
|
|
)
|
|
packet = EvidencePacket(
|
|
run_ids=["run-1"],
|
|
session_ids=["session-1"],
|
|
task_summaries=["debug a failing test"],
|
|
session_excerpts=["assistant: fixed it"],
|
|
)
|
|
prompt = SkillDraftSynthesizer._build_prompt(
|
|
candidate,
|
|
packet,
|
|
"revise",
|
|
base_skill={
|
|
"skill_name": "debug-skill",
|
|
"version": "v0001",
|
|
"frontmatter": {"description": "Debug tests", "tools": ["read_file"]},
|
|
"content": "# Debug Skill\n\n## Safety\n\nDo not delete files.",
|
|
"summary": "Debug tests safely.",
|
|
"tool_hints": ["read_file"],
|
|
},
|
|
)
|
|
|
|
assert "Base skill snapshot" in prompt
|
|
assert "# Debug Skill" in prompt
|
|
assert "Do not delete files." in prompt
|
|
assert "preserved_sections" in prompt
|
|
assert "dropped_sections" in prompt
|
|
```
|
|
|
|
- [ ] **Step 2: Run test to verify failure**
|
|
|
|
Run:
|
|
|
|
```bash
|
|
cd app-instance/backend
|
|
pytest tests/unit/test_skill_learning_synthesizer_preservation.py -v
|
|
```
|
|
|
|
Expected: FAIL because `_build_prompt()` does not accept `base_skill`.
|
|
|
|
- [ ] **Step 3: Update synthesizer signatures**
|
|
|
|
In `app-instance/backend/beaver/skills/learning/synthesizer.py`, change these signatures:
|
|
|
|
```python
|
|
async def synthesize_revision(
|
|
self,
|
|
candidate: SkillLearningCandidate,
|
|
evidence_packet: EvidencePacket,
|
|
provider: LLMProvider,
|
|
model: str,
|
|
base_skill: dict[str, Any] | None = None,
|
|
) -> dict[str, Any]:
|
|
return await self._synthesize(candidate, evidence_packet, provider, model, "revise", base_skill=base_skill)
|
|
```
|
|
|
|
Apply the same optional `base_skill` parameter to `synthesize_merge()` and `_synthesize()`. Keep `synthesize_new_skill()` unchanged except for calling `_synthesize(..., base_skill=None)`.
|
|
|
|
- [ ] **Step 4: Update prompt builder**
|
|
|
|
Update `_build_prompt()` signature:
|
|
|
|
```python
|
|
def _build_prompt(
|
|
candidate: SkillLearningCandidate,
|
|
evidence_packet: EvidencePacket,
|
|
action: str,
|
|
base_skill: dict[str, Any] | None = None,
|
|
) -> str:
|
|
```
|
|
|
|
Before the return, build a base section:
|
|
|
|
```python
|
|
base_section = ""
|
|
if base_skill:
|
|
base_section = (
|
|
"\n\nBase skill snapshot:\n"
|
|
f"- skill_name: {base_skill.get('skill_name')}\n"
|
|
f"- version: {base_skill.get('version')}\n"
|
|
f"- frontmatter: {json.dumps(base_skill.get('frontmatter') or {}, ensure_ascii=False, sort_keys=True)}\n"
|
|
f"- tool_hints: {base_skill.get('tool_hints') or []}\n"
|
|
f"- summary: {base_skill.get('summary') or ''}\n"
|
|
"Base skill content:\n"
|
|
f"{base_skill.get('content') or ''}\n"
|
|
"Preserve existing instructions unless the evidence requires a change. "
|
|
"If any section is changed or dropped, explain it in changed_sections or dropped_sections."
|
|
)
|
|
```
|
|
|
|
Append `base_section` before the JSON instructions. Extend the JSON instruction:
|
|
|
|
```python
|
|
+ "\nThe JSON may include preserved_sections, changed_sections, and dropped_sections arrays."
|
|
```
|
|
|
|
- [ ] **Step 5: Preserve parsed metadata from model output**
|
|
|
|
In `_parse_payload()`, include optional arrays:
|
|
|
|
```python
|
|
"preserved_sections": _coerce_string_list(payload.get("preserved_sections")),
|
|
"changed_sections": _coerce_string_list(payload.get("changed_sections")),
|
|
"dropped_sections": _coerce_string_list(payload.get("dropped_sections")),
|
|
```
|
|
|
|
In `_normalize_payload()`, return those fields as well.
|
|
|
|
- [ ] **Step 6: Add base skill snapshot lookup in service**
|
|
|
|
In `app-instance/backend/beaver/skills/learning/service.py`, import `SkillSpecStore` only if needed by type checking is not enough. The service already has `draft_service.store`; use it to read published skills.
|
|
|
|
Add helper:
|
|
|
|
```python
|
|
def _base_skill_snapshot(self, skill_name: str, version: str | None) -> dict[str, Any] | None:
|
|
loaded = self.draft_service.store.read_published_skill(skill_name, version)
|
|
if loaded is None:
|
|
return None
|
|
return {
|
|
"skill_name": loaded.version.skill_name,
|
|
"version": loaded.version.version,
|
|
"frontmatter": dict(loaded.version.frontmatter),
|
|
"content": loaded.content,
|
|
"summary": loaded.version.summary,
|
|
"tool_hints": list(loaded.version.tool_hints),
|
|
}
|
|
```
|
|
|
|
Pass this snapshot to `synthesize_revision()` and `synthesize_merge()`.
|
|
|
|
- [ ] **Step 7: Run synthesizer test**
|
|
|
|
Run:
|
|
|
|
```bash
|
|
cd app-instance/backend
|
|
pytest tests/unit/test_skill_learning_synthesizer_preservation.py -v
|
|
```
|
|
|
|
Expected: PASS.
|
|
|
|
- [ ] **Step 8: Run existing learning tests**
|
|
|
|
Run:
|
|
|
|
```bash
|
|
cd app-instance/backend
|
|
pytest tests/unit/test_skill_learning_eval.py tests/unit/test_skill_learning_worker.py -v
|
|
```
|
|
|
|
Expected: PASS.
|
|
|
|
- [ ] **Step 9: Commit**
|
|
|
|
```bash
|
|
git add app-instance/backend/beaver/skills/learning/synthesizer.py app-instance/backend/beaver/skills/learning/service.py app-instance/backend/tests/unit/test_skill_learning_synthesizer_preservation.py
|
|
git commit -m "feat(skill-learning): preserve base skill during synthesis"
|
|
```
|
|
|
|
---
|
|
|
|
### Task 4: Add Historical Replay Case Selection
|
|
|
|
**Files:**
|
|
- Create: `app-instance/backend/beaver/skills/learning/case_selection.py`
|
|
- Modify: `app-instance/backend/beaver/skills/learning/__init__.py`
|
|
- Test: `app-instance/backend/tests/unit/test_skill_learning_case_selection.py`
|
|
|
|
- [ ] **Step 1: Write failing case selection tests**
|
|
|
|
Create `app-instance/backend/tests/unit/test_skill_learning_case_selection.py`:
|
|
|
|
```python
|
|
from __future__ import annotations
|
|
|
|
from beaver.memory.runs import RunRecord
|
|
from beaver.memory.skills import SkillLearningCandidate
|
|
from beaver.skills.learning.case_selection import select_replay_cases
|
|
from beaver.skills.specs import SkillActivationReceipt
|
|
|
|
|
|
def _run(
|
|
run_id: str,
|
|
*,
|
|
task_id: str = "task",
|
|
session_id: str = "session",
|
|
task_text: str = "debug task",
|
|
skill_name: str | None = None,
|
|
skill_version: str = "v0001",
|
|
) -> RunRecord:
|
|
receipts = []
|
|
if skill_name:
|
|
receipts.append(
|
|
SkillActivationReceipt(
|
|
run_id=run_id,
|
|
session_id=session_id,
|
|
skill_name=skill_name,
|
|
skill_version=skill_version,
|
|
content_hash="hash",
|
|
activated_at="now",
|
|
activation_reason="selected",
|
|
)
|
|
)
|
|
return RunRecord(
|
|
run_id=run_id,
|
|
session_id=session_id,
|
|
task_id=task_id,
|
|
attempt_index=1,
|
|
task_text=task_text,
|
|
started_at=f"2026-06-08T00:00:{run_id[-2:]}+00:00",
|
|
ended_at="end",
|
|
success=True,
|
|
finish_reason="stop",
|
|
feedback={"acceptance_type": "accept"},
|
|
activated_skills=receipts,
|
|
)
|
|
|
|
|
|
def test_select_revise_cases_caps_at_ten_and_prefers_related_skill() -> None:
|
|
runs = [_run(f"run-{index:02d}", task_id=f"task-{index}", skill_name="debug", skill_version="v0001") for index in range(12)]
|
|
candidate = SkillLearningCandidate(
|
|
candidate_id="candidate-1",
|
|
kind="revise_skill",
|
|
source_run_ids=[],
|
|
source_session_ids=[],
|
|
related_skill_names=["debug"],
|
|
reason="revise",
|
|
evidence={"skill_version": "v0001"},
|
|
)
|
|
|
|
cases = select_replay_cases(candidate, runs)
|
|
|
|
assert len(cases) == 10
|
|
assert all(case["baseline_skill_names"] == ["debug"] for case in cases)
|
|
assert cases[0]["run_id"] == "run-11"
|
|
|
|
|
|
def test_select_new_skill_uses_all_available_source_runs_under_ten() -> None:
|
|
runs = [_run(f"run-{index:02d}", task_id=f"task-{index}") for index in range(3)]
|
|
candidate = SkillLearningCandidate(
|
|
candidate_id="candidate-1",
|
|
kind="new_skill",
|
|
source_run_ids=["run-00", "run-01", "run-02"],
|
|
source_session_ids=["session"],
|
|
related_skill_names=[],
|
|
reason="new",
|
|
)
|
|
|
|
cases = select_replay_cases(candidate, runs)
|
|
|
|
assert [case["run_id"] for case in cases] == ["run-02", "run-01", "run-00"]
|
|
assert all(case["baseline_skill_names"] == [] for case in cases)
|
|
```
|
|
|
|
- [ ] **Step 2: Run tests to verify failure**
|
|
|
|
Run:
|
|
|
|
```bash
|
|
cd app-instance/backend
|
|
pytest tests/unit/test_skill_learning_case_selection.py -v
|
|
```
|
|
|
|
Expected: FAIL because `case_selection.py` does not exist.
|
|
|
|
- [ ] **Step 3: Implement selector**
|
|
|
|
Create `app-instance/backend/beaver/skills/learning/case_selection.py`:
|
|
|
|
```python
|
|
"""Historical replay case selection for skill draft evaluation."""
|
|
|
|
from __future__ import annotations
|
|
|
|
from typing import Any
|
|
|
|
from beaver.memory.runs import RunRecord
|
|
from beaver.memory.skills import SkillLearningCandidate
|
|
|
|
MAX_REPLAY_CASES = 10
|
|
|
|
|
|
def select_replay_cases(candidate: SkillLearningCandidate, runs: list[RunRecord]) -> list[dict[str, Any]]:
|
|
accepted = [record for record in runs if _is_accepted(record)]
|
|
if candidate.kind == "revise_skill":
|
|
selected = _select_revise(candidate, accepted)
|
|
elif candidate.kind == "merge_skills":
|
|
selected = _select_merge(candidate, accepted)
|
|
else:
|
|
selected = _select_new(candidate, accepted)
|
|
return [_case_payload(candidate, record) for record in selected[:MAX_REPLAY_CASES]]
|
|
|
|
|
|
def _select_revise(candidate: SkillLearningCandidate, runs: list[RunRecord]) -> list[RunRecord]:
|
|
target = candidate.related_skill_names[0] if candidate.related_skill_names else ""
|
|
version = str(candidate.evidence.get("skill_version") or "")
|
|
matches = [
|
|
record for record in runs
|
|
if any(receipt.skill_name == target and (not version or receipt.skill_version == version) for receipt in record.activated_skills)
|
|
]
|
|
return _recent_diverse(matches)
|
|
|
|
|
|
def _select_merge(candidate: SkillLearningCandidate, runs: list[RunRecord]) -> list[RunRecord]:
|
|
targets = set(candidate.related_skill_names)
|
|
matches = [
|
|
record for record in runs
|
|
if targets and targets.issubset({receipt.skill_name for receipt in record.activated_skills})
|
|
]
|
|
return _recent_diverse(matches)
|
|
|
|
|
|
def _select_new(candidate: SkillLearningCandidate, runs: list[RunRecord]) -> list[RunRecord]:
|
|
source_ids = set(candidate.source_run_ids)
|
|
if source_ids:
|
|
matches = [record for record in runs if record.run_id in source_ids]
|
|
else:
|
|
theme = str(candidate.evidence.get("theme") or "").lower().strip()
|
|
matches = [record for record in runs if theme and theme in record.task_text.lower()]
|
|
return _recent_diverse(matches)
|
|
|
|
|
|
def _case_payload(candidate: SkillLearningCandidate, record: RunRecord) -> dict[str, Any]:
|
|
baseline_skill_names = []
|
|
if candidate.kind == "revise_skill":
|
|
baseline_skill_names = list(candidate.related_skill_names[:1])
|
|
elif candidate.kind == "merge_skills":
|
|
baseline_skill_names = list(candidate.related_skill_names)
|
|
return {
|
|
"run_id": record.run_id,
|
|
"task_id": record.task_id,
|
|
"session_id": record.session_id,
|
|
"task_text": record.task_text,
|
|
"baseline_skill_names": baseline_skill_names,
|
|
"candidate_skill_name": candidate.draft_skill_name,
|
|
"accepted_score": _score(record),
|
|
}
|
|
|
|
|
|
def _recent_diverse(runs: list[RunRecord]) -> list[RunRecord]:
|
|
sorted_runs = sorted(runs, key=lambda item: (item.started_at, item.run_id), reverse=True)
|
|
result: list[RunRecord] = []
|
|
seen_tasks: set[str] = set()
|
|
for record in sorted_runs:
|
|
task_key = record.task_id or record.task_text
|
|
if task_key in seen_tasks and len(sorted_runs) > MAX_REPLAY_CASES:
|
|
continue
|
|
seen_tasks.add(task_key)
|
|
result.append(record)
|
|
if len(result) >= MAX_REPLAY_CASES:
|
|
break
|
|
if len(result) < min(len(sorted_runs), MAX_REPLAY_CASES):
|
|
seen_run_ids = {record.run_id for record in result}
|
|
result.extend(record for record in sorted_runs if record.run_id not in seen_run_ids)
|
|
return result[:MAX_REPLAY_CASES]
|
|
|
|
|
|
def _is_accepted(record: RunRecord) -> bool:
|
|
feedback = record.feedback or {}
|
|
acceptance = feedback.get("acceptance_type")
|
|
if acceptance is None and feedback.get("feedback_type") == "satisfied":
|
|
acceptance = "accept"
|
|
return bool(record.success) and acceptance == "accept"
|
|
|
|
|
|
def _score(record: RunRecord) -> float:
|
|
validation = record.validation_result or {}
|
|
value = validation.get("score") if isinstance(validation, dict) else None
|
|
if value is not None:
|
|
try:
|
|
return max(0.0, min(1.0, float(value)))
|
|
except (TypeError, ValueError):
|
|
pass
|
|
return 0.8 if record.success else 0.4
|
|
```
|
|
|
|
Export `select_replay_cases` from `app-instance/backend/beaver/skills/learning/__init__.py`.
|
|
|
|
- [ ] **Step 4: Run selector tests**
|
|
|
|
Run:
|
|
|
|
```bash
|
|
cd app-instance/backend
|
|
pytest tests/unit/test_skill_learning_case_selection.py -v
|
|
```
|
|
|
|
Expected: PASS.
|
|
|
|
- [ ] **Step 5: Commit**
|
|
|
|
```bash
|
|
git add app-instance/backend/beaver/skills/learning/case_selection.py app-instance/backend/beaver/skills/learning/__init__.py app-instance/backend/tests/unit/test_skill_learning_case_selection.py
|
|
git commit -m "feat(skill-learning): select replay eval cases"
|
|
```
|
|
|
|
---
|
|
|
|
### Task 5: Add Replay Tool Policy And Trace Executor
|
|
|
|
**Files:**
|
|
- Create: `app-instance/backend/beaver/skills/learning/replay.py`
|
|
- Modify: `app-instance/backend/beaver/skills/learning/__init__.py`
|
|
- Test: `app-instance/backend/tests/unit/test_skill_learning_replay.py`
|
|
|
|
- [ ] **Step 1: Write failing replay policy tests**
|
|
|
|
Create `app-instance/backend/tests/unit/test_skill_learning_replay.py`:
|
|
|
|
```python
|
|
from __future__ import annotations
|
|
|
|
import asyncio
|
|
|
|
from beaver.tools.base import BaseTool, ToolContext, ToolResult, ToolSpec
|
|
from beaver.tools.registry.tool_registry import ToolRegistry
|
|
from beaver.tools.runtime.executor import ToolExecutor
|
|
from beaver.skills.learning.replay import ReplayToolExecutor, ReplayToolPolicy, classify_tool_mode
|
|
|
|
|
|
class FakeTool(BaseTool):
|
|
def __init__(self, name: str, *, toolset: str = "filesystem", metadata: dict | None = None) -> None:
|
|
self._spec = ToolSpec(
|
|
name=name,
|
|
description=f"{name} tool",
|
|
input_schema={"type": "object", "properties": {"path": {"type": "string"}}},
|
|
toolset=toolset,
|
|
metadata=metadata or {},
|
|
)
|
|
|
|
@property
|
|
def spec(self) -> ToolSpec:
|
|
return self._spec
|
|
|
|
async def invoke(self, arguments: dict, context: ToolContext) -> ToolResult:
|
|
return ToolResult(success=True, content=f"executed:{arguments}", tool_name=self.spec.name)
|
|
|
|
|
|
def _executor(*tools: FakeTool) -> ReplayToolExecutor:
|
|
registry = ToolRegistry()
|
|
for tool in tools:
|
|
registry.register(tool)
|
|
return ReplayToolExecutor(ToolExecutor(registry), registry=registry, policy=ReplayToolPolicy())
|
|
|
|
|
|
def test_classify_tool_modes_from_spec() -> None:
|
|
assert classify_tool_mode(FakeTool("read_file").spec) == "executed"
|
|
assert classify_tool_mode(FakeTool("write_file").spec) == "executed"
|
|
assert classify_tool_mode(FakeTool("mcp_outlook_send_email", toolset="mcp", metadata={"transport": "mcp"}).spec) == "surrogate"
|
|
assert classify_tool_mode(FakeTool("delete_account", toolset="mcp", metadata={"transport": "mcp"}).spec) == "blocked"
|
|
|
|
|
|
def test_replay_executor_executes_safe_tool_and_records_trace() -> None:
|
|
executor = _executor(FakeTool("write_file"))
|
|
|
|
result = asyncio.run(executor.execute("write_file", {"path": "a.txt"}, context=ToolContext(workspace="/tmp/replay")))
|
|
|
|
assert result.success is True
|
|
assert result.content.startswith("executed:")
|
|
assert executor.traces[0]["mode"] == "executed"
|
|
assert executor.traces[0]["tool_name"] == "write_file"
|
|
|
|
|
|
def test_replay_executor_surrogates_external_write_and_blocks_destructive() -> None:
|
|
executor = _executor(
|
|
FakeTool("mcp_outlook_send_email", toolset="mcp", metadata={"transport": "mcp"}),
|
|
FakeTool("delete_account", toolset="mcp", metadata={"transport": "mcp"}),
|
|
)
|
|
|
|
send = asyncio.run(executor.execute("mcp_outlook_send_email", {"to": "ada@example.com"}, context=ToolContext()))
|
|
delete = asyncio.run(executor.execute("delete_account", {"id": "1"}, context=ToolContext()))
|
|
|
|
assert send.success is True
|
|
assert send.error == "replay_surrogate"
|
|
assert delete.success is False
|
|
assert delete.error == "replay_blocked"
|
|
assert [trace["mode"] for trace in executor.traces] == ["surrogate", "blocked"]
|
|
```
|
|
|
|
- [ ] **Step 2: Run tests to verify failure**
|
|
|
|
Run:
|
|
|
|
```bash
|
|
cd app-instance/backend
|
|
pytest tests/unit/test_skill_learning_replay.py -v
|
|
```
|
|
|
|
Expected: FAIL because `replay.py` does not exist.
|
|
|
|
- [ ] **Step 3: Implement replay policy and executor**
|
|
|
|
Create `app-instance/backend/beaver/skills/learning/replay.py`:
|
|
|
|
```python
|
|
"""Replay execution helpers for skill draft evaluation."""
|
|
|
|
from __future__ import annotations
|
|
|
|
from dataclasses import dataclass, field
|
|
from typing import Any, Literal
|
|
from uuid import uuid4
|
|
|
|
from beaver.tools.base import ToolContext, ToolResult, ToolSpec
|
|
from beaver.tools.registry.tool_registry import ToolRegistry
|
|
from beaver.tools.runtime.executor import ToolExecutor
|
|
|
|
ToolExecutionMode = Literal["executed", "surrogate", "blocked"]
|
|
|
|
|
|
@dataclass(slots=True)
|
|
class ReplayToolPolicy:
|
|
safe_toolsets: set[str] = field(default_factory=lambda: {"filesystem", "user_files", "core", "web", "search"})
|
|
surrogate_transports: set[str] = field(default_factory=lambda: {"mcp", "connector"})
|
|
destructive_terms: tuple[str, ...] = (
|
|
"delete",
|
|
"remove",
|
|
"destroy",
|
|
"revoke",
|
|
"permission",
|
|
"credential",
|
|
"payment",
|
|
"pay",
|
|
)
|
|
external_write_terms: tuple[str, ...] = (
|
|
"send",
|
|
"post",
|
|
"publish",
|
|
"create",
|
|
"update",
|
|
"invite",
|
|
"reply",
|
|
"forward",
|
|
)
|
|
|
|
|
|
class ReplayToolExecutor:
|
|
def __init__(
|
|
self,
|
|
inner: ToolExecutor,
|
|
*,
|
|
registry: ToolRegistry,
|
|
policy: ReplayToolPolicy | None = None,
|
|
) -> None:
|
|
self.inner = inner
|
|
self.registry = registry
|
|
self.policy = policy or ReplayToolPolicy()
|
|
self.traces: list[dict[str, Any]] = []
|
|
|
|
async def execute(
|
|
self,
|
|
tool_name: str,
|
|
arguments: dict[str, Any] | None,
|
|
*,
|
|
context: ToolContext | None = None,
|
|
) -> ToolResult:
|
|
tool = self.registry.get(tool_name)
|
|
spec = tool.spec if tool is not None else ToolSpec(
|
|
name=tool_name,
|
|
description="unregistered tool",
|
|
input_schema={"type": "object", "properties": {}},
|
|
toolset="unknown",
|
|
)
|
|
mode = classify_tool_mode(spec, self.policy)
|
|
trace = {
|
|
"trace_id": uuid4().hex,
|
|
"tool_name": tool_name,
|
|
"mode": mode,
|
|
"arguments": dict(arguments or {}),
|
|
"schema": dict(spec.input_schema),
|
|
"toolset": spec.toolset,
|
|
"metadata": dict(spec.metadata),
|
|
"classification_reason": _classification_reason(spec, mode),
|
|
}
|
|
if mode == "executed":
|
|
result = await self.inner.execute(tool_name, arguments or {}, context=context)
|
|
trace["result"] = {"success": result.success, "error": result.error, "content": result.content[:2000]}
|
|
self.traces.append(trace)
|
|
return result
|
|
if mode == "surrogate":
|
|
trace["result"] = {"success": True, "error": "replay_surrogate", "content": "Tool call recorded for surrogate evaluation."}
|
|
self.traces.append(trace)
|
|
return ToolResult(
|
|
success=True,
|
|
content="Tool call recorded for surrogate evaluation.",
|
|
tool_name=tool_name,
|
|
error="replay_surrogate",
|
|
raw_output=trace,
|
|
)
|
|
trace["result"] = {"success": False, "error": "replay_blocked", "content": "Tool call blocked by replay policy."}
|
|
self.traces.append(trace)
|
|
return ToolResult(
|
|
success=False,
|
|
content="Tool call blocked by replay policy.",
|
|
tool_name=tool_name,
|
|
error="replay_blocked",
|
|
raw_output=trace,
|
|
)
|
|
|
|
async def execute_tool_call(self, tool_call: Any, *, context: ToolContext | None = None) -> ToolResult:
|
|
tool_name, arguments = ToolExecutor._normalize_tool_call(tool_call)
|
|
return await self.execute(tool_name, arguments, context=context)
|
|
|
|
|
|
def classify_tool_mode(spec: ToolSpec, policy: ReplayToolPolicy | None = None) -> ToolExecutionMode:
|
|
policy = policy or ReplayToolPolicy()
|
|
name = spec.name.lower()
|
|
toolset = spec.toolset.lower()
|
|
metadata = {str(key).lower(): str(value).lower() for key, value in spec.metadata.items()}
|
|
if any(term in name for term in policy.destructive_terms):
|
|
return "blocked"
|
|
if toolset in policy.safe_toolsets:
|
|
return "executed"
|
|
if metadata.get("transport") in policy.surrogate_transports or toolset in {"mcp", "connector", "external"}:
|
|
if any(term in name for term in policy.external_write_terms):
|
|
return "surrogate"
|
|
return "executed"
|
|
return "surrogate"
|
|
|
|
|
|
def _classification_reason(spec: ToolSpec, mode: ToolExecutionMode) -> str:
|
|
return f"{spec.name} classified as {mode} from toolset={spec.toolset} metadata={spec.metadata}"
|
|
```
|
|
|
|
Export `ReplayToolExecutor`, `ReplayToolPolicy`, and `classify_tool_mode` from `app-instance/backend/beaver/skills/learning/__init__.py`.
|
|
|
|
- [ ] **Step 4: Run replay policy tests**
|
|
|
|
Run:
|
|
|
|
```bash
|
|
cd app-instance/backend
|
|
pytest tests/unit/test_skill_learning_replay.py -v
|
|
```
|
|
|
|
Expected: PASS.
|
|
|
|
- [ ] **Step 5: Commit**
|
|
|
|
```bash
|
|
git add app-instance/backend/beaver/skills/learning/replay.py app-instance/backend/beaver/skills/learning/__init__.py app-instance/backend/tests/unit/test_skill_learning_replay.py
|
|
git commit -m "feat(skill-learning): add replay tool policy"
|
|
```
|
|
|
|
---
|
|
|
|
### Task 6: Add AgentLoop Replay Executor Injection
|
|
|
|
**Files:**
|
|
- Modify: `app-instance/backend/beaver/engine/loop.py`
|
|
- Test: `app-instance/backend/tests/unit/test_agent_loop_replay_executor.py`
|
|
|
|
- [ ] **Step 1: Write failing replay injection test**
|
|
|
|
Create `app-instance/backend/tests/unit/test_agent_loop_replay_executor.py`:
|
|
|
|
```python
|
|
from __future__ import annotations
|
|
|
|
from pathlib import Path
|
|
from types import SimpleNamespace
|
|
|
|
import pytest
|
|
|
|
from beaver.engine.loop import AgentLoop
|
|
from beaver.engine.providers.base import LLMProvider, LLMResponse, ToolCallRequest
|
|
from beaver.engine.providers.factory import ProviderBundle
|
|
from beaver.skills.learning.replay import ReplayToolExecutor, ReplayToolPolicy
|
|
|
|
|
|
class ToolCallingProvider(LLMProvider):
|
|
def __init__(self) -> None:
|
|
self.calls = 0
|
|
|
|
async def chat(self, messages: list[dict], tools: list[dict] | None = None, model: str | None = None, max_tokens: int = 4096, temperature: float = 0.7) -> LLMResponse:
|
|
self.calls += 1
|
|
if self.calls == 1:
|
|
return LLMResponse(
|
|
content="",
|
|
tool_calls=[
|
|
ToolCallRequest(
|
|
id="call-1",
|
|
name="read_file",
|
|
arguments={"path": "README.md"},
|
|
)
|
|
],
|
|
)
|
|
return LLMResponse(content="done")
|
|
|
|
def get_default_model(self) -> str:
|
|
return "stub"
|
|
|
|
|
|
@pytest.mark.asyncio
|
|
async def test_process_direct_uses_replay_tool_executor(tmp_path: Path) -> None:
|
|
loop = AgentLoop(workspace=tmp_path)
|
|
loaded = loop.boot()
|
|
provider = ToolCallingProvider()
|
|
runtime = SimpleNamespace(model="stub", provider_name="stub")
|
|
replay_executor = ReplayToolExecutor(
|
|
loaded.tool_executor,
|
|
registry=loaded.tool_registry,
|
|
policy=ReplayToolPolicy(),
|
|
)
|
|
|
|
result = await loop.process_direct(
|
|
"Read the README.",
|
|
provider_bundle=ProviderBundle(main_runtime=runtime, main_provider=provider), # type: ignore[arg-type]
|
|
include_skill_assembly=False,
|
|
pinned_skill_names=[],
|
|
tool_executor_override=replay_executor,
|
|
max_tool_iterations=2,
|
|
source="skill_replay_eval",
|
|
)
|
|
|
|
assert result.output_text == "done"
|
|
assert replay_executor.traces
|
|
assert replay_executor.traces[0]["tool_name"] == "read_file"
|
|
```
|
|
|
|
- [ ] **Step 2: Run test to verify failure**
|
|
|
|
Run:
|
|
|
|
```bash
|
|
cd app-instance/backend
|
|
pytest tests/unit/test_agent_loop_replay_executor.py -v
|
|
```
|
|
|
|
Expected: FAIL because `process_direct()` does not accept `tool_executor_override`.
|
|
|
|
- [ ] **Step 3: Add optional executor override to AgentLoop**
|
|
|
|
In `app-instance/backend/beaver/engine/loop.py`, add `tool_executor_override: Any = None` to `process_direct()` and `_process_direct_impl()` keyword parameters. Pass the argument from `process_direct()` into `_process_direct_impl()`.
|
|
|
|
After `tool_executor = self._require_loaded("tool_executor")`, add:
|
|
|
|
```python
|
|
effective_tool_executor = tool_executor_override or tool_executor
|
|
```
|
|
|
|
Replace this line inside the tool loop:
|
|
|
|
```python
|
|
result = await tool_executor.execute_tool_call(tool_call, context=tool_context)
|
|
```
|
|
|
|
with:
|
|
|
|
```python
|
|
result = await effective_tool_executor.execute_tool_call(tool_call, context=tool_context)
|
|
```
|
|
|
|
- [ ] **Step 4: Run replay injection test**
|
|
|
|
Run:
|
|
|
|
```bash
|
|
cd app-instance/backend
|
|
pytest tests/unit/test_agent_loop_replay_executor.py -v
|
|
```
|
|
|
|
Expected: PASS.
|
|
|
|
- [ ] **Step 5: Run a direct-loop regression test**
|
|
|
|
Run:
|
|
|
|
```bash
|
|
cd app-instance/backend
|
|
pytest tests/unit/test_skill_learning_eval.py::test_eval_provider_unavailable_is_skipped_not_failed -v
|
|
```
|
|
|
|
Expected: PASS.
|
|
|
|
- [ ] **Step 6: Commit**
|
|
|
|
```bash
|
|
git add app-instance/backend/beaver/engine/loop.py app-instance/backend/tests/unit/test_agent_loop_replay_executor.py
|
|
git commit -m "feat(engine): allow replay tool executor injection"
|
|
```
|
|
|
|
---
|
|
|
|
### Task 7: Add Surrogate Evaluator
|
|
|
|
**Files:**
|
|
- Create: `app-instance/backend/beaver/skills/learning/surrogate.py`
|
|
- Modify: `app-instance/backend/beaver/skills/learning/__init__.py`
|
|
- Test: `app-instance/backend/tests/unit/test_skill_learning_surrogate.py`
|
|
|
|
- [ ] **Step 1: Write failing surrogate tests**
|
|
|
|
Create `app-instance/backend/tests/unit/test_skill_learning_surrogate.py`:
|
|
|
|
```python
|
|
from __future__ import annotations
|
|
|
|
import asyncio
|
|
|
|
from beaver.skills.learning.surrogate import SurrogateToolEvaluator
|
|
|
|
|
|
def test_surrogate_scores_complete_candidate_higher_than_missing_baseline() -> None:
|
|
evaluator = SurrogateToolEvaluator()
|
|
baseline = {
|
|
"arm": "baseline",
|
|
"tool_calls": [
|
|
{"tool_name": "mcp_outlook_send_email", "mode": "surrogate", "arguments": {"to": "", "subject": ""}},
|
|
],
|
|
}
|
|
candidate = {
|
|
"arm": "candidate",
|
|
"tool_calls": [
|
|
{
|
|
"tool_name": "mcp_outlook_send_email",
|
|
"mode": "surrogate",
|
|
"arguments": {"to": "ada@example.com", "subject": "Status", "body": "Done"},
|
|
},
|
|
],
|
|
}
|
|
|
|
result = asyncio.run(evaluator.evaluate(task_text="Send a status email to Ada.", baseline=baseline, candidate=candidate))
|
|
|
|
assert result["candidate_score"] > result["baseline_score"]
|
|
assert result["surrogate_tool_count"] == 2
|
|
assert result["confidence"] in {"low", "medium"}
|
|
```
|
|
|
|
- [ ] **Step 2: Run tests to verify failure**
|
|
|
|
Run:
|
|
|
|
```bash
|
|
cd app-instance/backend
|
|
pytest tests/unit/test_skill_learning_surrogate.py -v
|
|
```
|
|
|
|
Expected: FAIL because `surrogate.py` does not exist.
|
|
|
|
- [ ] **Step 3: Implement deterministic surrogate v1**
|
|
|
|
Create `app-instance/backend/beaver/skills/learning/surrogate.py`:
|
|
|
|
```python
|
|
"""Surrogate evaluation for replay tool calls that cannot execute safely."""
|
|
|
|
from __future__ import annotations
|
|
|
|
from typing import Any
|
|
|
|
|
|
class SurrogateToolEvaluator:
|
|
async def evaluate(self, *, task_text: str, baseline: dict[str, Any], candidate: dict[str, Any]) -> dict[str, Any]:
|
|
baseline_score = _score_arm(task_text, baseline)
|
|
candidate_score = _score_arm(task_text, candidate)
|
|
surrogate_count = _mode_count(baseline, "surrogate") + _mode_count(candidate, "surrogate")
|
|
blocked_count = _mode_count(baseline, "blocked") + _mode_count(candidate, "blocked")
|
|
confidence = "low" if blocked_count else ("medium" if surrogate_count <= 2 else "low")
|
|
return {
|
|
"baseline_score": baseline_score,
|
|
"candidate_score": candidate_score,
|
|
"delta": round(candidate_score - baseline_score, 4),
|
|
"surrogate_tool_count": surrogate_count,
|
|
"blocked_tool_count": blocked_count,
|
|
"confidence": confidence,
|
|
"notes": [
|
|
"Surrogate score is based on intended tool calls, schemas, arguments, and task relevance.",
|
|
],
|
|
}
|
|
|
|
|
|
def _score_arm(task_text: str, arm: dict[str, Any]) -> float:
|
|
calls = [item for item in arm.get("tool_calls") or [] if isinstance(item, dict)]
|
|
if not calls:
|
|
return 0.5
|
|
scores = [_score_call(task_text, call) for call in calls]
|
|
return round(sum(scores) / len(scores), 4)
|
|
|
|
|
|
def _score_call(task_text: str, call: dict[str, Any]) -> float:
|
|
if call.get("mode") == "blocked":
|
|
return 0.2
|
|
if call.get("mode") == "executed":
|
|
result = call.get("result") if isinstance(call.get("result"), dict) else {}
|
|
return 0.85 if result.get("success") is not False else 0.35
|
|
arguments = dict(call.get("arguments") or {})
|
|
if not arguments:
|
|
return 0.45
|
|
non_empty = sum(1 for value in arguments.values() if str(value).strip())
|
|
completeness = non_empty / max(1, len(arguments))
|
|
argument_text = " ".join(str(value).lower() for value in arguments.values())
|
|
relevance = 0.15 if any(token and token in argument_text for token in task_text.lower().split()[:16]) else 0.0
|
|
return round(min(0.9, 0.5 + 0.3 * completeness + relevance), 4)
|
|
|
|
|
|
def _mode_count(arm: dict[str, Any], mode: str) -> int:
|
|
return sum(1 for item in arm.get("tool_calls") or [] if isinstance(item, dict) and item.get("mode") == mode)
|
|
```
|
|
|
|
Export `SurrogateToolEvaluator` from `__init__.py`.
|
|
|
|
- [ ] **Step 4: Run surrogate tests**
|
|
|
|
Run:
|
|
|
|
```bash
|
|
cd app-instance/backend
|
|
pytest tests/unit/test_skill_learning_surrogate.py -v
|
|
```
|
|
|
|
Expected: PASS.
|
|
|
|
- [ ] **Step 5: Commit**
|
|
|
|
```bash
|
|
git add app-instance/backend/beaver/skills/learning/surrogate.py app-instance/backend/beaver/skills/learning/__init__.py app-instance/backend/tests/unit/test_skill_learning_surrogate.py
|
|
git commit -m "feat(skill-learning): add surrogate tool evaluator"
|
|
```
|
|
|
|
---
|
|
|
|
### Task 8: Add Replay Arm Runner
|
|
|
|
**Files:**
|
|
- Modify: `app-instance/backend/beaver/skills/learning/replay.py`
|
|
- Test: `app-instance/backend/tests/unit/test_skill_learning_replay_runner.py`
|
|
|
|
- [ ] **Step 1: Write failing arm runner test**
|
|
|
|
Create `app-instance/backend/tests/unit/test_skill_learning_replay_runner.py`:
|
|
|
|
```python
|
|
from __future__ import annotations
|
|
|
|
import asyncio
|
|
from types import SimpleNamespace
|
|
|
|
from beaver.skills.learning.replay import ReplayArmRequest, ReplayRunner
|
|
|
|
|
|
class FakeAgentLoop:
|
|
def boot(self):
|
|
return SimpleNamespace(tool_executor=SimpleNamespace(), tool_registry=SimpleNamespace(get=lambda name: None))
|
|
|
|
async def process_direct(self, task: str, **kwargs):
|
|
executor = kwargs["tool_executor_override"]
|
|
await executor.execute("mcp_outlook_send_email", {"to": "ada@example.com"})
|
|
return SimpleNamespace(session_id="session-replay", run_id="run-replay", output_text="done", finish_reason="stop")
|
|
|
|
|
|
def test_replay_runner_returns_arm_report_with_tool_trace() -> None:
|
|
runner = ReplayRunner(agent_loop=FakeAgentLoop())
|
|
request = ReplayArmRequest(
|
|
case_id="case-1",
|
|
arm="candidate",
|
|
task_text="Send a status email to Ada.",
|
|
pinned_skill_names=[],
|
|
pinned_skill_contexts=[],
|
|
provider_bundle=object(),
|
|
model_settings={"max_tool_iterations": 2},
|
|
)
|
|
|
|
report = asyncio.run(runner.run_arm(request))
|
|
|
|
assert report["case_id"] == "case-1"
|
|
assert report["arm"] == "candidate"
|
|
assert report["finish_reason"] == "stop"
|
|
assert report["tool_calls"][0]["tool_name"] == "mcp_outlook_send_email"
|
|
```
|
|
|
|
- [ ] **Step 2: Run test to verify failure**
|
|
|
|
Run:
|
|
|
|
```bash
|
|
cd app-instance/backend
|
|
pytest tests/unit/test_skill_learning_replay_runner.py -v
|
|
```
|
|
|
|
Expected: FAIL because `ReplayRunner` and `ReplayArmRequest` are not implemented.
|
|
|
|
- [ ] **Step 3: Add arm request and runner**
|
|
|
|
Append to `app-instance/backend/beaver/skills/learning/replay.py`:
|
|
|
|
```python
|
|
@dataclass(slots=True)
|
|
class ReplayArmRequest:
|
|
case_id: str
|
|
arm: str
|
|
task_text: str
|
|
pinned_skill_names: list[str] = field(default_factory=list)
|
|
pinned_skill_contexts: list[Any] = field(default_factory=list)
|
|
provider_bundle: Any | None = None
|
|
model_settings: dict[str, Any] = field(default_factory=dict)
|
|
|
|
|
|
class ReplayRunner:
|
|
def __init__(self, *, agent_loop: Any, policy: ReplayToolPolicy | None = None) -> None:
|
|
self.agent_loop = agent_loop
|
|
self.policy = policy or ReplayToolPolicy()
|
|
|
|
async def run_arm(self, request: ReplayArmRequest) -> dict[str, Any]:
|
|
loaded = self.agent_loop.boot()
|
|
replay_executor = ReplayToolExecutor(
|
|
loaded.tool_executor,
|
|
registry=loaded.tool_registry,
|
|
policy=self.policy,
|
|
)
|
|
result = await self.agent_loop.process_direct(
|
|
request.task_text,
|
|
provider_bundle=request.provider_bundle,
|
|
include_skill_assembly=False,
|
|
include_tools=True,
|
|
pinned_skill_names=request.pinned_skill_names,
|
|
pinned_skill_contexts=request.pinned_skill_contexts,
|
|
max_tool_iterations=int(request.model_settings.get("max_tool_iterations") or 4),
|
|
temperature=float(request.model_settings.get("temperature") or 0.0),
|
|
source="skill_replay_eval",
|
|
tool_executor_override=replay_executor,
|
|
)
|
|
return {
|
|
"case_id": request.case_id,
|
|
"arm": request.arm,
|
|
"session_id": result.session_id,
|
|
"run_id": result.run_id,
|
|
"task_text": request.task_text,
|
|
"finish_reason": result.finish_reason,
|
|
"final_answer": result.output_text,
|
|
"tool_calls": list(replay_executor.traces),
|
|
"artifacts": [],
|
|
"side_effects": _side_effects_from_traces(replay_executor.traces),
|
|
}
|
|
|
|
|
|
def _side_effects_from_traces(traces: list[dict[str, Any]]) -> list[dict[str, Any]]:
|
|
effects: list[dict[str, Any]] = []
|
|
for trace in traces:
|
|
if trace.get("mode") in {"surrogate", "blocked"}:
|
|
effects.append(
|
|
{
|
|
"tool_name": trace.get("tool_name"),
|
|
"mode": trace.get("mode"),
|
|
"arguments": trace.get("arguments"),
|
|
"classification_reason": trace.get("classification_reason"),
|
|
}
|
|
)
|
|
return effects
|
|
```
|
|
|
|
Export `ReplayRunner` and `ReplayArmRequest` from `__init__.py`.
|
|
|
|
- [ ] **Step 4: Run arm runner test**
|
|
|
|
Run:
|
|
|
|
```bash
|
|
cd app-instance/backend
|
|
pytest tests/unit/test_skill_learning_replay_runner.py -v
|
|
```
|
|
|
|
Expected: PASS.
|
|
|
|
- [ ] **Step 5: Commit**
|
|
|
|
```bash
|
|
git add app-instance/backend/beaver/skills/learning/replay.py app-instance/backend/beaver/skills/learning/__init__.py app-instance/backend/tests/unit/test_skill_learning_replay_runner.py
|
|
git commit -m "feat(skill-learning): run replay arms through agent loop"
|
|
```
|
|
|
|
---
|
|
|
|
### Task 9: Orchestrate Replay Eval Reports
|
|
|
|
**Files:**
|
|
- Modify: `app-instance/backend/beaver/skills/learning/eval.py`
|
|
- Modify: `app-instance/backend/beaver/skills/learning/pipeline.py`
|
|
- Test: `app-instance/backend/tests/unit/test_skill_learning_eval.py`
|
|
|
|
- [ ] **Step 1: Add failing replay eval test**
|
|
|
|
Append to `app-instance/backend/tests/unit/test_skill_learning_eval.py`:
|
|
|
|
```python
|
|
class FakeReplayRunner:
|
|
async def run_arm(self, request):
|
|
return {
|
|
"case_id": request.case_id,
|
|
"arm": request.arm,
|
|
"session_id": "session-replay",
|
|
"run_id": f"{request.arm}-run",
|
|
"task_text": request.task_text,
|
|
"finish_reason": "stop",
|
|
"final_answer": "done",
|
|
"tool_calls": [
|
|
{
|
|
"tool_name": "write_file",
|
|
"mode": "executed",
|
|
"arguments": {"path": "README.md"},
|
|
"result": {"success": True, "content": "ok"},
|
|
}
|
|
],
|
|
"artifacts": [],
|
|
"side_effects": [],
|
|
}
|
|
|
|
|
|
def test_eval_report_includes_replay_case_and_coverage(tmp_path: Path) -> None:
|
|
pipeline = _pipeline(tmp_path)
|
|
draft = pipeline.draft_service.create_new_skill_draft(
|
|
skill_name="release-checklist",
|
|
proposed_content="# Release\n\nRun tests.",
|
|
proposed_frontmatter={"description": "release", "tools": []},
|
|
created_by="test",
|
|
reason="test",
|
|
)
|
|
pipeline.learning_store.update_learning_candidate(
|
|
"candidate-1",
|
|
draft_skill_name=draft.skill_name,
|
|
draft_id=draft.draft_id,
|
|
)
|
|
|
|
report = asyncio.run(
|
|
pipeline.evaluate_draft(
|
|
"candidate-1",
|
|
draft.skill_name,
|
|
draft.draft_id,
|
|
provider_bundle=_bundle(),
|
|
replay_runner=FakeReplayRunner(),
|
|
)
|
|
)
|
|
|
|
assert report.mode == "replay"
|
|
assert report.eval_version == "replay-v1"
|
|
assert report.case_reports
|
|
assert 0.0 <= report.execution_coverage <= 1.0
|
|
assert 0.0 <= report.surrogate_coverage <= 1.0
|
|
assert report.confidence in {"low", "medium", "high"}
|
|
```
|
|
|
|
- [ ] **Step 2: Run the new test to verify failure**
|
|
|
|
Run:
|
|
|
|
```bash
|
|
cd app-instance/backend
|
|
pytest tests/unit/test_skill_learning_eval.py::test_eval_report_includes_replay_case_and_coverage -v
|
|
```
|
|
|
|
Expected: FAIL because evaluator still returns `mode="heuristic"`.
|
|
|
|
- [ ] **Step 3: Update evaluator imports**
|
|
|
|
In `app-instance/backend/beaver/skills/learning/eval.py`, import:
|
|
|
|
```python
|
|
from beaver.engine.context.builder import SkillContext
|
|
from beaver.skills.learning.case_selection import select_replay_cases
|
|
from beaver.skills.learning.replay import ReplayArmRequest, ReplayRunner
|
|
from beaver.skills.learning.surrogate import SurrogateToolEvaluator
|
|
from beaver.skills.learning.preservation import check_preservation
|
|
```
|
|
|
|
- [ ] **Step 4: Add optional replay runner dependency**
|
|
|
|
Update `SkillDraftEvaluator.__init__()`:
|
|
|
|
```python
|
|
def __init__(
|
|
self,
|
|
run_store: RunMemoryStore,
|
|
*,
|
|
surrogate_evaluator: SurrogateToolEvaluator | None = None,
|
|
) -> None:
|
|
self.run_store = run_store
|
|
self.surrogate_evaluator = surrogate_evaluator or SurrogateToolEvaluator()
|
|
```
|
|
|
|
Update `evaluate()` to accept the runner explicitly:
|
|
|
|
```python
|
|
async def evaluate(
|
|
self,
|
|
candidate: SkillLearningCandidate,
|
|
draft: SkillDraft,
|
|
*,
|
|
provider_bundle: ProviderBundle | None = None,
|
|
replay_runner: ReplayRunner | None = None,
|
|
) -> SkillDraftEvalReport:
|
|
```
|
|
|
|
Keep the existing provider unavailable branch. If `provider_bundle` is absent, return the current skipped-provider report. If `replay_runner` is absent, use the extracted heuristic evaluator so tests and workers without a loop do not crash.
|
|
|
|
- [ ] **Step 5: Pass replay runner through pipeline**
|
|
|
|
In `app-instance/backend/beaver/skills/learning/pipeline.py`, update `evaluate_draft()`:
|
|
|
|
```python
|
|
async def evaluate_draft(
|
|
self,
|
|
candidate_id: str,
|
|
skill_name: str,
|
|
draft_id: str,
|
|
*,
|
|
provider_bundle: ProviderBundle | None = None,
|
|
replay_runner: ReplayRunner | None = None,
|
|
) -> SkillDraftEvalReport:
|
|
```
|
|
|
|
Pass `replay_runner=replay_runner` to `evaluator.evaluate(...)`.
|
|
|
|
- [ ] **Step 6: Build replay cases in evaluate()**
|
|
|
|
Inside `evaluate()`, replace the heuristic-only case loop with:
|
|
|
|
```python
|
|
runs = self.run_store.list_runs()
|
|
replay_cases = select_replay_cases(candidate, runs)
|
|
if replay_runner is not None and replay_cases:
|
|
return await self._evaluate_replay(
|
|
candidate=candidate,
|
|
draft=draft,
|
|
replay_cases=replay_cases,
|
|
provider_bundle=provider_bundle,
|
|
replay_runner=replay_runner,
|
|
)
|
|
```
|
|
|
|
Keep the existing heuristic body as `_evaluate_heuristic()` and call it when no replay cases exist.
|
|
|
|
- [ ] **Step 7: Add `_evaluate_replay()`**
|
|
|
|
Add method:
|
|
|
|
```python
|
|
async def _evaluate_replay(
|
|
self,
|
|
*,
|
|
candidate: SkillLearningCandidate,
|
|
draft: SkillDraft,
|
|
replay_cases: list[dict],
|
|
provider_bundle: ProviderBundle,
|
|
replay_runner: ReplayRunner,
|
|
) -> SkillDraftEvalReport:
|
|
case_reports: list[dict] = []
|
|
legacy_cases: list[dict] = []
|
|
for case in replay_cases:
|
|
baseline = await replay_runner.run_arm(
|
|
ReplayArmRequest(
|
|
case_id=f"{case['run_id']}:baseline",
|
|
arm="baseline",
|
|
task_text=str(case["task_text"]),
|
|
pinned_skill_names=list(case.get("baseline_skill_names") or []),
|
|
pinned_skill_contexts=[],
|
|
provider_bundle=provider_bundle,
|
|
model_settings={"max_tool_iterations": 4, "temperature": 0.0},
|
|
)
|
|
)
|
|
candidate_arm = await replay_runner.run_arm(
|
|
ReplayArmRequest(
|
|
case_id=f"{case['run_id']}:candidate",
|
|
arm="candidate",
|
|
task_text=str(case["task_text"]),
|
|
pinned_skill_names=[],
|
|
pinned_skill_contexts=[_draft_skill_context(draft)],
|
|
provider_bundle=provider_bundle,
|
|
model_settings={"max_tool_iterations": 4, "temperature": 0.0},
|
|
)
|
|
)
|
|
surrogate = await self.surrogate_evaluator.evaluate(
|
|
task_text=str(case["task_text"]),
|
|
baseline=baseline,
|
|
candidate=candidate_arm,
|
|
)
|
|
baseline_score = surrogate["baseline_score"]
|
|
candidate_score = surrogate["candidate_score"]
|
|
case_report = {
|
|
"run_id": case["run_id"],
|
|
"task_id": case.get("task_id"),
|
|
"session_id": case.get("session_id"),
|
|
"baseline": baseline,
|
|
"candidate": candidate_arm,
|
|
"baseline_score": baseline_score,
|
|
"candidate_score": candidate_score,
|
|
"delta": round(candidate_score - baseline_score, 4),
|
|
"confidence": surrogate["confidence"],
|
|
"validator_notes": list(surrogate.get("notes") or []),
|
|
}
|
|
case_reports.append(case_report)
|
|
legacy_cases.append(
|
|
{
|
|
"run_id": case["run_id"],
|
|
"session_id": case.get("session_id") or "",
|
|
"baseline_score": baseline_score,
|
|
"candidate_score": candidate_score,
|
|
"delta": round(candidate_score - baseline_score, 4),
|
|
}
|
|
)
|
|
preservation_report = _preservation_report(candidate, draft)
|
|
return _report_from_case_reports(candidate, draft, case_reports, legacy_cases, preservation_report)
|
|
```
|
|
|
|
- [ ] **Step 8: Add helper functions**
|
|
|
|
Add module-level helpers in `eval.py`:
|
|
|
|
```python
|
|
def _draft_skill_context(draft: SkillDraft) -> SkillContext:
|
|
tool_hints = draft.proposed_frontmatter.get("tools")
|
|
return SkillContext(
|
|
name=f"draft:{draft.skill_name}",
|
|
content=draft.proposed_content,
|
|
version=draft.draft_id,
|
|
content_hash="draft",
|
|
activation_reason="skill_replay_eval_candidate",
|
|
tool_hints=[str(item) for item in tool_hints if str(item).strip()] if isinstance(tool_hints, list) else [],
|
|
)
|
|
|
|
|
|
def _preservation_report(candidate: SkillLearningCandidate, draft: SkillDraft) -> dict | None:
|
|
if candidate.kind not in {"revise_skill", "merge_skills"}:
|
|
return None
|
|
base_content = str(candidate.evidence.get("base_content") or "") if isinstance(candidate.evidence, dict) else ""
|
|
if not base_content.strip():
|
|
return None
|
|
return check_preservation(base_content=base_content, draft_content=draft.proposed_content)
|
|
|
|
|
|
def _report_from_case_reports(
|
|
candidate: SkillLearningCandidate,
|
|
draft: SkillDraft,
|
|
case_reports: list[dict],
|
|
legacy_cases: list[dict],
|
|
preservation_report: dict | None,
|
|
) -> SkillDraftEvalReport:
|
|
baseline_avg = sum(item["baseline_score"] for item in legacy_cases) / len(legacy_cases)
|
|
candidate_avg = sum(item["candidate_score"] for item in legacy_cases) / len(legacy_cases)
|
|
regressions = [item for item in legacy_cases if item["candidate_score"] < item["baseline_score"]]
|
|
improved = [item for item in legacy_cases if item["candidate_score"] > item["baseline_score"]]
|
|
unchanged = len(legacy_cases) - len(regressions) - len(improved)
|
|
execution, surrogate, blocked = _coverage(case_reports)
|
|
confidence = _confidence(execution, surrogate, blocked, [item.get("confidence") for item in case_reports])
|
|
score_delta = candidate_avg - baseline_avg
|
|
passed = candidate_avg >= 0.75 and not (regressions and score_delta <= 0) and blocked < 1.0
|
|
return SkillDraftEvalReport(
|
|
report_id=uuid4().hex,
|
|
skill_name=draft.skill_name,
|
|
draft_id=draft.draft_id,
|
|
candidate_id=candidate.candidate_id,
|
|
passed=passed,
|
|
baseline_score_avg=round(baseline_avg, 4),
|
|
candidate_score_avg=round(candidate_avg, 4),
|
|
score_delta=round(score_delta, 4),
|
|
regression_count=len(regressions),
|
|
improved_count=len(improved),
|
|
unchanged_count=unchanged,
|
|
cases=legacy_cases,
|
|
status="completed",
|
|
created_at=_utc_now(),
|
|
eval_version="replay-v1",
|
|
mode="replay",
|
|
execution_coverage=execution,
|
|
surrogate_coverage=surrogate,
|
|
blocked_coverage=blocked,
|
|
confidence=confidence,
|
|
case_reports=case_reports,
|
|
tool_mode_summary={"executed": execution, "surrogate": surrogate, "blocked": blocked},
|
|
preservation_report=preservation_report,
|
|
)
|
|
```
|
|
|
|
Add `_coverage()` and `_confidence()`:
|
|
|
|
```python
|
|
def _coverage(case_reports: list[dict]) -> tuple[float, float, float]:
|
|
counts = {"executed": 0, "surrogate": 0, "blocked": 0}
|
|
for report in case_reports:
|
|
for arm_name in ("baseline", "candidate"):
|
|
arm = report.get(arm_name) if isinstance(report.get(arm_name), dict) else {}
|
|
for call in arm.get("tool_calls") or []:
|
|
if isinstance(call, dict) and call.get("mode") in counts:
|
|
counts[str(call["mode"])] += 1
|
|
total = sum(counts.values())
|
|
if total == 0:
|
|
return 1.0, 0.0, 0.0
|
|
return (
|
|
round(counts["executed"] / total, 4),
|
|
round(counts["surrogate"] / total, 4),
|
|
round(counts["blocked"] / total, 4),
|
|
)
|
|
|
|
|
|
def _confidence(execution: float, surrogate: float, blocked: float, case_confidences: list[object]) -> str:
|
|
if blocked > 0.0:
|
|
return "low"
|
|
if execution >= 0.75 and surrogate <= 0.25:
|
|
return "high"
|
|
if execution >= 0.25 or "medium" in case_confidences:
|
|
return "medium"
|
|
return "low"
|
|
```
|
|
|
|
- [ ] **Step 9: Run eval test**
|
|
|
|
Run:
|
|
|
|
```bash
|
|
cd app-instance/backend
|
|
pytest tests/unit/test_skill_learning_eval.py::test_eval_report_includes_replay_case_and_coverage -v
|
|
```
|
|
|
|
Expected: PASS.
|
|
|
|
- [ ] **Step 10: Run all skill learning backend tests touched so far**
|
|
|
|
Run:
|
|
|
|
```bash
|
|
cd app-instance/backend
|
|
pytest tests/unit/test_skill_learning_eval_report_model.py tests/unit/test_skill_learning_case_selection.py tests/unit/test_skill_learning_preservation.py tests/unit/test_skill_learning_replay.py tests/unit/test_skill_learning_replay_runner.py tests/unit/test_skill_learning_surrogate.py tests/unit/test_skill_learning_eval.py -v
|
|
```
|
|
|
|
Expected: PASS.
|
|
|
|
- [ ] **Step 11: Commit**
|
|
|
|
```bash
|
|
git add app-instance/backend/beaver/skills/learning/eval.py app-instance/backend/beaver/skills/learning/pipeline.py app-instance/backend/tests/unit/test_skill_learning_eval.py
|
|
git commit -m "feat(skill-learning): produce replay eval reports"
|
|
```
|
|
|
|
---
|
|
|
|
### Task 10: Apply Preservation Report During Eval And Publish Gates
|
|
|
|
**Files:**
|
|
- Modify: `app-instance/backend/beaver/skills/learning/eval.py`
|
|
- Modify: `app-instance/backend/beaver/skills/learning/pipeline.py`
|
|
- Test: `app-instance/backend/tests/unit/test_skill_learning_pipeline.py`
|
|
|
|
- [ ] **Step 1: Write failing publish gate test**
|
|
|
|
Append to `app-instance/backend/tests/unit/test_skill_learning_pipeline.py`:
|
|
|
|
```python
|
|
def test_publish_blocks_low_confidence_replay_report(tmp_path: Path) -> None:
|
|
pipeline = _pipeline(tmp_path)
|
|
draft = pipeline.draft_service.create_new_skill_draft(
|
|
skill_name="low-confidence",
|
|
proposed_content="# Low\n\nDo it.",
|
|
proposed_frontmatter={"description": "low", "tools": ["mcp_outlook_send_email"]},
|
|
created_by="test",
|
|
reason="test",
|
|
)
|
|
pipeline.learning_store.write_eval_report(
|
|
SkillDraftEvalReport(
|
|
report_id="eval-low",
|
|
skill_name=draft.skill_name,
|
|
draft_id=draft.draft_id,
|
|
candidate_id="candidate-1",
|
|
passed=True,
|
|
baseline_score_avg=0.7,
|
|
candidate_score_avg=0.9,
|
|
score_delta=0.2,
|
|
regression_count=0,
|
|
improved_count=1,
|
|
unchanged_count=0,
|
|
confidence="low",
|
|
mode="replay",
|
|
eval_version="replay-v1",
|
|
execution_coverage=0.0,
|
|
surrogate_coverage=1.0,
|
|
blocked_coverage=0.0,
|
|
)
|
|
)
|
|
pipeline.check_safety(draft.skill_name, draft.draft_id)
|
|
pipeline.submit_review(draft.skill_name, draft.draft_id, requested_by="tester")
|
|
pipeline.approve(draft.skill_name, draft.draft_id, reviewer="tester")
|
|
|
|
with pytest.raises(ValueError, match="low confidence"):
|
|
pipeline.publish(draft.skill_name, draft.draft_id, publisher="tester")
|
|
```
|
|
|
|
Add import if missing:
|
|
|
|
```python
|
|
from beaver.memory.skills import SkillDraftEvalReport
|
|
```
|
|
|
|
- [ ] **Step 2: Run test to verify failure**
|
|
|
|
Run:
|
|
|
|
```bash
|
|
cd app-instance/backend
|
|
pytest tests/unit/test_skill_learning_pipeline.py::test_publish_blocks_low_confidence_replay_report -v
|
|
```
|
|
|
|
Expected: FAIL because publish gate does not check confidence.
|
|
|
|
- [ ] **Step 3: Update publish gates**
|
|
|
|
In `SkillLearningPipelineService._validate_publish_gates()`, after the existing eval pass check, add:
|
|
|
|
```python
|
|
if eval_report is not None and eval_report.mode == "replay":
|
|
if eval_report.confidence == "low":
|
|
raise ValueError("Draft replay eval has low confidence and requires revision before publish")
|
|
if eval_report.blocked_coverage >= 1.0:
|
|
raise ValueError("Draft replay eval blocked all important tool calls")
|
|
preservation = eval_report.preservation_report or {}
|
|
if preservation.get("passed") is False:
|
|
raise ValueError("Draft preservation check did not pass")
|
|
```
|
|
|
|
- [ ] **Step 4: Add preservation gate test**
|
|
|
|
Append to `app-instance/backend/tests/unit/test_skill_learning_pipeline.py`:
|
|
|
|
```python
|
|
def test_publish_blocks_failed_preservation_report(tmp_path: Path) -> None:
|
|
pipeline = _pipeline(tmp_path)
|
|
draft = pipeline.draft_service.create_new_skill_draft(
|
|
skill_name="dropped-section",
|
|
proposed_content="# Skill\n\n## Workflow\n\nDo it.",
|
|
proposed_frontmatter={"description": "dropped", "tools": []},
|
|
created_by="test",
|
|
reason="test",
|
|
)
|
|
pipeline.learning_store.write_eval_report(
|
|
SkillDraftEvalReport(
|
|
report_id="eval-preservation",
|
|
skill_name=draft.skill_name,
|
|
draft_id=draft.draft_id,
|
|
candidate_id="candidate-1",
|
|
passed=True,
|
|
baseline_score_avg=0.7,
|
|
candidate_score_avg=0.9,
|
|
score_delta=0.2,
|
|
regression_count=0,
|
|
improved_count=1,
|
|
unchanged_count=0,
|
|
confidence="medium",
|
|
mode="replay",
|
|
eval_version="replay-v1",
|
|
preservation_report={"passed": False, "risk_level": "high", "dropped_sections": ["Safety"]},
|
|
)
|
|
)
|
|
pipeline.check_safety(draft.skill_name, draft.draft_id)
|
|
pipeline.submit_review(draft.skill_name, draft.draft_id, requested_by="tester")
|
|
pipeline.approve(draft.skill_name, draft.draft_id, reviewer="tester")
|
|
|
|
with pytest.raises(ValueError, match="preservation"):
|
|
pipeline.publish(draft.skill_name, draft.draft_id, publisher="tester")
|
|
```
|
|
|
|
- [ ] **Step 5: Run pipeline tests**
|
|
|
|
Run:
|
|
|
|
```bash
|
|
cd app-instance/backend
|
|
pytest tests/unit/test_skill_learning_pipeline.py::test_publish_blocks_low_confidence_replay_report tests/unit/test_skill_learning_pipeline.py::test_publish_blocks_failed_preservation_report -v
|
|
```
|
|
|
|
Expected: PASS.
|
|
|
|
- [ ] **Step 6: Run pipeline suite**
|
|
|
|
Run:
|
|
|
|
```bash
|
|
cd app-instance/backend
|
|
pytest tests/unit/test_skill_learning_pipeline.py tests/unit/test_skill_learning_eval.py -v
|
|
```
|
|
|
|
Expected: PASS.
|
|
|
|
- [ ] **Step 7: Commit**
|
|
|
|
```bash
|
|
git add app-instance/backend/beaver/skills/learning/eval.py app-instance/backend/beaver/skills/learning/pipeline.py app-instance/backend/tests/unit/test_skill_learning_pipeline.py
|
|
git commit -m "feat(skill-learning): gate publish on replay confidence"
|
|
```
|
|
|
|
---
|
|
|
|
### Task 11: Update Web API Types And Skills UI
|
|
|
|
**Files:**
|
|
- Modify: `app-instance/backend/beaver/interfaces/web/app.py`
|
|
- Modify: `app-instance/frontend/types/index.ts`
|
|
- Modify: `app-instance/frontend/app/(app)/skills/page.tsx`
|
|
|
|
- [ ] **Step 1: Inject replay runner in Web API draft evaluation**
|
|
|
|
In `app-instance/backend/beaver/interfaces/web/app.py`, import:
|
|
|
|
```python
|
|
from beaver.skills.learning.replay import ReplayRunner
|
|
```
|
|
|
|
In both `synthesize_skill_draft()` and `regenerate_skill_draft()`, keep the loop object instead of only the loaded result:
|
|
|
|
```python
|
|
loop = agent_service.create_loop()
|
|
loaded = loop.boot()
|
|
```
|
|
|
|
Update each `evaluate_draft(...)` call:
|
|
|
|
```python
|
|
await loaded.skill_learning_pipeline.evaluate_draft( # type: ignore[union-attr]
|
|
candidate_id,
|
|
draft.skill_name,
|
|
draft.draft_id,
|
|
provider_bundle=provider_bundle,
|
|
replay_runner=ReplayRunner(agent_loop=loop),
|
|
)
|
|
```
|
|
|
|
This makes the normal UI path use real replay arms through the isolated `skill_replay_eval` source.
|
|
|
|
- [ ] **Step 2: Update TypeScript types**
|
|
|
|
In `app-instance/frontend/types/index.ts`, extend `SkillDraftEvalReport`:
|
|
|
|
```ts
|
|
export interface SkillDraftEvalReport {
|
|
report_id: string;
|
|
skill_name: string;
|
|
draft_id: string;
|
|
candidate_id: string;
|
|
passed: boolean;
|
|
baseline_score_avg: number;
|
|
candidate_score_avg: number;
|
|
score_delta: number;
|
|
regression_count: number;
|
|
improved_count: number;
|
|
unchanged_count: number;
|
|
cases: Array<Record<string, unknown>>;
|
|
status: string;
|
|
created_at: string;
|
|
eval_version?: string;
|
|
mode?: 'heuristic' | 'replay' | string;
|
|
execution_coverage?: number;
|
|
surrogate_coverage?: number;
|
|
blocked_coverage?: number;
|
|
confidence?: 'low' | 'medium' | 'high' | string;
|
|
case_reports?: Array<Record<string, unknown>>;
|
|
tool_mode_summary?: Record<string, unknown>;
|
|
preservation_report?: Record<string, unknown> | null;
|
|
}
|
|
```
|
|
|
|
- [ ] **Step 3: Add UI metric tiles**
|
|
|
|
In `EvalReportPanel` in `app-instance/frontend/app/(app)/skills/page.tsx`, add metric tiles after Delta:
|
|
|
|
```tsx
|
|
<MetricTile label={t('执行覆盖', 'Execution')} value={formatPercent(report.execution_coverage)} />
|
|
<MetricTile label={t('替代评估', 'Surrogate')} value={formatPercent(report.surrogate_coverage)} />
|
|
<MetricTile label={t('置信度', 'Confidence')} value={report.confidence || 'low'} />
|
|
```
|
|
|
|
Add helper near `formatScore`:
|
|
|
|
```tsx
|
|
function formatPercent(value?: number | null): string {
|
|
if (typeof value !== 'number' || Number.isNaN(value)) return '0%';
|
|
return `${Math.round(value * 100)}%`;
|
|
}
|
|
```
|
|
|
|
- [ ] **Step 4: Add replay details sections**
|
|
|
|
Inside `EvalReportPanel`, after existing replay cases table, add:
|
|
|
|
```tsx
|
|
{Array.isArray(report.case_reports) && report.case_reports.length > 0 ? (
|
|
<RawDetails title={t('Replay case reports', 'Replay case reports')} value={report.case_reports} />
|
|
) : null}
|
|
{report.preservation_report ? (
|
|
<RawDetails title={t('Preservation report', 'Preservation report')} value={report.preservation_report} />
|
|
) : null}
|
|
```
|
|
|
|
- [ ] **Step 5: Run backend web API test**
|
|
|
|
Run:
|
|
|
|
```bash
|
|
cd app-instance/backend
|
|
pytest tests/unit/test_skill_learning_web_api.py -v
|
|
```
|
|
|
|
Expected: PASS.
|
|
|
|
- [ ] **Step 6: Run frontend type check or lint**
|
|
|
|
Run:
|
|
|
|
```bash
|
|
cd app-instance/frontend
|
|
npm run lint
|
|
```
|
|
|
|
Expected: PASS. If this repo does not define `lint`, run:
|
|
|
|
```bash
|
|
cd app-instance/frontend
|
|
npm test -- --runInBand
|
|
```
|
|
|
|
Expected: PASS or existing unrelated failures documented in the final implementation report.
|
|
|
|
- [ ] **Step 7: Commit**
|
|
|
|
```bash
|
|
git add app-instance/backend/beaver/interfaces/web/app.py app-instance/frontend/types/index.ts app-instance/frontend/app/\(app\)/skills/page.tsx
|
|
git commit -m "feat(skills-ui): show replay eval coverage"
|
|
```
|
|
|
|
---
|
|
|
|
### Task 12: Final Verification
|
|
|
|
**Files:**
|
|
- No new source files.
|
|
|
|
- [ ] **Step 1: Run backend skill learning tests**
|
|
|
|
Run:
|
|
|
|
```bash
|
|
cd app-instance/backend
|
|
pytest tests/unit/test_skill_learning_eval_report_model.py tests/unit/test_skill_learning_case_selection.py tests/unit/test_skill_learning_preservation.py tests/unit/test_skill_learning_replay.py tests/unit/test_skill_learning_surrogate.py tests/unit/test_skill_learning_eval.py tests/unit/test_skill_learning_pipeline.py tests/unit/test_skill_learning_worker.py tests/unit/test_skill_learning_web_api.py -v
|
|
```
|
|
|
|
Expected: PASS.
|
|
|
|
- [ ] **Step 2: Run frontend verification**
|
|
|
|
Run:
|
|
|
|
```bash
|
|
cd app-instance/frontend
|
|
npm run lint
|
|
```
|
|
|
|
Expected: PASS. If unavailable, run the closest existing frontend test command and record the exact command.
|
|
|
|
- [ ] **Step 3: Inspect git status**
|
|
|
|
Run:
|
|
|
|
```bash
|
|
git status --short
|
|
```
|
|
|
|
Expected: only intentional implementation changes are present; unrelated user changes remain untouched.
|
|
|
|
- [ ] **Step 4: Commit any final fixes**
|
|
|
|
If verification required fixes:
|
|
|
|
```bash
|
|
git add <changed-files>
|
|
git commit -m "fix(skill-learning): stabilize replay eval"
|
|
```
|
|
|
|
Expected: commit succeeds.
|