# Skill-Templated Task Graph Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Extend the existing Beaver task-team runtime so optional Skill templates guide minimal, generic-worker execution graphs with node tool scopes and evidence completion gates.

**Architecture:** `ExecutionGraph`, `ExecutionNode`, and `LocalAgentRunner` remain the only Team runtime model. An optional JSON block in an activated Skill becomes planner input; the planner validates and adapts it into the existing graph; workers enforce graph-provided tool/evidence contracts; tool-free synthesis receives a deterministic complete/incomplete outcome.

**Tech Stack:** Python 3.12, dataclasses, asyncio, pytest, existing Beaver catalog, engine, coordinator, and task runtime.

---

## File Structure

- `beaver/skills/catalog/utils.py`: optional template parser.
- `beaver/skills/catalog/loader.py`, `engine/context/builder.py`, `skills/assembler/task_assembler.py`: carry parse output into `SkillContext`.
- `beaver/coordinator/models.py`: defaults-only extensions to current graph/node/result contracts.
- `beaver/tasks/planner.py`: template-aware task-only JSON schema, repair, graph validation, and adaptation event metadata.
- `beaver/engine/loader.py`: inject registry into planner.
- `beaver/engine/loop.py`, `tools/runtime/executor.py`, `coordinator/local.py`: node allowlist and budget enforcement.
- `beaver/tasks/evidence.py`, `coordinator/execution/scheduler.py`, `tasks/attempt_orchestrator.py`: evidence completion and incomplete synthesis gate.

## Execution Reporting Rule

Do not commit automatically. After every task, stop and report the modified-file list, exact test command and result, `git diff --stat` summary, and remaining risks. Commit only when the user explicitly asks.

### Task 1: Parse and Propagate Optional Skill Templates

**Files:**
- Modify: `app-instance/backend/beaver/skills/catalog/utils.py`
- Modify: `app-instance/backend/beaver/skills/catalog/loader.py`
- Modify: `app-instance/backend/beaver/engine/context/builder.py`
- Modify: `app-instance/backend/beaver/skills/assembler/task_assembler.py`
- Create: `app-instance/backend/tests/unit/test_skill_team_template.py`

- [ ] **Step 1: Write failing parser tests**

```python
from beaver.skills.catalog.utils import extract_skill_team_template


def test_extract_team_template_returns_none_when_block_is_absent() -> None:
    result = extract_skill_team_template("# Ordinary Skill")
    assert result.template is None
    assert result.warnings == []


def test_extract_team_template_parses_valid_json_block() -> None:
    # The fence body must be separated by real newlines, not literal "\n" text.
    result = extract_skill_team_template(
        "```beaver-team-template\n"
        '{"version": 1, "nodes": [{"node_id": "collect", "task": "Collect"}]}\n```'
    )
    assert result.template["nodes"][0]["node_id"] == "collect"


def test_invalid_template_is_warning_not_skill_load_failure() -> None:
    result = extract_skill_team_template("```beaver-team-template\nnot-json\n```")
    assert result.template is None
    assert result.warnings == ["team template JSON is invalid"]
```

- [ ] **Step 2: Run it to verify failure**

Run: `cd app-instance/backend && uv run pytest tests/unit/test_skill_team_template.py -q`

Expected: FAIL because `extract_skill_team_template` does not exist.

- [ ] **Step 3: Implement the parser and propagation**

```python
import json
import re
from dataclasses import dataclass, field
from typing import Any


@dataclass(slots=True)
class SkillTeamTemplateParseResult:
    template: dict[str, Any] | None = None
    warnings: list[str] = field(default_factory=list)


def extract_skill_team_template(body: str) -> SkillTeamTemplateParseResult:
    # Raw string: \s and \n are regex escapes, so they must not be doubled.
    matches = re.findall(r"```beaver-team-template\s*\n(.*?)\n```", body, re.DOTALL)
    if not matches:
        return SkillTeamTemplateParseResult()
    if len(matches) != 1:
        return SkillTeamTemplateParseResult(warnings=["skill defines multiple team templates"])
    try:
        value = json.loads(matches[0])
    except json.JSONDecodeError:
        return SkillTeamTemplateParseResult(warnings=["team template JSON is invalid"])
    if not isinstance(value, dict) or not isinstance(value.get("nodes", []), list):
        return SkillTeamTemplateParseResult(warnings=["team template must be an object with a nodes list"])
    return SkillTeamTemplateParseResult(template=value)
```

Add defaults-only `team_template` and `team_template_warnings` fields to `SkillRecord` and `SkillContext`; populate them from stripped Skill body in loader/assembler. Keep the original Skill body available for normal prompt injection.

- [ ] **Step 4: Run parser and assembler regression tests**

Run: `cd app-instance/backend && uv run pytest tests/unit/test_skill_team_template.py tests/unit/test_skill_assembler.py -q`

Expected: PASS.

- [ ] **Step 5: Stop and report; do not commit**

Report the modified files, parser/assembler test result, `git diff --stat`, and any template compatibility risk. Do not commit unless explicitly asked.

### Task 2: Extend Existing Graph Contracts

**Files:**
- Modify: `app-instance/backend/beaver/coordinator/models.py`
- Modify: `app-instance/backend/tests/unit/test_agent_team_v1.py`

- [ ] **Step 1: Write failing contract and depth tests**

```python
def test_execution_node_contracts_default_for_existing_callers() -> None:
    node = ExecutionNode("collect", "Collect", AgentDescriptor(name="collect"))
    assert node.allowed_tool_names is None
    assert node.required_evidence == []
    assert node.evidence_contract == {}
    assert node.required_for_completion is True
    assert node.block_downstream_on_partial is False


def test_graph_rejects_depth_above_configured_limit() -> None:
    graph = ExecutionGraph(
        strategy="dag",
        nodes=[
            ExecutionNode("a", "A", AgentDescriptor(name="a")),
            ExecutionNode("b", "B", AgentDescriptor(name="b"), depends_on=["a"]),
            ExecutionNode("c", "C", AgentDescriptor(name="c"), depends_on=["b"]),
        ],
    )
    with pytest.raises(ValueError, match="max depth"):
        graph.validate(max_depth=2)
```

- [ ] **Step 2: Run it to verify failure**

Run: `cd app-instance/backend && uv run pytest tests/unit/test_agent_team_v1.py -q`

Expected: FAIL because fields and `max_depth` do not exist.

- [ ] **Step 3: Add only defaults to the current dataclasses**

```python
input_contract: dict[str, Any] = field(default_factory=dict)
output_contract: dict[str, Any] = field(default_factory=dict)
allowed_tool_names: list[str] | None = None
required_evidence: list[str] = field(default_factory=list)
evidence_contract: dict[str, Any] = field(default_factory=dict)
validation_rules: list[str] = field(default_factory=list)
required_for_completion: bool = True
block_downstream_on_partial: bool = False
max_tool_iterations: int | None = None
```

Use `allowed_tool_names: list[str] | None = None`, not a default empty list. `None` means no node-level scope and keeps legacy behavior; `[]` explicitly disables tools; a populated list is the node allowlist. Add runtime-relevant values to `DelegationEnvelope`.
Add `completion_status="succeeded"` and `evidence_gaps=[]` to `NodeRunResult`. Extend `ExecutionGraph.validate(max_depth: int | None = None)` to calculate longest dependency chain with its existing DFS and raise only when an explicit limit is exceeded.

- [ ] **Step 4: Run the coordinator regression test**

Run: `cd app-instance/backend && uv run pytest tests/unit/test_agent_team_v1.py -q`

Expected: PASS.

- [ ] **Step 5: Stop and report; do not commit**

Report the modified files, coordinator test result, `git diff --stat`, and compatibility risk for existing direct graph callers. Do not commit unless explicitly asked.

### Task 3: Adapt Templates Into Generic Task Graphs

**Files:**
- Modify: `app-instance/backend/beaver/tasks/planner.py`
- Modify: `app-instance/backend/beaver/tasks/attempt_orchestrator.py`
- Modify: `app-instance/backend/beaver/engine/loader.py`
- Modify: `app-instance/backend/tests/unit/test_task_execution_planner.py`

- [ ] **Step 1: Write failing planner tests**

```python
def test_template_plan_creates_generic_worker_not_role_agent() -> None:
    plan = TaskExecutionPlanner(tool_registry=_registry()).from_json(
        '{"mode":"team","strategy":"dag","nodes":[{"node_id":"collect","task":"Collect",'
        '"requested_tools":["web_search"]}],"adaptation":{"template_used":true}}'
    )
    node = plan.graph.nodes[0]
    assert node.agent.role == ""
    assert node.agent.metadata["sub_agent_kind"] == "generic_skill_worker"
    assert plan.planner_adaptation["template_used"] is True


def test_unknown_tool_is_removed_and_warned() -> None:
    plan = TaskExecutionPlanner(tool_registry=_registry()).from_json(
        '{"mode":"team","strategy":"sequence","nodes":[{"node_id":"collect","task":"Collect",'
        '"requested_tools":["web_search","not_real"]}]}'
    )
    assert plan.graph.nodes[0].allowed_tool_names == ["web_search"]
    assert "unknown tool removed: not_real" in plan.planner_adaptation["warnings"]


def test_high_risk_tool_is_removed_without_failing_low_risk_plan() -> None:
    plan = TaskExecutionPlanner(tool_registry=_registry()).from_json(
        '{"mode":"team","strategy":"sequence","nodes":[{"node_id":"collect","task":"Collect",'
        '"requested_tools":["web_search","terminal"]}]}'
    )
    assert plan.graph.nodes[0].allowed_tool_names == ["web_search"]
    assert "requires_high_risk_review: terminal" in plan.planner_adaptation["warnings"]
```

- [ ] **Step 2: Run it to verify failure**

Run: `cd app-instance/backend && uv run pytest tests/unit/test_task_execution_planner.py -q`

Expected: FAIL because planner has no template context, registry policy, or adaptation report.

- [ ] **Step 3: Implement task-only planner schema and one repair attempt**

Add `tool_registry: ToolRegistry | None` to `TaskExecutionPlanner`. Change `plan()` to receive `activated_skills: list[SkillContext]`, select at most one valid template, and include it in `_prompt`. Add `planner_adaptation: dict[str, Any] = field(default_factory=dict)` to `TaskExecutionPlan` and `to_event_payload()`.

Accept only `node_id`, `task`, `depends_on`, `input_contract`, `output_contract`, `requested_tools`, `required_evidence`, `evidence_contract`, `validation_rules`, `required_for_completion`, `block_downstream_on_partial`, `max_tool_iterations`, and `constraints`. Reject `agent` and `role`; construct `AgentDescriptor(name=node_id, role="", system_prompt="", metadata={"sub_agent_kind": "generic_skill_worker", ...})` internally.

Resolve requested names through registry plus a conservative interim name-based risk policy. Treat `terminal`, `execute_command`, `write_file`, `delete_file`, `external_send`, and `send_email` as high-risk until stable `ToolSpec.metadata` risk fields exist. Write allowed names to `ExecutionNode.allowed_tool_names`; remove unknown/high-risk names and record warnings. Unknown tools never fail the whole plan; high-risk tools add `requires_high_risk_review` and are never auto-approved.

Validate node count, dependencies, cycles, and `graph.validate(max_depth=4)`.

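
The tool-resolution rule described above can be sketched in isolation. This is a minimal illustration, not the planner's actual code: `resolve_requested_tools`, its signature, and the plain `set` standing in for `ToolRegistry` are hypothetical, and checking for unknown names before high-risk names is an assumption (so `terminal` must be registered to produce the high-risk warning). The warning strings follow the Step 1 tests.

```python
from __future__ import annotations

# Interim name-based policy copied from the list in this step; it is meant to
# be replaced once ToolSpec.metadata carries stable risk fields.
HIGH_RISK_TOOL_NAMES = frozenset(
    {"terminal", "execute_command", "write_file", "delete_file", "external_send", "send_email"}
)


def resolve_requested_tools(
    requested: list[str] | None,
    registered_names: set[str],
) -> tuple[list[str] | None, list[str]]:
    """Return (allowed_tool_names, warnings) for one node (hypothetical helper)."""
    if requested is None:
        # Assumption: a node that requests nothing keeps the legacy, unscoped behavior.
        return None, []
    allowed: list[str] = []
    warnings: list[str] = []
    for name in requested:
        if name not in registered_names:
            # Unknown tools are dropped with a warning; they never fail the plan.
            warnings.append(f"unknown tool removed: {name}")
        elif name in HIGH_RISK_TOOL_NAMES:
            # High-risk tools are never auto-approved.
            warnings.append(f"requires_high_risk_review: {name}")
        elif name not in allowed:
            allowed.append(name)
    return allowed, warnings
```

Returning the warnings alongside the allowlist keeps the function pure, so the planner can merge them into `planner_adaptation["warnings"]` in one place.
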
If first provider output is invalid, make exactly one `tools=None` repair request containing validation errors; if it is still invalid, return `TaskExecutionPlan.single("planner_fallback_single", fallback_error=...)`. Update `TaskAttemptOrchestrator` to pass `preselected_skills`, and `EngineLoader` to construct planner with its registry.

- [ ] **Step 4: Run planner and task-mode regression tests**

Run: `cd app-instance/backend && uv run pytest tests/unit/test_task_execution_planner.py tests/unit/test_task_mode_feedback.py -q`

Expected: PASS.

- [ ] **Step 5: Stop and report; do not commit**

Report the modified files, planner/task-mode test result, `git diff --stat`, and any risk-policy false-positive risk. Do not commit unless explicitly asked.

### Task 4: Enforce Node Tool Allowlists

**Files:**
- Modify: `app-instance/backend/beaver/engine/loop.py`
- Modify: `app-instance/backend/beaver/tools/runtime/executor.py`
- Modify: `app-instance/backend/beaver/coordinator/local.py`
- Modify: `app-instance/backend/tests/unit/test_agent_loop.py`
- Create: `app-instance/backend/tests/unit/test_team_node_tool_policy.py`

- [ ] **Step 1: Write failing schema and executor tests**

```python
def test_team_node_exposes_only_allowed_tool_schema() -> None:
    asyncio.run(loop.process_direct("collect", allowed_tool_names=["web_search"]))
    assert _tool_names(provider.calls[0]["tools"]) == ["web_search"]


def test_none_tool_scope_preserves_legacy_selection_and_empty_scope_disables_all() -> None:
    asyncio.run(loop.process_direct("collect", allowed_tool_names=None))
    assert _tool_names(provider.calls[0]["tools"])
    asyncio.run(loop.process_direct("collect", allowed_tool_names=[]))
    assert _tool_names(provider.calls[1]["tools"]) == []


def test_executor_rejects_registered_tool_outside_node_allowlist() -> None:
    context = ToolContext(metadata={"allowed_tool_names": ["web_search"]})
    result = asyncio.run(executor.execute("write_file", {"path": "x", "content": "x"}, context=context))
    assert result.success is False
    assert result.error == "tool_not_allowed"
```

- [ ] **Step 2: Run it to verify failure**

Run: `cd app-instance/backend && uv run pytest tests/unit/test_team_node_tool_policy.py -q`

Expected: FAIL because allowlists are not accepted or enforced.

- [ ] **Step 3: Filter provider schemas and deny in executor**

Add `allowed_tool_names: list[str] | None = None` to `AgentLoop.process_direct()` and `_process_direct_impl()`. Filter assembled tool specs only when it is not `None`, and place it in `ToolContext.metadata`. Pass node scope and budget from `LocalAgentRunner`.

```python
allowed = context.metadata.get("allowed_tool_names") if context is not None else None
if isinstance(allowed, list) and tool_name not in allowed:
    return ToolResult(False, f"Tool {tool_name} is not allowed for this node.", tool_name, "tool_not_allowed")
```

Keep `None` distinct from `[]`: `None` preserves current single-agent behavior; an empty Team-node list exposes no tools.

- [ ] **Step 4: Run focused and loop regression tests**

Run: `cd app-instance/backend && uv run pytest tests/unit/test_team_node_tool_policy.py tests/unit/test_agent_loop.py -q`

Expected: PASS.

- [ ] **Step 5: Stop and report; do not commit**

Report the modified files, focused/loop test result, `git diff --stat`, and risk that a legacy caller accidentally passes `[]`. Do not commit unless explicitly asked.

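
The schema-side half of Task 4, Step 3 (filter only when the scope is not `None`) might look like the sketch below. `ToolSpecSketch` and `filter_tool_specs` are hypothetical names introduced for illustration; the real spec type and the place where `AgentLoop` assembles specs are not shown in this plan.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpecSketch:
    """Hypothetical stand-in for Beaver's real tool spec type."""

    name: str


def filter_tool_specs(
    specs: list[ToolSpecSketch],
    allowed_tool_names: list[str] | None,
) -> list[ToolSpecSketch]:
    """Apply a node-level tool scope to the assembled specs."""
    if allowed_tool_names is None:
        # None: no node-level scope, so legacy selection is preserved.
        return list(specs)
    # [] falls through to an empty result; a populated list acts as an allowlist.
    allowed = set(allowed_tool_names)
    return [spec for spec in specs if spec.name in allowed]
```

Testing `is None` rather than truthiness is the point of the sketch: `if not allowed_tool_names` would silently treat `[]` as "no scope" and re-enable every tool.
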
### Task 5: Gate Node Success on Required Evidence

**Files:**
- Modify: `app-instance/backend/beaver/tasks/evidence.py`
- Modify: `app-instance/backend/beaver/coordinator/local.py`
- Modify: `app-instance/backend/beaver/coordinator/execution/scheduler.py`
- Modify: `app-instance/backend/tests/unit/test_agent_team_v1.py`
- Modify: `app-instance/backend/tests/unit/test_task_evidence.py`

- [ ] **Step 1: Write failing evidence-completion tests**

```python
def test_node_without_required_tool_result_is_partial() -> None:
    result = asyncio.run(runner.run(_envelope(required_evidence=["tool_result"])))
    assert result.success is False
    assert result.completion_status == "partial"
    assert result.evidence_gaps == ["missing required evidence: tool_result"]


def test_node_without_evidence_requirement_keeps_legacy_success() -> None:
    result = asyncio.run(runner.run(_envelope(required_evidence=[])))
    assert result.success is True
    assert result.completion_status == "succeeded"


def test_dag_allows_partial_evidence_by_default() -> None:
    outcome = asyncio.run(
        scheduler.run(_graph_with_partial_collect_node(), parent_task_id=None, parent_session_id="s")
    )
    assert outcome.node_results[1].completion_status == "succeeded"


def test_dag_blocks_partial_node_only_when_node_requests_it() -> None:
    outcome = asyncio.run(
        scheduler.run(_graph_with_blocking_partial_collect_node(), parent_task_id=None, parent_session_id="s")
    )
    assert outcome.node_results[1].finish_reason == "blocked"
```

- [ ] **Step 2: Run it to verify failure**

Run: `cd app-instance/backend && uv run pytest tests/unit/test_agent_team_v1.py tests/unit/test_task_evidence.py -q`

Expected: FAIL because evidence requirements do not affect node success.

- [ ] **Step 3: Implement deterministic evidence checks**

Add `evaluate_node_evidence(evidence, required_evidence, output_text) -> list[str]`.
`required_evidence` is a coarse v1 gate: `tool_result` requires a successful tool result, `url` a tool result URL, and `output` non-empty output; any other requirement produces `unsupported evidence requirement: `. Do not interpret `evidence_contract` in v1.

After `LocalAgentRunner` builds `RunEvidence`, set `completion_status="partial"`, `success=False`, and gaps only when the node actually declares `required_evidence`. Leave existing no-requirement node success behavior unchanged. Scheduler always blocks `failed`/`blocked`; it passes partial output/evidence onward unless `block_downstream_on_partial=True`.

- [ ] **Step 4: Run coordinator and evidence regression tests**

Run: `cd app-instance/backend && uv run pytest tests/unit/test_agent_team_v1.py tests/unit/test_task_evidence.py -q`

Expected: PASS.

- [ ] **Step 5: Stop and report; do not commit**

Report the modified files, coordinator/evidence test result, `git diff --stat`, and the known coarse-evidence limitation. Do not commit unless explicitly asked.

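
The coarse evidence gate from Task 5, Step 3 could be sketched as follows. Only the function name, its three parameters, and the `missing required evidence: tool_result` message come from this plan. `ToolResultSketch` and `RunEvidenceSketch` are hypothetical stand-ins for the real `RunEvidence` shape, appending the requirement name to the unsupported-requirement message is an assumption, and so is accepting a URL from any tool result regardless of its success flag.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToolResultSketch:
    """Hypothetical tool result record; field names are assumptions."""

    success: bool
    url: str | None = None


@dataclass
class RunEvidenceSketch:
    """Hypothetical stand-in for the evidence LocalAgentRunner collects."""

    tool_results: list[ToolResultSketch] = field(default_factory=list)


def evaluate_node_evidence(
    evidence: RunEvidenceSketch,
    required_evidence: list[str],
    output_text: str,
) -> list[str]:
    """Return evidence gaps; an empty list means every requirement is met."""
    gaps: list[str] = []
    for requirement in required_evidence:
        if requirement == "tool_result":
            satisfied = any(r.success for r in evidence.tool_results)
        elif requirement == "url":
            satisfied = any(r.url for r in evidence.tool_results)
        elif requirement == "output":
            satisfied = bool(output_text.strip())
        else:
            # Unrecognized requirements are reported as gaps rather than ignored.
            gaps.append(f"unsupported evidence requirement: {requirement}")
            continue
        if not satisfied:
            gaps.append(f"missing required evidence: {requirement}")
    return gaps
```

Because an empty `required_evidence` list yields no gaps, a caller can set `completion_status="partial"` exactly when the returned list is non-empty, which leaves nodes without requirements on their legacy success path.
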
### Task 6: Gate Final Synthesis and Verify Finance Planning

**Files:**
- Modify: `app-instance/backend/beaver/tasks/attempt_orchestrator.py`
- Modify: `app-instance/backend/tests/unit/test_task_mode_feedback.py`
- Create: `app-instance/backend/tests/unit/test_task_team_synthesis_outcome.py`
- Modify: `app-instance/backend/tests/unit/test_task_execution_planner.py`
- Modify: `app-instance/backend/tests/unit/test_task_skill_resolver.py`

- [ ] **Step 1: Write failing outcome and finance tests**

```python
def test_required_partial_node_marks_synthesis_incomplete() -> None:
    context, prefix = orchestrator._team_synthesis_outcome(_plan(), _team_result(partial_required=True))
    assert "Task outcome: incomplete" in context
    # The user-facing prefix is Chinese for "Task incomplete:".
    assert prefix.startswith("任务未完成:")


def test_finance_template_adapts_to_read_only_task_graph() -> None:
    plan = planner.from_json(_finance_plan_json())
    assert [node.node_id for node in plan.graph.nodes] == [
        "collect_official_sources",
        "extract_financial_metrics",
        "validate_metrics",
        "generate_chart_report",
    ]
    assert all(node.agent.role == "" for node in plan.graph.nodes)
    assert plan.graph.nodes[0].allowed_tool_names == ["web_search", "web_fetch"]
    assert plan.graph.nodes[-1].allowed_tool_names == []
```

- [ ] **Step 2: Run the targeted test to verify failure**

Run: `cd app-instance/backend && uv run pytest tests/unit/test_task_team_synthesis_outcome.py tests/unit/test_task_execution_planner.py -q`

Expected: FAIL because outcome gate is absent.

- [ ] **Step 3: Add deterministic incomplete output**

Add `_team_synthesis_outcome(plan, result) -> tuple[str, str]`. Every `required_for_completion=True` node whose `completion_status` is not `succeeded` is incomplete. Context includes node id, status, error, and evidence gaps. Keep Team synthesis at `include_tools=False` and `max_tool_iterations=0`; prefix final output only when the incomplete notice is missing. Write `task_outcome` and `incomplete_node_ids` to `task_synthesis_completed`.

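
The outcome gate just described can be sketched as a pure function. This is an illustration under stated assumptions, not the orchestrator method itself: `NodeSketch`, `NodeResultSketch`, `team_synthesis_outcome`, and `apply_incomplete_prefix` are hypothetical names, the `Task outcome: complete` line and the exact context layout are assumptions, and only nodes that produced a result are considered. The `Task outcome: incomplete` marker and the `任务未完成:` prefix come from the Step 1 test.

```python
from __future__ import annotations

from dataclasses import dataclass, field

INCOMPLETE_PREFIX = "任务未完成:"  # "Task incomplete:", literal taken from the Step 1 test


@dataclass
class NodeSketch:
    """Hypothetical subset of ExecutionNode used by the gate."""

    node_id: str
    required_for_completion: bool = True


@dataclass
class NodeResultSketch:
    """Hypothetical subset of NodeRunResult used by the gate."""

    node_id: str
    completion_status: str = "succeeded"
    error: str | None = None
    evidence_gaps: list[str] = field(default_factory=list)


def team_synthesis_outcome(
    nodes: list[NodeSketch], results: list[NodeResultSketch]
) -> tuple[str, str]:
    """Return (synthesis context, output prefix); the prefix is empty when complete."""
    required = {n.node_id for n in nodes if n.required_for_completion}
    incomplete = [
        r for r in results if r.node_id in required and r.completion_status != "succeeded"
    ]
    if not incomplete:
        return "Task outcome: complete", ""
    lines = ["Task outcome: incomplete"]
    for r in incomplete:
        gaps = "; ".join(r.evidence_gaps) or "none"
        lines.append(
            f"- {r.node_id}: status={r.completion_status}, "
            f"error={r.error or 'none'}, evidence_gaps={gaps}"
        )
    prefix = INCOMPLETE_PREFIX + ", ".join(r.node_id for r in incomplete)
    return "\n".join(lines), prefix


def apply_incomplete_prefix(final_text: str, prefix: str) -> str:
    """Prepend the notice only when the synthesized text does not already carry it."""
    if not prefix or INCOMPLETE_PREFIX in final_text:
        return final_text
    return f"{prefix}\n\n{final_text}"
```

Keeping the gate free of provider calls is what makes the outcome deterministic: the synthesis model can phrase the report, but it cannot turn an incomplete task into a complete one.
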
Add `_finance_plan_json()` fixture with four task-oriented nodes and dependencies `collect -> extract -> validate -> report`. The report node explicitly uses `allowed_tool_names=[]`; source/extraction nodes request only `web_search`/`web_fetch`. Assert no node is named `researcher`, `writer`, or `reviewer`.

The report node may emit a comparison table, chart-ready data, Mermaid chart, Markdown chart section, or text-bar-chart fallback. It must not claim an image/file chart artifact unless a registered chart-renderer tool exists and passes policy.

- [ ] **Step 4: Run complete backend unit suite**

Run: `cd app-instance/backend && uv run pytest tests/unit -q`

Expected: PASS. Fix only compatibility defects in this plan; do not change router, persistent agent registry, frontend, nested-team behavior, or Skill-learning eval semantics.

- [ ] **Step 5: Stop and report; do not commit**

Report the modified files, complete unit-suite result, `git diff --stat`, and all remaining boundaries. Do not commit unless explicitly asked.

## Plan Self-Review

- Coverage: parser compatibility, one-primary-template adaptation/repair, `None`/`[]`/allowlist scope semantics, interim high-risk filtering, partial propagation, coarse evidence completion, deterministic synthesis, and finance acceptance all have explicit tasks.
- Exclusions: no fixed role Agents, parallel Team model, nested graph execution, chart renderer, high-risk approval UI, frontend work, or Skill-eval redesign appears in the implementation scope.
- Compatibility: all new graph fields are defaults-only; `allowed_tool_names=None` preserves legacy behavior, `[]` explicitly disables tools, and evidence gating activates only when `required_evidence` is declared.