beaver_project/docs/superpowers/specs/2026-06-22-skill-templated-task-graph-design.md

# Skill-Templated Task Graph Design

## Status

Approved for implementation planning on 2026-06-22. This document records the design only; it does not change runtime code.

## Decision

Beaver Agent Team remains a temporary, task-oriented execution graph. It is not a collection of persistent specialist roles. The implementation extends the existing `ExecutionGraph`, `ExecutionNode`, `LocalAgentRunner`, and `TeamGraphScheduler`; it does not add a parallel Team model or fixed Researcher/Writer/Reviewer classes.

```text
Task + activated Skills + runtime tool policy
  -> Planner adapts an optional Skill template
  -> validated ExecutionGraph
  -> generic workers execute nodes under node constraints
  -> evidence-aware completion gate
  -> tool-free final synthesis from node evidence
```

## Existing Baseline

`TaskAttemptOrchestrator` currently preselects Skills, invokes `TaskExecutionPlanner`, executes an optional graph through `TeamService`, and runs the main agent as final synthesis. `ExecutionGraph` already validates sequence, parallel, and DAG dependencies. Each node is a generic `LocalAgentRunner` invocation of the shared `AgentLoop`; planner-created nodes have an empty role.

`RunEvidence` and `ToolEvidence` already capture transcripts and tool results. The gap is semantic: a node is currently successful when its `finish_reason` is `stop`, even if its task contract requires evidence and none was produced.

Skills currently have simple Markdown frontmatter plus body text and optional tool hints. The catalog parser deliberately has no general YAML dependency. Tool assembly currently selects always-on tools, Skill hints, and semantic-retrieval matches; it is not an execution-time node allowlist. Skill safety checks protect draft publication, not task execution.

## Scope

In scope:

- optional Skill planning templates;
- adaptive minimal graph planning and planner repair;
- node contracts, node tool scopes, evidence requirements, and completion states;
- deterministic handling of unknown and high-risk tool hints;
- grounded synthesis status and audit events;
- unit and integration coverage.

Out of scope:

- persistent role Agent classes or a role marketplace;
- a second graph/model hierarchy beside `ExecutionGraph` and `ExecutionNode`;
- recursive or unlimited nested teams;
- a distributed worker system;
- a high-risk approval UI or new approval API;
- chart-image rendering.

The current runtime registers `web_search` and `web_fetch` but no chart renderer. The finance acceptance case may produce an evidence-backed comparison table, chart-ready data, a Mermaid chart, a Markdown chart section, a text-bar-chart fallback, and a final textual report. It must not claim that an image/file chart artifact was generated unless a registered chart-renderer tool exists and passes runtime safety policy.

## Data Model Evolution

`ExecutionGraph` remains the graph model. `ExecutionNode` gains the following optional fields, each with a default:

```python
from dataclasses import field
from typing import Any

# New optional ExecutionNode fields; every one has a compatibility default.
input_contract: dict[str, object] = field(default_factory=dict)
output_contract: dict[str, object] = field(default_factory=dict)
allowed_tool_names: list[str] | None = None  # None = legacy selection, [] = no tools
required_evidence: list[str] = field(default_factory=list)
evidence_contract: dict[str, Any] = field(default_factory=dict)  # reserved; not interpreted in v1
validation_rules: list[str] = field(default_factory=list)
required_for_completion: bool = True
block_downstream_on_partial: bool = False
max_tool_iterations: int | None = None
```

`allowed_tool_names` has three non-overlapping meanings: `None` means node-level tool scope is disabled and retains legacy tool selection; `[]` explicitly prohibits every tool for this node; a populated list permits only those registered, policy-allowed tools. Existing callers retain behavior because the default is `None`.
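The three cases can be sketched as a small resolver. This is a minimal illustration, not the runtime's implementation: `resolve_node_tools`, `legacy_selection`, and `registered` are hypothetical names, and the policy filter is left out for brevity.

```python
from typing import Optional


def resolve_node_tools(
    allowed_tool_names: Optional[list[str]],
    legacy_selection: list[str],
    registered: set[str],
) -> list[str]:
    """Return the tool names a node may see (hypothetical helper)."""
    if allowed_tool_names is None:
        # Node-level scope disabled: keep whatever legacy selection produced.
        return list(legacy_selection)
    # [] yields an empty result; a populated list is narrowed to names
    # that are actually registered. Policy filtering is omitted here.
    return [name for name in allowed_tool_names if name in registered]


registered = {"web_search", "web_fetch"}
legacy = ["web_search", "web_fetch"]
print(resolve_node_tools(None, legacy, registered))
print(resolve_node_tools([], legacy, registered))
print(resolve_node_tools(["web_fetch", "terminal"], legacy, registered))
```

Keeping `None` and `[]` distinct is what lets existing callers keep legacy behavior while new callers can prohibit tools outright.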

`NodeRunResult` remains the node-output container. It gains `completion_status` (`succeeded`, `partial`, `failed`, or `blocked`) and `evidence_gaps`. `success` remains a compatibility field. Nodes without `required_evidence` retain the current `finish_reason == "stop"` success behavior. For a node that declares evidence requirements, a completed run with missing required evidence becomes `partial` and has `success=False`.

`failed` and `blocked` always block dependent nodes. `partial` does not imply successful completion, but its output and evidence remain consumable by downstream nodes unless `block_downstream_on_partial=True`. Any required-for-completion node that is partial still forces the final task outcome to `incomplete`.
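The status rules above can be sketched as two small functions. The function names are hypothetical, and mapping any non-`stop` finish reason to `failed` is an assumption made for illustration; the design only specifies the `stop` case.

```python
def classify_completion(
    finish_reason: str,
    required_evidence: list[str],
    evidence_gaps: list[str],
) -> tuple[str, bool]:
    """Return (completion_status, success) for one node run."""
    if finish_reason != "stop":
        # Assumption: every non-"stop" finish is treated as a failure.
        return "failed", False
    if required_evidence and evidence_gaps:
        # The run finished, but declared evidence is missing.
        return "partial", False
    # Nodes without required_evidence keep the legacy success rule.
    return "succeeded", True


def blocks_downstream(status: str, block_downstream_on_partial: bool = False) -> bool:
    """Decide whether dependent nodes must be blocked."""
    if status in ("failed", "blocked"):
        return True
    return status == "partial" and block_downstream_on_partial


print(classify_completion("stop", ["url"], ["url"]))
print(blocks_downstream("partial"))
print(blocks_downstream("partial", block_downstream_on_partial=True))
```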

`TaskExecutionPlan` gains a planner-adaptation payload rather than a duplicate graph object. The payload records template source/version, whether it was used, added/removed/merged node ids, removed tool names, warnings, and fallback reason. It is written into the existing `task_execution_planned` event.
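One possible shape for the adaptation payload is sketched below. The class name, field names, and the `finance-skill` source value are hypothetical; only the categories of recorded information come from this design.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class PlannerAdaptation:
    # Hypothetical field names covering the items listed above.
    template_source: Optional[str] = None
    template_version: Optional[int] = None
    template_used: bool = False
    added_node_ids: list[str] = field(default_factory=list)
    removed_node_ids: list[str] = field(default_factory=list)
    merged_node_ids: list[str] = field(default_factory=list)
    removed_tool_names: list[str] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)
    fallback_reason: Optional[str] = None


adaptation = PlannerAdaptation(
    template_source="finance-skill",  # illustrative value only
    template_version=1,
    template_used=True,
    removed_tool_names=["terminal"],
)
# asdict() gives a plain dict that can ride along in an event payload.
event_payload = {"adaptation": asdict(adaptation)}
print(event_payload["adaptation"]["removed_tool_names"])
```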

No database migration is required in v1: graph and reports are transient execution state, while task/session events already persist plan metadata.

## Skill Template Format

A Skill body may contain one optional fenced block tagged `beaver-team-template` whose content is JSON:

````md
## Team Planning Template

```beaver-team-template
{
  "version": 1,
  "team_when": ["multiple official sources require comparison"],
  "default_strategy": "dag",
  "nodes": [
    {
      "node_id": "collect_official_sources",
      "task": "Collect official primary sources for the requested entities.",
      "allowed_tools": ["web_search", "web_fetch"],
      "required_evidence": ["tool_result"],
      "required_for_completion": true
    }
  ]
}
```
````

The parser uses `json.loads` only. It returns an absent-template result when the block is missing, and a warning result when the block is duplicated or its JSON is malformed or structurally invalid. A parsing warning never prevents normal Skill activation. Existing `SKILL.md` files remain valid without migration.
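A minimal sketch of such a parser follows. `TemplateParseResult` and `parse_team_template` are hypothetical names, and the structural check (a JSON object with a `nodes` list) is an assumed minimum rather than the full validation.

```python
import json
import re
from dataclasses import dataclass, field
from typing import Optional

FENCE = "`" * 3  # built programmatically so this example nests cleanly
_BLOCK = re.compile(FENCE + r"beaver-team-template[ \t]*\n(.*?)" + FENCE, re.DOTALL)


@dataclass
class TemplateParseResult:
    template: Optional[dict] = None  # None means absent or unusable
    warnings: list[str] = field(default_factory=list)


def parse_team_template(skill_body: str) -> TemplateParseResult:
    blocks = _BLOCK.findall(skill_body)
    if not blocks:
        return TemplateParseResult()  # absent template, no warning
    if len(blocks) > 1:
        return TemplateParseResult(warnings=["duplicate beaver-team-template block"])
    try:
        data = json.loads(blocks[0])
    except json.JSONDecodeError as exc:
        return TemplateParseResult(warnings=[f"malformed template JSON: {exc.msg}"])
    # Assumed minimal structural check.
    if not isinstance(data, dict) or not isinstance(data.get("nodes"), list):
        return TemplateParseResult(warnings=["template must be an object with a 'nodes' list"])
    return TemplateParseResult(template=data)


body = "intro\n" + FENCE + 'beaver-team-template\n{"version": 1, "nodes": []}\n' + FENCE + "\n"
print(parse_team_template(body))
```

Every failure path returns a result object instead of raising, so the caller can log the warning and continue with ordinary Skill activation.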

The template is an LLM input, not an executable workflow. It supplies candidate nodes and constraints. The Planner may remove unnecessary nodes, merge trivial nodes, add essential validation, or choose single mode. It may not add an unregistered tool, bypass node tool policy, exceed graph limits, or convert a node into a role Agent.

## Planner Design

`TaskAttemptOrchestrator` passes activated `SkillContext` objects to the planner rather than only truncated summaries. v1 supports one primary applicable Skill Team Template; other activated Skills remain ordinary guidance. Template composition, sub-skill guidance composition, and multi-Skill planning are explicitly deferred rather than prohibited long-term.

Planner output uses a task-only JSON schema. It contains `mode`, `reason`, `strategy`, `nodes`, `final_synthesis_instruction`, and `adaptation`. Nodes contain task, dependencies, contracts, requested tools, evidence requirements, validation rules, and completion importance. `agent` and `role` are not accepted as planner schema fields; the adapter creates the existing empty-role `AgentDescriptor` itself.

Validation is layered:

1. extract a JSON object from the LLM response;
2. validate scalar/list/object shapes and allowed keys;
3. resolve requested tools against registry and policy;
4. construct `ExecutionGraph` and validate node count, depth, dependencies, and cycles;
5. if invalid, make one no-tools repair request containing the validation errors;
6. if repair fails, use the existing safe single-mode fallback.
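The repair-once-then-fallback flow can be sketched as follows. Only steps 1 and 2 are modeled here; tool resolution and graph validation are omitted. `validate_plan`, `plan_with_repair`, the two callables, and the fallback dictionary are hypothetical stand-ins.

```python
import json
from typing import Callable, Optional

ALLOWED_TOP_KEYS = {
    "mode", "reason", "strategy", "nodes",
    "final_synthesis_instruction", "adaptation",
}


def validate_plan(raw: str) -> tuple[Optional[dict], list[str]]:
    """Extract a JSON object and check keys; return (plan, errors)."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None, ["no JSON object found"]
    try:
        plan = json.loads(raw[start:end + 1])
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    errors = [f"unexpected key: {key}" for key in sorted(set(plan) - ALLOWED_TOP_KEYS)]
    nodes = plan.get("nodes", [])
    if not isinstance(nodes, list):
        errors.append("nodes must be a list")
        nodes = []
    for index, node in enumerate(nodes):
        if not isinstance(node, dict):
            errors.append(f"node {index} is not an object")
            continue
        for key in ("agent", "role"):  # not accepted as planner schema fields
            if key in node:
                errors.append(f"node {index}: '{key}' is not an accepted field")
    return (plan if not errors else None), errors


def plan_with_repair(
    ask_planner: Callable[[], str],
    ask_repair: Callable[[list[str]], str],
) -> dict:
    plan, errors = validate_plan(ask_planner())
    if plan is not None:
        return plan
    # Exactly one repair request, carrying the validation errors.
    plan, _ = validate_plan(ask_repair(errors))
    if plan is not None:
        return plan
    # Hypothetical shape for the safe single-mode fallback.
    return {"mode": "single", "reason": "planner output invalid after one repair"}


good = '{"mode": "team", "reason": "r", "strategy": "dag", "nodes": [{"task": "t"}]}'
bad = 'Plan: {"mode": "team", "nodes": [{"task": "t", "role": "Researcher"}]}'
print(plan_with_repair(lambda: bad, lambda errors: good)["mode"])
print(plan_with_repair(lambda: bad, lambda errors: bad)["mode"])
```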

Single mode remains the default for an obvious one-step request. Template presence is a reason to ask the planner, not a reason to force team mode. Existing environment disablement (`BEAVER_AGENT_TEAM_ENABLED`) remains authoritative.

## Tool Policy and Safety

For a Team node, the final allowlist is:

```text
template/node requested names
∩ registered tools
∩ node runtime policy
```

Skill hints are suggestions, not authority. The current code has no populated task-time user/workspace permission model, so v1 must not claim to enforce one. Instead, v1 applies a conservative interim tool-risk policy, which is not a complete task-time permission system. Until `ToolSpec.metadata` has stable fields such as `risk_level`, `mutating`, `external_side_effect`, `requires_approval`, and `readonly`, the interim policy relies on a name-based high-risk set such as `terminal`, `execute_command`, `write_file`, `delete_file`, `external_send`, and `send_email`. Under the interim policy:

- unknown names are removed and reported as planner warnings;
- read-only tools may remain available when the node requests them;
- high-risk/mutating names are removed by default and recorded as `requires_high_risk_review`;
- no node receives a broad tool set merely because a Skill hinted it.
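The intersection and the interim rules can be sketched together. `resolve_allowlist` is a hypothetical helper, and the high-risk set simply mirrors the example names listed in this section.

```python
# Name-based interim high-risk set, copied from the examples in this section.
HIGH_RISK_NAMES = frozenset({
    "terminal", "execute_command", "write_file",
    "delete_file", "external_send", "send_email",
})


def resolve_allowlist(
    requested: list[str],
    registered: set[str],
) -> tuple[list[str], list[str], list[str]]:
    """Return (allowed, unknown, requires_high_risk_review)."""
    allowed: list[str] = []
    unknown: list[str] = []
    high_risk: list[str] = []
    for name in requested:
        if name not in registered:
            unknown.append(name)  # reported as a planner warning
        elif name in HIGH_RISK_NAMES:
            high_risk.append(name)  # removed, never auto-approved
        elif name not in allowed:
            allowed.append(name)
    return allowed, unknown, high_risk


print(resolve_allowlist(
    ["web_search", "terminal", "chart_render"],
    {"web_search", "web_fetch", "terminal"},
))
```

Because the result starts from the requested names only, a node never gains a tool it did not ask for, even if that tool is registered and read-only.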

Provider schemas are filtered to the allowlist, and `ToolExecutor` performs a second allowlist check through `ToolContext.metadata`. This prevents a model-originated call to a registered but unexposed tool from executing.
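The second check can be sketched as a guard in front of tool dispatch. The `allowed_tool_names` metadata key, `ToolNotAllowedError`, and `execute_tool` are hypothetical; the real `ToolExecutor` and `ToolContext` interfaces may differ.

```python
from typing import Any, Callable


class ToolNotAllowedError(Exception):
    """Raised when a call targets a tool outside the node allowlist."""


def execute_tool(
    name: str,
    arguments: dict[str, Any],
    tools: dict[str, Callable[..., Any]],
    metadata: dict[str, Any],
) -> Any:
    # Assumed metadata key; a missing key or None means no node-level scope.
    allowed = metadata.get("allowed_tool_names")
    if allowed is not None and name not in allowed:
        raise ToolNotAllowedError(f"tool '{name}' is not in the node allowlist")
    return tools[name](**arguments)


tools = {"echo": lambda text: text, "terminal": lambda cmd: "ran"}
print(execute_tool("echo", {"text": "hi"}, tools, {"allowed_tool_names": ["echo"]}))
```

Filtering provider schemas alone only hides tools from the model; the guard covers the case where the model names a registered tool anyway.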

A real high-risk runtime approval flow requires a task lifecycle state and UI/API confirmation. It is out of scope; v1 removes high-risk names, records `requires_high_risk_review`, and explains the limitation rather than auto-approving.

## Runtime and Evidence Semantics

`DelegationEnvelope` receives node contracts, allowed tools, evidence requirements, and per-node tool budget. `LocalAgentRunner` passes the allowed tools and budget into the current `AgentLoop`, builds existing `RunEvidence`, and classifies completion.

`required_evidence` in v1 is a coarse node-level completion gate, not a field-level evidence contract. It can show that a node produced at least one URL or tool result; it cannot prove that every required company, reporting period, metric, and source is present. `evidence_contract: dict[str, Any]` is reserved for a later field-level contract and is not interpreted in v1.

The coarse requirements have deterministic meanings in v1:

- `tool_result`: at least one successful tool result;
- `url`: at least one tool result with a URL;
- `output`: non-empty node output;
- any other declared value: explicit evidence gap.
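These coarse checks can be sketched as one function. `ToolResult` is a hypothetical stand-in for the existing `ToolEvidence` records, whose real fields may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolResult:
    # Hypothetical, simplified view of one recorded tool result.
    success: bool
    url: Optional[str] = None


def find_evidence_gaps(
    required_evidence: list[str],
    tool_results: list[ToolResult],
    output: str,
) -> list[str]:
    """Return the declared requirements that the run did not satisfy."""
    gaps = []
    for requirement in required_evidence:
        if requirement == "tool_result":
            met = any(result.success for result in tool_results)
        elif requirement == "url":
            met = any(result.url for result in tool_results)
        elif requirement == "output":
            met = bool(output.strip())
        else:
            met = False  # unrecognized values are always explicit gaps
        if not met:
            gaps.append(requirement)
    return gaps


results = [ToolResult(success=True, url="https://example.com/report")]
print(find_evidence_gaps(["tool_result", "url", "citation"], results, "summary"))
```

Treating unrecognized values as gaps keeps a typo in a template from silently passing the completion gate.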

The scheduler keeps sequence/parallel/DAG semantics. Dependent nodes never run after an upstream `failed` or `blocked` result. A `partial` upstream result is passed onward as partial evidence by default; setting `block_downstream_on_partial=True` blocks dependents instead. The scheduler does not retry, recursively expand Skills, or create another Team graph.

Before final synthesis, `TaskAttemptOrchestrator` derives a task outcome:

- `complete`: every required-for-completion node succeeded;
- `incomplete`: any required node is partial, failed, or blocked, even if downstream synthesis produced a useful partial report;
- `single`: no Team graph ran.
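The outcome rule can be sketched as a pure function. `NodeOutcome` and `derive_task_outcome` are hypothetical names, and passing `None` to mean that no Team graph ran is a convention chosen for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeOutcome:
    # Hypothetical pairing of a node's flag with its run status.
    required_for_completion: bool
    completion_status: str  # "succeeded", "partial", "failed", or "blocked"


def derive_task_outcome(nodes: Optional[list[NodeOutcome]]) -> str:
    if nodes is None:
        return "single"  # no Team graph ran
    for node in nodes:
        if node.required_for_completion and node.completion_status != "succeeded":
            return "incomplete"
    return "complete"


print(derive_task_outcome(None))
print(derive_task_outcome([NodeOutcome(True, "succeeded"), NodeOutcome(False, "failed")]))
print(derive_task_outcome([NodeOutcome(True, "partial")]))
```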

Team synthesis continues to run with no tools. For `incomplete`, the synthesis context lists completed work, node failures, evidence gaps, and the deterministic task outcome. The returned user-facing answer is prefixed with an incomplete notice if the model omits it, so the runtime, not prompt compliance alone, prevents a false completion claim.
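The runtime-side guard can be sketched as a small post-processing step. The notice wording and the prefix-based omission check are assumptions; the design only requires that the notice be present for `incomplete` outcomes.

```python
# Hypothetical notice text; the real wording is not specified in this design.
INCOMPLETE_NOTICE = "Note: this task is incomplete."


def enforce_incomplete_notice(answer: str, task_outcome: str) -> str:
    """Prefix the notice when the outcome is incomplete and it is missing."""
    if task_outcome != "incomplete":
        return answer  # complete and single outcomes are never prefixed
    if answer.lstrip().startswith(INCOMPLETE_NOTICE):
        return answer  # the model already included the notice
    return f"{INCOMPLETE_NOTICE}\n\n{answer}"


print(enforce_incomplete_notice("Revenue comparison follows.", "incomplete"))
```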

## Nested Teams, Observability, and Compatibility

Nested graph execution is deferred. A node can resolve another Skill as guidance through the existing resolver, but cannot create a child `ExecutionGraph`. The runtime has no recursive budget ledger, tree-shaped evidence model, or UI fault navigator. Limiting v1 to a single graph keeps node failures attributable and cost bounded by existing node, parallel, and tool limits.

Existing task events receive the adaptation report, resolved tools, policy removals, completion status, and evidence gaps. Existing Skill learning/replay remains unchanged in v1. Template-specific scoring waits until execution semantics are stable.

Compatibility guarantees:

- Skills without templates activate and execute unchanged.
- Existing direct `ExecutionGraph` callers work because new fields have compatibility defaults; specifically, `allowed_tool_names=None` does not enable node-level scope and empty `required_evidence` does not enable evidence gating.
- Single-agent runs do not receive node tool policies or outcome prefixes.
- Existing external registry descriptors are not removed; planner-created Team nodes stay generic and role-empty.
- `TaskSkillResolver` remains the per-node published-Skill/ephemeral-guidance fallback.

## Verification Criteria

Implementation is accepted when tests prove that:

- old Skills and single-agent tasks retain behavior;
- templates parse and degrade with warnings;
- the planner emits only task-oriented generic workers;
- malformed output is repaired once or falls back;
- unknown/high-risk tools cannot execute;
- declared evidence controls node success;
- required-node failure forces incomplete synthesis;
- the MGM/Galaxy case uses read-only web tools to produce evidence-backed comparison/report output without claiming a chart renderer.