feat(tasks): add skill-templated task graph execution

2026-06-23 10:22:58 +08:00
parent 6843d89b2c
commit 53b13e8eac
53 changed files with 4773 additions and 756 deletions
--- a/docs/superpowers/specs/2026-06-22-skill-templated-task-graph-design.md
+++ b/docs/superpowers/specs/2026-06-22-skill-templated-task-graph-design.md
@@ -45,7 +45,7 @@ Out of scope:
 - a high-risk approval UI or new approval API;
 - chart-image rendering.

-The current runtime registers `web_search` and `web_fetch` but no chart renderer. The finance acceptance case therefore produces evidence-backed comparison data and a textual/Markdown report, not a fabricated chart artifact.
+The current runtime registers `web_search` and `web_fetch` but no chart renderer. The finance acceptance case may produce an evidence-backed comparison table, chart-ready data, a Mermaid chart, a Markdown chart section, a text-bar-chart fallback, and a final textual report. It must not claim that an image/file chart artifact was generated unless a registered chart-renderer tool exists and passes runtime safety policy.

 ## Data Model Evolution

@@ -54,16 +54,20 @@ The current runtime registers `web_search` and `web_fetch` but no chart renderer
 ```python
 input_contract: dict[str, object] = field(default_factory=dict)
 output_contract: dict[str, object] = field(default_factory=dict)
-allowed_tool_names: list[str] = field(default_factory=list)
+allowed_tool_names: list[str] | None = None
 required_evidence: list[str] = field(default_factory=list)
+evidence_contract: dict[str, Any] = field(default_factory=dict)
 validation_rules: list[str] = field(default_factory=list)
 required_for_completion: bool = True
+block_downstream_on_partial: bool = False
 max_tool_iterations: int | None = None
 ```

-Existing callers retain their behavior because empty lists and `None` impose no new node requirement.
+`allowed_tool_names` has three non-overlapping meanings: `None` means node-level tool scope is disabled and legacy tool selection is retained; `[]` explicitly prohibits every tool for this node; a populated list permits only those tools that are both registered and policy-allowed. Existing callers retain their behavior because the default is `None`.

-`NodeRunResult` remains the node-output container. It gains `completion_status` (`succeeded`, `partial`, `failed`, or `blocked`) and `evidence_gaps`. `success` remains for scheduler compatibility and is true only for `succeeded`. A completed run with missing required evidence is therefore `partial`, and downstream dependencies block exactly as they do for failed nodes.
+`NodeRunResult` remains the node-output container. It gains `completion_status` (`succeeded`, `partial`, `failed`, or `blocked`) and `evidence_gaps`. `success` remains a compatibility field. Nodes without `required_evidence` retain the current `finish_reason == "stop"` success behavior. For a node that declares evidence requirements, a completed run with missing required evidence becomes `partial` and has `success=False`.
+
+`failed` and `blocked` always block dependent nodes. `partial` does not imply successful completion, but its output and evidence remain consumable by downstream nodes unless `block_downstream_on_partial=True`. Any required-for-completion node that is partial still forces the final task outcome to `incomplete`.

 `TaskExecutionPlan` gains a planner-adaptation payload rather than a duplicate graph object. The payload records template source/version, whether it was used, added/removed/merged node ids, removed tool names, warnings, and fallback reason. It is written into the existing `task_execution_planned` event.
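
Reviewer sketch (not part of this commit): the three meanings of `allowed_tool_names` described in this hunk can be illustrated as below. The function name `resolve_node_tools` and the `registered` / `policy_allowed` sets are hypothetical and only stand in for whatever the runtime actually uses.

```python
from __future__ import annotations


def resolve_node_tools(
    allowed_tool_names: list[str] | None,
    registered: set[str],
    policy_allowed: set[str],
) -> set[str] | None:
    """Return the node's tool scope, or None when node-level scope is disabled.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    if allowed_tool_names is None:
        # Default: node-level scope is off, so legacy tool selection applies.
        return None
    # An empty list yields an empty set, which prohibits every tool.
    # A populated list keeps only names that are registered and policy-allowed.
    return {
        name
        for name in allowed_tool_names
        if name in registered and name in policy_allowed
    }


registered = {"web_search", "web_fetch", "write_file"}
policy_allowed = {"web_search", "web_fetch"}

legacy = resolve_node_tools(None, registered, policy_allowed)
no_tools = resolve_node_tools([], registered, policy_allowed)
scoped = resolve_node_tools(
    ["web_search", "write_file", "chart_render"], registered, policy_allowed
)
```

Returning `None` rather than an empty set keeps "scope disabled" distinguishable from "no tools allowed", which is the distinction the new default relies on.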

@@ -100,7 +104,7 @@ The template is an LLM input, not an executable workflow. It supplies candidate

 ## Planner Design

-`TaskAttemptOrchestrator` passes activated `SkillContext` objects to the planner rather than only truncated summaries. The planner chooses at most one applicable template for the first implementation; multiple activated Skills remain ordinary guidance. This avoids composing incompatible templates before there is evidence for a composition model.
+`TaskAttemptOrchestrator` passes activated `SkillContext` objects to the planner rather than only truncated summaries. v1 supports one primary applicable Skill Team Template; other activated Skills remain ordinary guidance. Template composition, sub-skill guidance composition, and multi-Skill planning are explicitly deferred rather than prohibited long-term.

 Planner output uses a task-only JSON schema. It contains `mode`, `reason`, `strategy`, `nodes`, `final_synthesis_instruction`, and `adaptation`. Nodes contain task, dependencies, contracts, requested tools, evidence requirements, validation rules, and completion importance. `agent` and `role` are not accepted as planner schema fields; the adapter creates the existing empty-role `AgentDescriptor` itself.
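
Reviewer sketch (not part of this commit): a minimal check of the task-only planner schema described above. The top-level field names come from the spec; the function name `validate_planner_output`, the node key `task`, and the error strings are assumptions made for illustration.

```python
from __future__ import annotations

# Top-level fields named in the spec.
PLAN_FIELDS = {
    "mode",
    "reason",
    "strategy",
    "nodes",
    "final_synthesis_instruction",
    "adaptation",
}
# Fields the spec says the planner schema does not accept on nodes.
REJECTED_NODE_FIELDS = {"agent", "role"}


def validate_planner_output(payload: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the payload passed.

    Hypothetical validator: it only checks the two rules quoted from the spec.
    """
    errors: list[str] = []
    unknown = sorted(set(payload) - PLAN_FIELDS)
    if unknown:
        errors.append(f"unknown top-level fields: {unknown}")
    for index, node in enumerate(payload.get("nodes", [])):
        rejected = sorted(REJECTED_NODE_FIELDS & set(node))
        if rejected:
            errors.append(f"node {index}: fields not accepted: {rejected}")
    return errors


good = {"mode": "team", "reason": "r", "nodes": [{"task": "collect filings"}]}
bad = {"mode": "team", "nodes": [{"task": "collect filings", "role": "analyst"}]}
```

Under these assumptions, `validate_planner_output(good)` returns an empty list, while `bad` yields one error naming `role`; the adapter would then attach the empty-role `AgentDescriptor` itself.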

@@ -125,7 +129,7 @@ template/node requested names
 ∩ node runtime policy
 ```

-Skill hints are suggestions, not authority. The current code has no populated task-time user/workspace permission model, so v1 must not claim that it enforces one. It uses a conservative node runtime policy:
+Skill hints are suggestions, not authority. The current code has no populated task-time user/workspace permission model, so v1 must not claim that it enforces one. v1 uses a conservative interim tool-risk policy, not a complete task-time permission system. Until `ToolSpec.metadata` has stable fields such as `risk_level`, `mutating`, `external_side_effect`, `requires_approval`, and `readonly`, the interim policy uses a conservative name-based high-risk set such as `terminal`, `execute_command`, `write_file`, `delete_file`, `external_send`, and `send_email`.

 - unknown names are removed and reported as planner warnings;
 - read-only tools may remain available when the node requests them;
@@ -134,25 +138,27 @@ Skill hints are suggestions, not authority. The current code has no populated ta

 Provider schemas are filtered to the allowlist, and `ToolExecutor` performs a second allowlist check through `ToolContext.metadata`. This prevents a model-originated call to a registered but unexposed tool from executing.

-A real high-risk approval flow requires a task lifecycle state and UI/API confirmation. It is deferred; v1 blocks and explains rather than auto-approving.
+A real high-risk runtime approval flow requires a task lifecycle state and UI/API confirmation. It is out of scope; v1 removes high-risk names, records `requires_high_risk_review`, and explains the limitation rather than auto-approving.
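
Reviewer sketch (not part of this commit): one way the interim name-based policy could behave, assuming the high-risk set quoted in this hunk. The function name `apply_interim_policy` and the shape of the returned report are hypothetical. Policy bullets elided from this diff view are not modeled.

```python
from __future__ import annotations

# Name-based high-risk set quoted in the spec (interim, not a permission system).
HIGH_RISK_NAMES = frozenset(
    {
        "terminal",
        "execute_command",
        "write_file",
        "delete_file",
        "external_send",
        "send_email",
    }
)


def apply_interim_policy(requested: list[str], registered: set[str]) -> dict:
    """Filter requested tool names and report what was removed.

    Hypothetical helper: the report keys are assumptions, except
    `requires_high_risk_review`, which the spec says v1 records.
    """
    allowed: list[str] = []
    warnings: list[str] = []
    requires_high_risk_review = False
    for name in requested:
        if name not in registered:
            # Unknown names are removed and surfaced as planner warnings.
            warnings.append(f"unknown tool removed: {name}")
        elif name in HIGH_RISK_NAMES:
            # High-risk names are removed, never auto-approved.
            requires_high_risk_review = True
            warnings.append(f"high-risk tool removed: {name}")
        else:
            allowed.append(name)
    return {
        "allowed": allowed,
        "warnings": warnings,
        "requires_high_risk_review": requires_high_risk_review,
    }


report = apply_interim_policy(
    ["web_search", "write_file", "chart_render"],
    registered={"web_search", "web_fetch", "write_file"},
)
```

Here only `web_search` survives: `write_file` is removed as high-risk and flips `requires_high_risk_review`, and `chart_render` is removed as unknown.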

 ## Runtime and Evidence Semantics

 `DelegationEnvelope` receives node contracts, allowed tools, evidence requirements, and per-node tool budget. `LocalAgentRunner` passes the allowed tools and budget into the current `AgentLoop`, builds existing `RunEvidence`, and classifies completion.

-Evidence requirements have deterministic meanings in v1:
+`required_evidence` in v1 is a coarse node-level completion gate, not a field-level evidence contract. It can show that a node produced at least one URL or tool result; it cannot prove that every required company, reporting period, metric, and source is present. `evidence_contract: dict[str, Any]` is reserved for a later field-level contract and is not interpreted in v1.
+
+The coarse requirements have deterministic meanings in v1:

 - `tool_result`: at least one successful tool result;
 - `url`: at least one tool result with a URL;
 - `output`: non-empty node output;
 - any other declared value: explicit evidence gap.

-The scheduler keeps sequence/parallel/DAG semantics. Dependencies only receive succeeded upstream results. It does not retry, recursively expand Skills, or create another Team graph.
+The scheduler keeps sequence/parallel/DAG semantics. Dependencies never run after an upstream `failed` or `blocked` result. A `partial` upstream result is passed onward as partial evidence by default; a node can opt into blocking it with `block_downstream_on_partial=True`. The scheduler does not retry, recursively expand Skills, or create another Team graph.

 Before final synthesis, `TaskAttemptOrchestrator` derives a task outcome:

 - `complete`: every required-for-completion node succeeded;
-- `incomplete`: any required node is partial, failed, or blocked;
+- `incomplete`: any required node is partial, failed, or blocked, even if downstream synthesis produced a useful partial report;
 - `single`: no Team graph ran.

 Team synthesis continues to run with no tools. For `incomplete`, the synthesis context lists completed work, node failures, evidence gaps, and the deterministic task outcome. The returned user-facing answer is prefixed with an incomplete notice if the model omits it, so runtime—not prompt compliance alone—prevents a false completion claim.
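
Reviewer sketch (not part of this commit): the coarse evidence gate, downstream blocking, and task-outcome rules from this hunk, written as plain functions. `ToolResult` and all function names are hypothetical stand-ins, and treating a non-`"stop"` finish reason as `failed` is an assumption the spec does not state.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ToolResult:
    """Hypothetical stand-in for one recorded tool call."""

    success: bool
    url: str | None = None


def evidence_gaps(required: list[str], results: list[ToolResult], output: str) -> list[str]:
    """Return the declared requirements that are not satisfied."""
    checks = {
        "tool_result": any(r.success for r in results),
        "url": any(r.url for r in results),
        "output": bool(output.strip()),
    }
    # Any value outside the three known names is always reported as a gap.
    return [name for name in required if not checks.get(name, False)]


def classify(finish_reason: str, required: list[str],
             results: list[ToolResult], output: str) -> tuple[str, list[str]]:
    """Return (completion_status, evidence_gaps) for a finished node run."""
    if finish_reason != "stop":
        return "failed", []  # assumption: the spec only defines the "stop" case
    gaps = evidence_gaps(required, results, output)
    return ("partial" if gaps else "succeeded"), gaps


def blocks_downstream(status: str, block_downstream_on_partial: bool) -> bool:
    """failed/blocked always block; partial blocks only when opted in."""
    if status in {"failed", "blocked"}:
        return True
    return status == "partial" and block_downstream_on_partial


def task_outcome(required_statuses: list[str] | None) -> str:
    """None means no Team graph ran; otherwise every required node must succeed."""
    if required_statuses is None:
        return "single"
    if all(status == "succeeded" for status in required_statuses):
        return "complete"
    return "incomplete"


results = [ToolResult(success=True)]
status, gaps = classify("stop", ["tool_result", "url"], results, "draft report")
```

In this example the node ends as `partial` with a `url` gap: its output still flows downstream by default, but `task_outcome` reports `incomplete` if the node is required for completion.
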
@@ -166,7 +172,7 @@ Existing task events receive the adaptation report, resolved tools, policy remov
 Compatibility guarantees:

 - Skills without templates activate and execute unchanged.
-- Existing direct `ExecutionGraph` callers work because new fields have defaults.
+- Existing direct `ExecutionGraph` callers work because new fields have compatibility defaults; specifically, `allowed_tool_names=None` does not enable node-level scope and empty `required_evidence` does not enable evidence gating.
 - Single-agent runs do not receive node tool policies or outcome prefixes.
 - Existing external registry descriptors are not removed; planner-created Team nodes stay generic and role-empty.
 - `TaskSkillResolver` remains the per-node published-Skill/ephemeral-guidance fallback.