steven_li 53b13e8eac feat(tasks): add skill-templated task graph execution

2026-06-23 10:22:58 +08:00

Skill-Templated Task Graph Design

Status

Approved for implementation planning on 2026-06-22. This document records the design only; it does not change runtime code.

Decision

Beaver Agent Team remains a temporary, task-oriented execution graph. It is not a collection of persistent specialist roles. The implementation extends the existing ExecutionGraph, ExecutionNode, LocalAgentRunner, and TeamGraphScheduler; it does not add a parallel Team model or fixed Researcher/Writer/Reviewer classes.

Task + activated Skills + runtime tool policy
  -> Planner adapts an optional Skill template
  -> validated ExecutionGraph
  -> generic workers execute nodes under node constraints
  -> evidence-aware completion gate
  -> tool-free final synthesis from node evidence

Existing Baseline

TaskAttemptOrchestrator currently preselects Skills, invokes TaskExecutionPlanner, executes an optional graph through TeamService, and runs the main agent as final synthesis. ExecutionGraph already validates sequence, parallel, and DAG dependencies. Each node is a generic LocalAgentRunner invocation of the shared AgentLoop; planner-created nodes have an empty role.

RunEvidence and ToolEvidence already capture transcripts and tool results. The gap is semantic: a node is currently successful when its finish_reason is stop, even if its task contract requires evidence and none was produced.

Skills currently have simple Markdown frontmatter plus body text and optional tool hints. The catalog parser deliberately has no general YAML dependency. Tool assembly currently selects always-on tools, Skill hints, and semantic-retrieval matches; it is not an execution-time node allowlist. Skill safety checks protect draft publication, not task execution.

Scope

In scope:

optional Skill planning templates;
adaptive minimal graph planning and planner repair;
node contracts, node tool scopes, evidence requirements, and completion states;
deterministic handling of unknown and high-risk tool hints;
grounded synthesis status and audit events;
unit and integration coverage.

Out of scope:

persistent role Agent classes or a role marketplace;
a second graph/model hierarchy beside ExecutionGraph and ExecutionNode;
recursive or unlimited nested teams;
a distributed worker system;
a high-risk approval UI or new approval API;
chart-image rendering.

The current runtime registers web_search and web_fetch but no chart renderer. The finance acceptance case may produce an evidence-backed comparison table, chart-ready data, Mermaid chart, Markdown chart section, text-bar-chart fallback, and final textual report. It must not claim that an image/file chart artifact was generated unless a registered chart-renderer tool exists and passes runtime safety policy.

Data Model Evolution

ExecutionGraph remains the graph model. ExecutionNode gains optional fields, each with a backward-compatible default:

input_contract: dict[str, object] = field(default_factory=dict)
output_contract: dict[str, object] = field(default_factory=dict)
allowed_tool_names: list[str] | None = None
required_evidence: list[str] = field(default_factory=list)
evidence_contract: dict[str, Any] = field(default_factory=dict)
validation_rules: list[str] = field(default_factory=list)
required_for_completion: bool = True
block_downstream_on_partial: bool = False
max_tool_iterations: int | None = None

allowed_tool_names has three non-overlapping meanings: None means node-level tool scope is disabled and retains legacy tool selection; [] explicitly prohibits every tool for this node; a populated list permits only those registered, policy-allowed tools. Existing callers retain behavior because the default is None.
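The three meanings can be sketched as a small resolution function. This is an illustration only: `resolve_node_tools`, `legacy_selection`, and `permitted` are hypothetical names, not part of the existing codebase.

```python
from __future__ import annotations


def resolve_node_tools(
    allowed_tool_names: list[str] | None,
    legacy_selection: list[str],
    permitted: set[str],
) -> list[str]:
    """Return the tool names a node may see (hypothetical helper).

    `permitted` stands in for "registered and policy-allowed".
    """
    if allowed_tool_names is None:
        # Node-level scope is disabled: keep the legacy selection untouched.
        return list(legacy_selection)
    # An empty list falls through to an empty result: every tool is prohibited.
    # A populated list keeps only names that are also permitted.
    return [name for name in allowed_tool_names if name in permitted]
```

Keeping `None` and `[]` distinct is what lets existing callers keep their behavior while new callers can still express "no tools at all".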

NodeRunResult remains the node-output container. It gains completion_status (succeeded, partial, failed, or blocked) and evidence_gaps. success remains a compatibility field. Nodes without required_evidence retain the current finish_reason == "stop" success behavior. For a node that declares evidence requirements, a completed run with missing required evidence becomes partial and has success=False.

failed and blocked always block dependent nodes. partial does not imply successful completion, but its output and evidence remain consumable by downstream nodes unless block_downstream_on_partial=True. Any required-for-completion node that is partial still forces the final task outcome to incomplete.
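A minimal sketch of the downstream-blocking rule follows. The function name is hypothetical, and the flag is passed in as a plain argument because this sketch does not assume which node's field the scheduler reads.

```python
def blocks_downstream(
    completion_status: str,
    block_downstream_on_partial: bool = False,
) -> bool:
    """Decide whether dependents of a finished node must be skipped."""
    if completion_status in ("failed", "blocked"):
        # These two states always block dependent nodes.
        return True
    if completion_status == "partial":
        # Partial output stays consumable unless blocking was requested.
        return block_downstream_on_partial
    # "succeeded" never blocks.
    return False
```

Note that this rule only governs scheduling. A partial required-for-completion node still forces the final task outcome to incomplete even when its dependents run.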

TaskExecutionPlan gains a planner-adaptation payload rather than a duplicate graph object. The payload records template source/version, whether it was used, added/removed/merged node ids, removed tool names, warnings, and fallback reason. It is written into the existing task_execution_planned event.

No database migration is required in v1: graph and reports are transient execution state, while task/session events already persist plan metadata.

Skill Template Format

An optional beaver-team-template JSON fenced block is added to the Skill body:

## Team Planning Template

```beaver-team-template
{
  "version": 1,
  "team_when": ["multiple official sources require comparison"],
  "default_strategy": "dag",
  "nodes": [
    {
      "node_id": "collect_official_sources",
      "task": "Collect official primary sources for the requested entities.",
      "allowed_tools": ["web_search", "web_fetch"],
      "required_evidence": ["tool_result"],
      "required_for_completion": true
    }
  ]
}
```

The parser uses json.loads only. It returns an absent-template result when the block is missing and a warning result when the JSON is malformed, the block is duplicated, or the structure is invalid. A parsing warning never prevents normal Skill activation. Existing SKILL.md files remain valid without migration.

The template is an LLM input, not an executable workflow. It supplies candidate nodes and constraints. The Planner may remove unnecessary nodes, merge trivial nodes, add essential validation, or choose single mode. It may not add an unregistered tool, bypass node tool policy, exceed graph limits, or convert a node into a role Agent.

Planner Design

TaskAttemptOrchestrator passes activated SkillContext objects to the planner rather than only truncated summaries. v1 supports one primary applicable Skill Team Template; other activated Skills remain ordinary guidance. Template composition, sub-skill guidance composition, and multi-Skill planning are explicitly deferred rather than prohibited long-term.

Planner output uses a task-only JSON schema. It contains mode, reason, strategy, nodes, final_synthesis_instruction, and adaptation. Nodes contain task, dependencies, contracts, requested tools, evidence requirements, validation rules, and completion importance. agent and role are not accepted as planner schema fields; the adapter creates the existing empty-role AgentDescriptor itself.

Validation is layered:

extract a JSON object from the LLM response;
validate scalar/list/object shapes and allowed keys;
resolve requested tools against registry and policy;
construct ExecutionGraph and validate node count, depth, dependencies, and cycles;
if invalid, make one no-tools repair request containing the validation errors;
if repair fails, use the existing safe single-mode fallback.
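The last two steps (one repair attempt, then fallback) can be sketched with injected callables. `plan_with_repair`, its parameters, and the three status labels are hypothetical; the real planner and validators are not shown in this document.

```python
from __future__ import annotations

from typing import Callable


def plan_with_repair(
    ask_planner: Callable[[str], str],
    validate: Callable[[str], list[str]],
    prompt: str,
) -> tuple[str, str | None]:
    """Return (status, raw_plan); status is a hypothetical label."""
    raw = ask_planner(prompt)
    errors = validate(raw)
    if not errors:
        return "planned", raw
    # Exactly one repair request, carrying the validation errors.
    repair_prompt = prompt + "\nFix these validation errors:\n" + "\n".join(errors)
    repaired = ask_planner(repair_prompt)
    if not validate(repaired):
        return "repaired", repaired
    # Repair failed: the caller uses the safe single-mode fallback.
    return "single_fallback", None
```

Bounding repair to a single attempt keeps planning cost predictable and guarantees the orchestrator always ends with either a valid graph or single mode.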

Single mode remains the default for an obvious one-step request. Template presence is a reason to ask the planner, not a reason to force team mode. Existing environment disablement (BEAVER_AGENT_TEAM_ENABLED) remains authoritative.

Tool Policy and Safety

For a Team node, the final allowlist is:

template/node requested names
∩ registered tools
∩ node runtime policy

Skill hints are suggestions, not authority. The current code has no populated task-time user/workspace permission model, so v1 must not claim that it enforces one. v1 uses a conservative interim tool-risk policy, not a complete task-time permission system. Until ToolSpec.metadata has stable fields such as risk_level, mutating, external_side_effect, requires_approval, and readonly, the interim policy uses a conservative name-based high-risk set such as terminal, execute_command, write_file, delete_file, external_send, and send_email.

unknown names are removed and reported as planner warnings;
read-only tools may remain available when the node requests them;
high-risk/mutating names are removed by default and recorded as requires_high_risk_review;
no node receives a broad tool set merely because a Skill hinted it.
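As an illustration of the intersection and the removal rules, the sketch below uses the name-based high-risk set listed above. `resolve_allowlist` is a hypothetical helper, and treating every registered, non-high-risk name as allowed is a simplifying assumption standing in for the node runtime policy.

```python
from __future__ import annotations

# Interim name-based high-risk set, taken from the examples in this section.
HIGH_RISK_NAMES = frozenset(
    {"terminal", "execute_command", "write_file", "delete_file",
     "external_send", "send_email"}
)


def resolve_allowlist(
    requested: list[str],
    registered: set[str],
) -> tuple[list[str], list[str], list[str]]:
    """Split requested names into (allowed, unknown, high_risk)."""
    allowed: list[str] = []
    unknown: list[str] = []
    high_risk: list[str] = []
    for name in requested:
        if name not in registered:
            unknown.append(name)    # reported as a planner warning
        elif name in HIGH_RISK_NAMES:
            high_risk.append(name)  # recorded as requires_high_risk_review
        elif name not in allowed:
            allowed.append(name)    # keep request order, drop duplicates
    return allowed, unknown, high_risk
```

Because the result is built only from requested names, a node can never end up with a tool it did not ask for.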

Provider schemas are filtered to the allowlist, and ToolExecutor performs a second allowlist check through ToolContext.metadata. This prevents a model-originated call to a registered but unexposed tool from executing.
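A sketch of the second, execution-time check is shown below. The `allowed_tool_names` metadata key, the `ToolNotExposedError` exception, and the dictionary-based registry are assumptions made for illustration; the actual ToolExecutor and ToolContext interfaces are not described in this document.

```python
from __future__ import annotations

from typing import Any, Callable


class ToolNotExposedError(Exception):
    """Raised when a call targets a tool outside the node allowlist."""


def execute_tool(
    name: str,
    arguments: dict[str, Any],
    registry: dict[str, Callable[..., Any]],
    metadata: dict[str, Any],
) -> Any:
    allowed = metadata.get("allowed_tool_names")  # hypothetical metadata key
    if allowed is not None and name not in allowed:
        # Registered but unexposed: refuse even though the tool exists.
        raise ToolNotExposedError(name)
    return registry[name](**arguments)
```

Filtering schemas alone depends on the model only calling what it was shown; the executor-side check removes that dependency.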

A real high-risk runtime approval flow requires a task lifecycle state and UI/API confirmation. It is out of scope; v1 removes high-risk names, records requires_high_risk_review, and explains the limitation rather than auto-approving.

Runtime and Evidence Semantics

DelegationEnvelope receives node contracts, allowed tools, evidence requirements, and per-node tool budget. LocalAgentRunner passes the allowed tools and budget into the current AgentLoop, builds existing RunEvidence, and classifies completion.

required_evidence in v1 is a coarse node-level completion gate, not a field-level evidence contract. It can show that a node produced at least one URL or tool result; it cannot prove that every required company, reporting period, metric, and source is present. evidence_contract: dict[str, Any] is reserved for a later field-level contract and is not interpreted in v1.

The coarse requirements have deterministic meanings in v1:

tool_result: at least one successful tool result;
url: at least one tool result with a URL;
output: non-empty node output;
any other declared value: explicit evidence gap.
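The coarse gate above could be evaluated along these lines. `ToolResult` is a hypothetical stand-in for ToolEvidence, and mapping a non-`stop` finish reason to `failed` is an assumption of this sketch; the `blocked` state is produced by the scheduler and is not modeled here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ToolResult:
    """Hypothetical stand-in for a ToolEvidence entry."""

    ok: bool
    url: str | None = None


def evidence_gaps(required: list[str], results: list[ToolResult], output: str) -> list[str]:
    gaps = []
    for requirement in required:
        if requirement == "tool_result":
            met = any(r.ok for r in results)
        elif requirement == "url":
            met = any(r.url for r in results)
        elif requirement == "output":
            met = bool(output.strip())
        else:
            met = False  # any other declared value is an explicit gap
        if not met:
            gaps.append(requirement)
    return gaps


def classify(finish_reason: str, required: list[str],
             results: list[ToolResult], output: str) -> tuple[str, list[str]]:
    if finish_reason != "stop":
        return "failed", []  # assumption: non-stop runs count as failed
    gaps = evidence_gaps(required, results, output)
    return ("partial" if gaps else "succeeded"), gaps
```

With an empty `required` list the function reduces to the legacy rule: `stop` means success.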

The scheduler keeps sequence/parallel/DAG semantics. Dependent nodes never run after an upstream failed or blocked result. A partial upstream result is passed onward as partial evidence by default; a node can opt into blocking it with block_downstream_on_partial=True. The scheduler does not retry, recursively expand Skills, or create another Team graph.

Before final synthesis, TaskAttemptOrchestrator derives a task outcome:

complete: every required-for-completion node succeeded;
incomplete: any required node is partial, failed, or blocked, even if downstream synthesis produced a useful partial report;
single: no Team graph ran.
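The three outcomes can be derived with a short function. The input shape used here, a list of `(required_for_completion, completion_status)` pairs or `None` when no graph ran, is a hypothetical simplification of NodeRunResult.

```python
from __future__ import annotations


def derive_task_outcome(nodes: list[tuple[bool, str]] | None) -> str:
    """Derive the task outcome from per-node completion states."""
    if nodes is None:
        return "single"  # no Team graph ran
    for required_for_completion, completion_status in nodes:
        if required_for_completion and completion_status != "succeeded":
            # partial, failed, or blocked on a required node
            return "incomplete"
    return "complete"
```

Optional nodes never affect the outcome, so a failed best-effort node does not downgrade an otherwise complete task.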

Team synthesis continues to run with no tools. For incomplete, the synthesis context lists completed work, node failures, evidence gaps, and the deterministic task outcome. The returned user-facing answer is prefixed with an incomplete notice if the model omits it, so the runtime, not prompt compliance alone, prevents a false completion claim.
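A minimal sketch of the runtime-side prefix guard follows. The notice wording and the function name are hypothetical; the design does not fix the exact text.

```python
INCOMPLETE_NOTICE = "[Incomplete] "  # hypothetical wording


def enforce_incomplete_notice(answer: str, outcome: str) -> str:
    """Prefix the answer when the task is incomplete and the model omitted the notice."""
    if outcome != "incomplete":
        # complete and single outcomes are returned unchanged
        return answer
    if answer.lstrip().startswith(INCOMPLETE_NOTICE.strip()):
        return answer  # the model already included the notice
    return INCOMPLETE_NOTICE + answer
```

The check is deterministic and idempotent, so applying it twice never stacks notices.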

Nested Teams, Observability, and Compatibility

Nested graph execution is deferred. A node can resolve another Skill as guidance through the existing resolver, but cannot create a child ExecutionGraph. The runtime has no recursive budget ledger, tree-shaped evidence model, or UI fault navigator. Limiting v1 to a single graph keeps node failures attributable and cost bounded by existing node, parallel, and tool limits.

Existing task events receive the adaptation report, resolved tools, policy removals, completion status, and evidence gaps. Existing Skill learning/replay remains unchanged in v1. Template-specific scoring waits until execution semantics are stable.

Compatibility guarantees:

Skills without templates activate and execute unchanged.
Existing direct ExecutionGraph callers work because new fields have compatibility defaults; specifically, allowed_tool_names=None does not enable node-level scope and empty required_evidence does not enable evidence gating.
Single-agent runs do not receive node tool policies or outcome prefixes.
Existing external registry descriptors are not removed; planner-created Team nodes stay generic and role-empty.
TaskSkillResolver remains the per-node published-Skill/ephemeral-guidance fallback.

Verification Criteria

Implementation is accepted when tests prove that old Skills and single-agent tasks retain behavior; templates parse and degrade with warnings; the planner emits only task-oriented generic workers; malformed output is repaired once or falls back; unknown/high-risk tools cannot execute; declared evidence controls node success; required-node failure forces incomplete synthesis; and the MGM/Galaxy finance acceptance case uses read-only web tools to produce evidence-backed comparison/report output without claiming a chart renderer.
