steven_li 6843d89b2c docs: plan skill-templated task graphs

2026-06-22 11:51:48 +08:00

10 KiB

Skill-Templated Task Graph Design

Status

Approved for implementation planning on 2026-06-22. This document records the design only; it does not change runtime code.

Decision

Beaver Agent Team remains a temporary, task-oriented execution graph. It is not a collection of persistent specialist roles. The implementation extends the existing ExecutionGraph, ExecutionNode, LocalAgentRunner, and TeamGraphScheduler; it does not add a parallel Team model or fixed Researcher/Writer/Reviewer classes.

Task + activated Skills + runtime tool policy
  -> Planner adapts an optional Skill template
  -> validated ExecutionGraph
  -> generic workers execute nodes under node constraints
  -> evidence-aware completion gate
  -> tool-free final synthesis from node evidence

Existing Baseline

TaskAttemptOrchestrator currently preselects Skills, invokes TaskExecutionPlanner, executes an optional graph through TeamService, and runs the main agent as final synthesis. ExecutionGraph already validates sequence, parallel, and DAG dependencies. Each node is a generic LocalAgentRunner invocation of the shared AgentLoop; planner-created nodes have an empty role.

RunEvidence and ToolEvidence already capture transcripts and tool results. The gap is semantic: a node is currently successful when its finish_reason is stop, even if its task contract requires evidence and none was produced.

Skills currently have simple Markdown frontmatter plus body text and optional tool hints. The catalog parser deliberately has no general YAML dependency. Tool assembly currently selects always-on tools, Skill hints, and semantic-retrieval matches; it is not an execution-time node allowlist. Skill safety checks protect draft publication, not task execution.

Scope

In scope:

optional Skill planning templates;
adaptive minimal graph planning and planner repair;
node contracts, node tool scopes, evidence requirements, and completion states;
deterministic handling of unknown and high-risk tool hints;
grounded synthesis status and audit events;
unit and integration coverage.

Out of scope:

persistent role Agent classes or a role marketplace;
a second graph/model hierarchy beside ExecutionGraph and ExecutionNode;
recursive or unlimited nested teams;
a distributed worker system;
a high-risk approval UI or new approval API;
chart-image rendering.

The current runtime registers web_search and web_fetch but no chart renderer. The finance acceptance case therefore produces evidence-backed comparison data and a textual/Markdown report, not a fabricated chart artifact.

Data Model Evolution

ExecutionGraph remains the graph model. ExecutionNode gains the following optional fields, each with a default:

input_contract: dict[str, object] = field(default_factory=dict)
output_contract: dict[str, object] = field(default_factory=dict)
allowed_tool_names: list[str] = field(default_factory=list)
required_evidence: list[str] = field(default_factory=list)
validation_rules: list[str] = field(default_factory=list)
required_for_completion: bool = True
max_tool_iterations: int | None = None

Existing callers retain their behavior because empty lists and None impose no new node requirement.
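
The compatibility claim above can be illustrated with a minimal sketch, assuming ExecutionNode is a dataclass. The node_id and task fields and the has_node_constraints helper are hypothetical stand-ins for illustration, not the real class definition.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ExecutionNode:
    # Hypothetical pre-existing fields; the real class defines its own.
    node_id: str
    task: str
    # New optional fields from this design, all defaulted.
    input_contract: dict[str, object] = field(default_factory=dict)
    output_contract: dict[str, object] = field(default_factory=dict)
    allowed_tool_names: list[str] = field(default_factory=list)
    required_evidence: list[str] = field(default_factory=list)
    validation_rules: list[str] = field(default_factory=list)
    required_for_completion: bool = True
    max_tool_iterations: int | None = None


def has_node_constraints(node: ExecutionNode) -> bool:
    """Illustrative helper: True only when a new constraint was set."""
    return bool(
        node.allowed_tool_names
        or node.required_evidence
        or node.validation_rules
        or node.max_tool_iterations is not None
    )


# An existing caller that knows nothing about the new fields.
legacy = ExecutionNode(node_id="n1", task="Summarize the report.")

# A planner-created node that opts into tool and evidence constraints.
constrained = ExecutionNode(
    node_id="n2",
    task="Collect official sources.",
    allowed_tool_names=["web_search", "web_fetch"],
    required_evidence=["tool_result"],
)
```

Under this sketch, an empty allowed_tool_names list means "no node-level restriction was declared", which is how the legacy node keeps its current behavior.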

NodeRunResult remains the node-output container. It gains completion_status (succeeded, partial, failed, or blocked) and evidence_gaps. success remains for scheduler compatibility and is true only for succeeded. A completed run with missing required evidence is therefore partial, and downstream dependencies block exactly as they do for failed nodes.
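
One way to keep success consistent with completion_status is to derive it rather than store it. The sketch below assumes that approach; the CompletionStatus enum, the node_id and output fields, and the use of a property are illustrative choices, not the existing NodeRunResult definition.

```python
from dataclasses import dataclass, field
from enum import Enum


class CompletionStatus(str, Enum):
    SUCCEEDED = "succeeded"
    PARTIAL = "partial"
    FAILED = "failed"
    BLOCKED = "blocked"


@dataclass
class NodeRunResult:
    node_id: str  # hypothetical existing field
    output: str = ""  # hypothetical existing field
    completion_status: CompletionStatus = CompletionStatus.SUCCEEDED
    evidence_gaps: list[str] = field(default_factory=list)

    @property
    def success(self) -> bool:
        # Kept for scheduler compatibility: true only for "succeeded".
        return self.completion_status is CompletionStatus.SUCCEEDED


# A run that stopped normally but produced no required tool evidence.
partial_run = NodeRunResult(
    node_id="collect_official_sources",
    output="I could not reach the sources.",
    completion_status=CompletionStatus.PARTIAL,
    evidence_gaps=["tool_result"],
)
```

Because success is False for partial_run, a scheduler that only inspects success treats it like a failed node and blocks its dependents.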

TaskExecutionPlan gains a planner-adaptation payload rather than a duplicate graph object. The payload records template source/version, whether it was used, added/removed/merged node ids, removed tool names, warnings, and fallback reason. It is written into the existing task_execution_planned event.
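
As a rough illustration of the payload shape, the sketch below models the listed items as a dataclass that serializes into an event dictionary. All field names and the to_event_payload helper are assumptions; the design fixes only what the payload records, not how it is spelled.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, field


@dataclass
class PlannerAdaptation:
    # Hypothetical field names covering the items listed above.
    template_source: str | None = None
    template_version: int | None = None
    template_used: bool = False
    added_node_ids: list[str] = field(default_factory=list)
    removed_node_ids: list[str] = field(default_factory=list)
    merged_node_ids: list[str] = field(default_factory=list)
    removed_tool_names: list[str] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)
    fallback_reason: str | None = None


def to_event_payload(adaptation: PlannerAdaptation) -> dict[str, object]:
    """Return a plain dict suitable for embedding in an existing event."""
    return {"adaptation": asdict(adaptation)}


adaptation = PlannerAdaptation(
    template_source="finance-comparison",  # hypothetical Skill name
    template_version=1,
    template_used=True,
    removed_node_ids=["render_chart"],
    removed_tool_names=["chart_render"],
    warnings=["unknown tool removed: chart_render"],
)
payload = to_event_payload(adaptation)
```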

No database migration is required in v1: graph and reports are transient execution state, while task/session events already persist plan metadata.

Skill Template Format

A Skill body may contain one optional fenced JSON block tagged beaver-team-template:

## Team Planning Template

```beaver-team-template
{
  "version": 1,
  "team_when": ["multiple official sources require comparison"],
  "default_strategy": "dag",
  "nodes": [
    {
      "node_id": "collect_official_sources",
      "task": "Collect official primary sources for the requested entities.",
      "allowed_tools": ["web_search", "web_fetch"],
      "required_evidence": ["tool_result"],
      "required_for_completion": true
    }
  ]
}
```

The parser uses json.loads only. It returns an absent-template result when the block is missing and a warning result when JSON is malformed, duplicated, or structurally invalid. A parsing warning never prevents normal Skill activation. Existing SKILL.md files remain valid without migration.
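
The parsing rules above can be sketched as follows. This is a minimal illustration under stated assumptions: the TemplateParseResult type, the parse_team_template name, the regular expression, and the structural check on "nodes" are hypothetical, and only json.loads is used for decoding.

```python
from __future__ import annotations

import json
import re
from dataclasses import dataclass, field

_TICKS = "`" * 3  # built at runtime to keep a literal fence out of this source
_FENCE = re.compile(
    rf"^{_TICKS}beaver-team-template[ \t]*\n(.*?)^{_TICKS}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


@dataclass
class TemplateParseResult:
    template: dict | None = None
    warnings: list[str] = field(default_factory=list)


def parse_team_template(skill_body: str) -> TemplateParseResult:
    """Extract the optional template; never raise on bad input."""
    blocks = _FENCE.findall(skill_body)
    if not blocks:
        return TemplateParseResult()  # absent template, no warning
    if len(blocks) > 1:
        return TemplateParseResult(warnings=["duplicate template blocks"])
    try:
        data = json.loads(blocks[0])
    except json.JSONDecodeError as exc:
        return TemplateParseResult(warnings=[f"malformed JSON: {exc.msg}"])
    if not isinstance(data, dict) or not isinstance(data.get("nodes"), list):
        return TemplateParseResult(
            warnings=["template must be an object with a 'nodes' list"]
        )
    return TemplateParseResult(template=data)
```

Every failure path returns a result object rather than raising, so a caller can log the warning and continue with normal Skill activation.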

The template is an LLM input, not an executable workflow. It supplies candidate nodes and constraints. The Planner may remove unnecessary nodes, merge trivial nodes, add essential validation, or choose single mode. It may not add an unregistered tool, bypass node tool policy, exceed graph limits, or convert a node into a role Agent.

Planner Design

TaskAttemptOrchestrator passes activated SkillContext objects to the planner rather than only truncated summaries. The planner chooses at most one applicable template for the first implementation; multiple activated Skills remain ordinary guidance. This avoids composing incompatible templates before there is evidence for a composition model.

Planner output uses a task-only JSON schema. It contains mode, reason, strategy, nodes, final_synthesis_instruction, and adaptation. Nodes contain task, dependencies, contracts, requested tools, evidence requirements, validation rules, and completion importance. agent and role are not accepted as planner schema fields; the adapter creates the existing empty-role AgentDescriptor itself.
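
A shape check for the task-only schema might look like the sketch below. The top-level keys come from the paragraph above; the node key names (for example requested_tools) and the validate_plan_shape function are assumptions made for illustration.

```python
ALLOWED_PLAN_KEYS = {
    "mode",
    "reason",
    "strategy",
    "nodes",
    "final_synthesis_instruction",
    "adaptation",
}
# Hypothetical node key names; the design lists the concepts only.
ALLOWED_NODE_KEYS = {
    "node_id",
    "task",
    "dependencies",
    "input_contract",
    "output_contract",
    "requested_tools",
    "required_evidence",
    "validation_rules",
    "required_for_completion",
}


def validate_plan_shape(plan: object) -> list[str]:
    """Return a list of validation errors; empty means the shape is valid."""
    if not isinstance(plan, dict):
        return ["plan must be a JSON object"]
    errors = [
        f"unexpected plan key '{key}'"
        for key in sorted(set(plan) - ALLOWED_PLAN_KEYS)
    ]
    nodes = plan.get("nodes", [])
    if not isinstance(nodes, list):
        return errors + ["'nodes' must be a list"]
    for index, node in enumerate(nodes):
        if not isinstance(node, dict):
            errors.append(f"nodes[{index}] must be an object")
            continue
        # "agent" and "role" are rejected here like any other unknown key.
        for key in sorted(set(node) - ALLOWED_NODE_KEYS):
            errors.append(f"nodes[{index}]: unexpected key '{key}'")
        task = node.get("task")
        if not isinstance(task, str) or not task.strip():
            errors.append(f"nodes[{index}]: 'task' must be a non-empty string")
    return errors
```

Rejecting unknown keys, rather than ignoring them, gives the repair request a concrete error to quote back to the model.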

Validation is layered:

extract a JSON object from the LLM response;
validate scalar/list/object shapes and allowed keys;
resolve requested tools against registry and policy;
construct ExecutionGraph and validate node count, depth, dependencies, and cycles;
if invalid, make one no-tools repair request containing the validation errors;
if repair fails, use the existing safe single-mode fallback.
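
The first and last steps of the list above can be sketched as a small control flow: extract a JSON object, validate, attempt exactly one repair, then fall back. The function names, the injected validate and request_repair callables, and the SAFE_SINGLE_PLAN contents are hypothetical; the real planner wires these to its own components.

```python
from __future__ import annotations

import json
from typing import Callable


def extract_json_object(text: str) -> dict | None:
    """Return the first decodable JSON object embedded in text, if any."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None


# Hypothetical stand-in for the existing safe single-mode fallback.
SAFE_SINGLE_PLAN = {"mode": "single", "reason": "planner output invalid", "nodes": []}


def plan_with_one_repair(
    response: str,
    validate: Callable[[dict], list[str]],
    request_repair: Callable[[list[str]], str],
) -> tuple[dict, str]:
    """Return (plan, source) where source is planned, repaired, or fallback."""
    plan = extract_json_object(response)
    errors = ["no JSON object found"] if plan is None else validate(plan)
    if not errors:
        return plan, "planned"
    # Exactly one repair attempt, carrying the validation errors.
    repaired = extract_json_object(request_repair(errors))
    if repaired is not None and not validate(repaired):
        return repaired, "repaired"
    return dict(SAFE_SINGLE_PLAN), "fallback"
```

Bounding repair to a single attempt keeps planning cost predictable, and the fallback guarantees the task still runs in single mode.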

Single mode remains the default for an obvious one-step request. Template presence is a reason to ask the planner, not a reason to force team mode. Existing environment disablement (BEAVER_AGENT_TEAM_ENABLED) remains authoritative.

Tool Policy and Safety

For a Team node, the final allowlist is:

template/node requested names
∩ registered tools
∩ node runtime policy

Skill hints are suggestions, not authority. The current code has no populated task-time user/workspace permission model, so v1 must not claim that it enforces one. It uses a conservative node runtime policy:

unknown names are removed and reported as planner warnings;
read-only tools may remain available when the node requests them;
high-risk/mutating names are removed by default and recorded as requires_high_risk_review;
no node receives a broad tool set merely because a Skill hinted it.
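
The intersection and the conservative policy above can be sketched as one resolution function. This is an illustration only: it assumes the node runtime policy is expressed as a set of read-only tool names, treats every other registered tool as high-risk, and uses file_write as a hypothetical mutating tool.

```python
from dataclasses import dataclass, field


@dataclass
class ToolResolution:
    allowed: list[str] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)
    requires_high_risk_review: list[str] = field(default_factory=list)


def resolve_node_tools(
    requested: list[str],
    registered: set[str],
    read_only: set[str],
) -> ToolResolution:
    """Intersect requested names with the registry and the node policy."""
    resolution = ToolResolution()
    # dict.fromkeys removes duplicates while preserving request order.
    for name in dict.fromkeys(requested):
        if name not in registered:
            resolution.warnings.append(f"unknown tool removed: {name}")
        elif name not in read_only:
            # Registered but not read-only: blocked by default in v1.
            resolution.requires_high_risk_review.append(name)
        else:
            resolution.allowed.append(name)
    return resolution


resolution = resolve_node_tools(
    requested=["web_search", "chart_render", "file_write", "web_search"],
    registered={"web_search", "web_fetch", "file_write"},
    read_only={"web_search", "web_fetch"},
)
```

Note that web_fetch is registered and read-only but absent from the result, because the node never requested it.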

Provider schemas are filtered to the allowlist, and ToolExecutor performs a second allowlist check through ToolContext.metadata. This prevents a model-originated call to a registered but unexposed tool from executing.
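
The second check can be pictured as a guard at the point of execution. The sketch below is hypothetical: the ToolContext shape, the allowed_tool_names metadata key, the ToolNotAllowedError type, and the registry of plain callables are assumptions, not the actual ToolExecutor interface.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToolContext:
    # Hypothetical shape: only the metadata mapping matters here.
    metadata: dict[str, object] = field(default_factory=dict)


class ToolNotAllowedError(Exception):
    """Raised when a call targets a tool outside the node allowlist."""


def execute_tool(
    name: str,
    args: dict[str, object],
    registry: dict[str, Callable[..., object]],
    context: ToolContext,
) -> object:
    allowlist = context.metadata.get("allowed_tool_names")
    # No allowlist in metadata means no node policy applies (single-agent run).
    if allowlist is not None and name not in allowlist:
        raise ToolNotAllowedError(f"tool '{name}' is not allowed for this node")
    return registry[name](**args)
```

Filtering provider schemas alone only hides a tool from the model; checking again at execution covers the case where the model emits a call for a tool it was never shown.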

A real high-risk approval flow requires a task lifecycle state and UI/API confirmation. It is deferred; v1 blocks and explains rather than auto-approving.

Runtime and Evidence Semantics

DelegationEnvelope receives node contracts, allowed tools, evidence requirements, and per-node tool budget. LocalAgentRunner passes the allowed tools and budget into the current AgentLoop, builds existing RunEvidence, and classifies completion.

Evidence requirements have deterministic meanings in v1:

tool_result: at least one successful tool result;
url: at least one tool result with a URL;
output: non-empty node output;
any other declared value: explicit evidence gap.
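
The four rules above translate directly into a small classifier. In this sketch the ToolEvidence fields, the function names, and the mapping of a non-stop finish_reason to failed are assumptions; the design states only that a completed run with missing required evidence is partial.

```python
from dataclasses import dataclass, field


@dataclass
class ToolEvidence:
    # Hypothetical minimal shape of a recorded tool result.
    tool_name: str
    success: bool
    urls: list[str] = field(default_factory=list)


def find_evidence_gaps(
    required: list[str],
    output: str,
    tool_results: list[ToolEvidence],
) -> list[str]:
    """Return the declared requirements that the run did not satisfy."""
    gaps: list[str] = []
    for requirement in required:
        if requirement == "tool_result":
            met = any(result.success for result in tool_results)
        elif requirement == "url":
            met = any(result.urls for result in tool_results)
        elif requirement == "output":
            met = bool(output.strip())
        else:
            met = False  # unknown requirement: always an explicit gap
        if not met:
            gaps.append(requirement)
    return gaps


def classify_completion(finish_reason: str, gaps: list[str]) -> str:
    # Assumption: any finish_reason other than "stop" counts as failed.
    if finish_reason != "stop":
        return "failed"
    return "partial" if gaps else "succeeded"
```

Treating an unrecognized requirement as a gap keeps the check deterministic: a template cannot declare evidence that the runtime silently accepts without knowing how to verify it.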

The scheduler keeps sequence/parallel/DAG semantics. Downstream nodes receive results only from upstream nodes that succeeded. The scheduler does not retry, recursively expand Skills, or create another Team graph.
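
The blocking behavior can be illustrated with a simplified sequential walk over nodes that are already in dependency order. This is not TeamGraphScheduler; the run_graph_in_order function and the run_node callable signature are hypothetical, and parallel execution is omitted.

```python
from typing import Callable


def run_graph_in_order(
    ordered_nodes: list[tuple[str, list[str]]],
    run_node: Callable[[str, dict[str, str]], tuple[str, str]],
) -> dict[str, str]:
    """Run nodes given as (node_id, dependency_ids) in topological order.

    run_node returns (completion_status, output). A node whose upstream
    did not succeed is marked blocked and is never executed.
    """
    statuses: dict[str, str] = {}
    outputs: dict[str, str] = {}
    for node_id, dependencies in ordered_nodes:
        if any(statuses.get(dep) != "succeeded" for dep in dependencies):
            statuses[node_id] = "blocked"
            continue
        # Only succeeded upstream outputs are handed to the node.
        upstream = {dep: outputs[dep] for dep in dependencies}
        status, output = run_node(node_id, upstream)
        statuses[node_id] = status
        if status == "succeeded":
            outputs[node_id] = output
    return statuses
```

Because blocked is itself a non-succeeded status, a single partial node blocks its whole downstream chain without any retry, while independent branches still run.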

Before final synthesis, TaskAttemptOrchestrator derives a task outcome:

complete: every required-for-completion node succeeded;
incomplete: any required node is partial, failed, or blocked;
single: no Team graph ran.
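
The three outcomes above reduce to a short deterministic function. The NodeOutcome type and the convention of passing None when no Team graph ran are assumptions made for this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class NodeOutcome:
    # Hypothetical summary of one node's result.
    completion_status: str  # succeeded, partial, failed, or blocked
    required_for_completion: bool = True


def derive_task_outcome(node_outcomes: list[NodeOutcome] | None) -> str:
    """Return complete, incomplete, or single."""
    if node_outcomes is None:
        return "single"  # no Team graph ran
    required = [n for n in node_outcomes if n.required_for_completion]
    if all(n.completion_status == "succeeded" for n in required):
        return "complete"
    return "incomplete"
```

A failed node that is not required for completion does not change the outcome, which lets the planner include optional enrichment nodes without risking the whole task.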

Team synthesis continues to run with no tools. For incomplete, the synthesis context lists completed work, node failures, evidence gaps, and the deterministic task outcome. If the model omits an incomplete notice, the runtime prefixes one to the returned user-facing answer, so the runtime, not prompt compliance alone, prevents a false completion claim.
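
A runtime-side guard for the incomplete notice could be as small as the sketch below. The notice wording, the prefix-based detection, and the enforce_incomplete_notice name are hypothetical; the design requires only that the notice is present when the outcome is incomplete.

```python
# Hypothetical notice marker; the real wording is not fixed by this design.
INCOMPLETE_NOTICE = "Incomplete:"


def enforce_incomplete_notice(answer: str, task_outcome: str) -> str:
    """Prefix the answer with a notice when the outcome is incomplete."""
    if task_outcome != "incomplete":
        return answer  # complete and single answers pass through unchanged
    if answer.lstrip().lower().startswith(INCOMPLETE_NOTICE.lower()):
        return answer  # the model already included the notice
    return (
        f"{INCOMPLETE_NOTICE} some required steps did not finish "
        f"with the required evidence.\n\n{answer}"
    )
```

Single-agent answers are never modified, which matches the compatibility guarantee that single-agent runs do not receive outcome prefixes.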

Nested Teams, Observability, and Compatibility

Nested graph execution is deferred. A node can resolve another Skill as guidance through the existing resolver, but cannot create a child ExecutionGraph. The runtime has no recursive budget ledger, tree-shaped evidence model, or UI fault navigator. Limiting v1 to a single graph keeps node failures attributable and cost bounded by existing node, parallel, and tool limits.

Existing task events receive the adaptation report, resolved tools, policy removals, completion status, and evidence gaps. Existing Skill learning/replay remains unchanged in v1. Template-specific scoring waits until execution semantics are stable.

Compatibility guarantees:

Skills without templates activate and execute unchanged.
Existing direct ExecutionGraph callers work because new fields have defaults.
Single-agent runs do not receive node tool policies or outcome prefixes.
Existing external registry descriptors are not removed; planner-created Team nodes stay generic and role-empty.
TaskSkillResolver remains the per-node published-Skill/ephemeral-guidance fallback.

Verification Criteria

Implementation is accepted when tests prove that:

old Skills and single-agent tasks retain behavior;
templates parse and degrade with warnings;
planner emits only task-oriented generic workers;
malformed output is repaired once or falls back;
unknown/high-risk tools cannot execute;
declared evidence controls node success;
required-node failure forces incomplete synthesis;
the MGM/Galaxy case uses read-only web tools to produce evidence-backed comparison/report output without claiming a chart renderer.
