docs: plan skill-templated task graphs

2026-06-22 11:51:48 +08:00
parent 83d9d8c200
commit 6843d89b2c
2 changed files with 545 additions and 0 deletions

@ -0,0 +1,369 @@
# Skill-Templated Task Graph Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Extend the existing Beaver task-team runtime so optional Skill templates guide minimal, generic-worker execution graphs with node tool scopes and evidence completion gates.
**Architecture:** `ExecutionGraph`, `ExecutionNode`, and `LocalAgentRunner` remain the only Team runtime model. An optional JSON block in an activated Skill becomes planner input; the planner validates and adapts it into the existing graph; workers enforce graph-provided tool/evidence contracts; tool-free synthesis receives a deterministic complete/incomplete outcome.
**Tech Stack:** Python 3.12, dataclasses, asyncio, pytest, existing Beaver catalog, engine, coordinator, and task runtime.
---
## File Structure
- `beaver/skills/catalog/utils.py`: optional template parser.
- `beaver/skills/catalog/loader.py`, `engine/context/builder.py`, `skills/assembler/task_assembler.py`: carry parse output into `SkillContext`.
- `beaver/coordinator/models.py`: defaults-only extensions to current graph/node/result contracts.
- `beaver/tasks/planner.py`: template-aware task-only JSON schema, repair, graph validation, and adaptation event metadata.
- `beaver/engine/loader.py`: inject registry into planner.
- `beaver/engine/loop.py`, `tools/runtime/executor.py`, `coordinator/local.py`: node allowlist and budget enforcement.
- `beaver/tasks/evidence.py`, `coordinator/execution/scheduler.py`, `tasks/attempt_orchestrator.py`: evidence completion and incomplete synthesis gate.
### Task 1: Parse and Propagate Optional Skill Templates
**Files:**
- Modify: `app-instance/backend/beaver/skills/catalog/utils.py`
- Modify: `app-instance/backend/beaver/skills/catalog/loader.py`
- Modify: `app-instance/backend/beaver/engine/context/builder.py`
- Modify: `app-instance/backend/beaver/skills/assembler/task_assembler.py`
- Create: `app-instance/backend/tests/unit/test_skill_team_template.py`
- [ ] **Step 1: Write failing parser tests**
```python
from beaver.skills.catalog.utils import extract_skill_team_template


def test_extract_team_template_returns_none_when_block_is_absent() -> None:
    result = extract_skill_team_template("# Ordinary Skill")
    assert result.template is None
    assert result.warnings == []


def test_extract_team_template_parses_valid_json_block() -> None:
    # "\n" must be a real newline so the fenced block matches the parser's regex.
    result = extract_skill_team_template(
        "```beaver-team-template\n"
        '{"version": 1, "nodes": [{"node_id": "collect", "task": "Collect"}]}\n```'
    )
    assert result.template["nodes"][0]["node_id"] == "collect"


def test_invalid_template_is_warning_not_skill_load_failure() -> None:
    result = extract_skill_team_template("```beaver-team-template\nnot-json\n```")
    assert result.template is None
    assert result.warnings == ["team template JSON is invalid"]
```
- [ ] **Step 2: Run it to verify failure**
Run: `cd app-instance/backend && uv run pytest tests/unit/test_skill_team_template.py -q`
Expected: FAIL because `extract_skill_team_template` does not exist.
- [ ] **Step 3: Implement the parser and propagation**
```python
import json
import re
from dataclasses import dataclass, field
from typing import Any


@dataclass(slots=True)
class SkillTeamTemplateParseResult:
    template: dict[str, Any] | None = None
    warnings: list[str] = field(default_factory=list)


def extract_skill_team_template(body: str) -> SkillTeamTemplateParseResult:
    # Capture the body of every ```beaver-team-template fenced block.
    matches = re.findall(r"```beaver-team-template\s*\n(.*?)\n```", body, re.DOTALL)
    if not matches:
        return SkillTeamTemplateParseResult()
    if len(matches) != 1:
        return SkillTeamTemplateParseResult(warnings=["skill defines multiple team templates"])
    try:
        value = json.loads(matches[0])
    except json.JSONDecodeError:
        return SkillTeamTemplateParseResult(warnings=["team template JSON is invalid"])
    # Only the top-level shape is checked here; the planner validates node details.
    if not isinstance(value, dict) or not isinstance(value.get("nodes", []), list):
        return SkillTeamTemplateParseResult(warnings=["team template must be an object with a nodes list"])
    return SkillTeamTemplateParseResult(template=value)
```
Add defaults-only `team_template` and `team_template_warnings` fields to `SkillRecord` and `SkillContext`; populate them from stripped Skill body in loader/assembler. Keep the original Skill body available for normal prompt injection.
- [ ] **Step 4: Run parser and assembler regression tests**
Run: `cd app-instance/backend && uv run pytest tests/unit/test_skill_team_template.py tests/unit/test_skill_assembler.py -q`
Expected: PASS.
- [ ] **Step 5: Commit**
Run: `git add app-instance/backend/beaver/skills/catalog/utils.py app-instance/backend/beaver/skills/catalog/loader.py app-instance/backend/beaver/engine/context/builder.py app-instance/backend/beaver/skills/assembler/task_assembler.py app-instance/backend/tests/unit/test_skill_team_template.py && git commit -m "feat(skills): parse optional task graph templates"`
### Task 2: Extend Existing Graph Contracts
**Files:**
- Modify: `app-instance/backend/beaver/coordinator/models.py`
- Modify: `app-instance/backend/tests/unit/test_agent_team_v1.py`
- [ ] **Step 1: Write failing contract and depth tests**
```python
def test_execution_node_contracts_default_for_existing_callers() -> None:
    node = ExecutionNode("collect", "Collect", AgentDescriptor(name="collect"))
    assert node.allowed_tool_names == []
    assert node.required_evidence == []
    assert node.required_for_completion is True


def test_graph_rejects_depth_above_configured_limit() -> None:
    graph = ExecutionGraph(
        strategy="dag",
        nodes=[
            ExecutionNode("a", "A", AgentDescriptor(name="a")),
            ExecutionNode("b", "B", AgentDescriptor(name="b"), depends_on=["a"]),
            ExecutionNode("c", "C", AgentDescriptor(name="c"), depends_on=["b"]),
        ],
    )
    with pytest.raises(ValueError, match="max depth"):
        graph.validate(max_depth=2)
```
- [ ] **Step 2: Run it to verify failure**
Run: `cd app-instance/backend && uv run pytest tests/unit/test_agent_team_v1.py -q`
Expected: FAIL because fields and `max_depth` do not exist.
- [ ] **Step 3: Add only defaults to the current dataclasses**
```python
input_contract: dict[str, Any] = field(default_factory=dict)
output_contract: dict[str, Any] = field(default_factory=dict)
allowed_tool_names: list[str] = field(default_factory=list)
required_evidence: list[str] = field(default_factory=list)
validation_rules: list[str] = field(default_factory=list)
required_for_completion: bool = True
max_tool_iterations: int | None = None
```
Add the runtime-relevant values to `DelegationEnvelope`. Add `completion_status: str = "succeeded"` and `evidence_gaps: list[str] = field(default_factory=list)` to `NodeRunResult`; a bare `[]` default is not allowed on a dataclass field. Extend `ExecutionGraph.validate(max_depth: int | None = None)` to calculate the longest dependency chain with its existing DFS and raise only when an explicit limit is exceeded.
- [ ] **Step 4: Run the coordinator regression test**
Run: `cd app-instance/backend && uv run pytest tests/unit/test_agent_team_v1.py -q`
Expected: PASS.
- [ ] **Step 5: Commit**
Run: `git add app-instance/backend/beaver/coordinator/models.py app-instance/backend/tests/unit/test_agent_team_v1.py && git commit -m "feat(team): add optional node contracts"`
### Task 3: Adapt Templates Into Generic Task Graphs
**Files:**
- Modify: `app-instance/backend/beaver/tasks/planner.py`
- Modify: `app-instance/backend/beaver/tasks/attempt_orchestrator.py`
- Modify: `app-instance/backend/beaver/engine/loader.py`
- Modify: `app-instance/backend/tests/unit/test_task_execution_planner.py`
- [ ] **Step 1: Write failing planner tests**
```python
def test_template_plan_creates_generic_worker_not_role_agent() -> None:
    plan = TaskExecutionPlanner(tool_registry=_registry()).from_json(
        '{"mode":"team","strategy":"dag","nodes":[{"node_id":"collect","task":"Collect",'
        '"requested_tools":["web_search"]}],"adaptation":{"template_used":true}}'
    )
    node = plan.graph.nodes[0]
    assert node.agent.role == ""
    assert node.agent.metadata["sub_agent_kind"] == "generic_skill_worker"
    assert plan.planner_adaptation["template_used"] is True


def test_unknown_tool_is_removed_and_warned() -> None:
    plan = TaskExecutionPlanner(tool_registry=_registry()).from_json(
        '{"mode":"team","strategy":"sequence","nodes":[{"node_id":"collect","task":"Collect",'
        '"requested_tools":["web_search","not_real"]}]}'
    )
    assert plan.graph.nodes[0].allowed_tool_names == ["web_search"]
    assert "unknown tool removed: not_real" in plan.planner_adaptation["warnings"]
```
- [ ] **Step 2: Run it to verify failure**
Run: `cd app-instance/backend && uv run pytest tests/unit/test_task_execution_planner.py -q`
Expected: FAIL because planner has no template context, registry policy, or adaptation report.
- [ ] **Step 3: Implement task-only planner schema and one repair attempt**
Add `tool_registry: ToolRegistry | None` to `TaskExecutionPlanner`. Change `plan()` to receive `activated_skills: list[SkillContext]`, select at most one valid template, and include it in `_prompt`. Add `planner_adaptation: dict[str, Any] = field(default_factory=dict)` to `TaskExecutionPlan` and `to_event_payload()`.
Accept only `node_id`, `task`, `depends_on`, `input_contract`, `output_contract`, `requested_tools`, `required_evidence`, `validation_rules`, `required_for_completion`, `max_tool_iterations`, and `constraints`. Reject `agent` and `role`; construct `AgentDescriptor(name=node_id, role="", system_prompt="", metadata={"sub_agent_kind": "generic_skill_worker", ...})` internally.
Resolve requested names through registry plus conservative read-only policy. Write allowed names to `ExecutionNode.allowed_tool_names`; write unknown/high-risk removals into adaptation warnings. Validate node count, dependencies, cycles, and `graph.validate(max_depth=4)`. If first provider output is invalid, make exactly one `tools=None` repair request containing validation errors; if it is still invalid, return `TaskExecutionPlan.single("planner_fallback_single", fallback_error=...)`.
Update `TaskAttemptOrchestrator` to pass `preselected_skills`, and `EngineLoader` to construct planner with its registry.
- [ ] **Step 4: Run planner and task-mode regression tests**
Run: `cd app-instance/backend && uv run pytest tests/unit/test_task_execution_planner.py tests/unit/test_task_mode_feedback.py -q`
Expected: PASS.
- [ ] **Step 5: Commit**
Run: `git add app-instance/backend/beaver/tasks/planner.py app-instance/backend/beaver/tasks/attempt_orchestrator.py app-instance/backend/beaver/engine/loader.py app-instance/backend/tests/unit/test_task_execution_planner.py && git commit -m "feat(tasks): adapt skill templates into task graphs"`
### Task 4: Enforce Node Tool Allowlists
**Files:**
- Modify: `app-instance/backend/beaver/engine/loop.py`
- Modify: `app-instance/backend/beaver/tools/runtime/executor.py`
- Modify: `app-instance/backend/beaver/coordinator/local.py`
- Modify: `app-instance/backend/tests/unit/test_agent_loop.py`
- Create: `app-instance/backend/tests/unit/test_team_node_tool_policy.py`
- [ ] **Step 1: Write failing schema and executor tests**
```python
def test_team_node_exposes_only_allowed_tool_schema() -> None:
    asyncio.run(loop.process_direct("collect", allowed_tool_names=["web_search"]))
    assert _tool_names(provider.calls[0]["tools"]) == ["web_search"]


def test_executor_rejects_registered_tool_outside_node_allowlist() -> None:
    context = ToolContext(metadata={"allowed_tool_names": ["web_search"]})
    result = asyncio.run(executor.execute("write_file", {"path": "x", "content": "x"}, context=context))
    assert result.success is False
    assert result.error == "tool_not_allowed"
```
- [ ] **Step 2: Run it to verify failure**
Run: `cd app-instance/backend && uv run pytest tests/unit/test_team_node_tool_policy.py -q`
Expected: FAIL because allowlists are not accepted or enforced.
- [ ] **Step 3: Filter provider schemas and deny in executor**
Add `allowed_tool_names: list[str] | None = None` to `AgentLoop.process_direct()` and `_process_direct_impl()`. Filter assembled tool specs only when it is not `None`, and place it in `ToolContext.metadata`. Pass node scope and budget from `LocalAgentRunner`.
```python
allowed = context.metadata.get("allowed_tool_names") if context is not None else None
if isinstance(allowed, list) and tool_name not in allowed:
    return ToolResult(False, f"Tool {tool_name} is not allowed for this node.", tool_name, "tool_not_allowed")
```
Keep `None` distinct from `[]`: `None` preserves current single-agent behavior; an empty Team-node list exposes no tools.
- [ ] **Step 4: Run focused and loop regression tests**
Run: `cd app-instance/backend && uv run pytest tests/unit/test_team_node_tool_policy.py tests/unit/test_agent_loop.py -q`
Expected: PASS.
- [ ] **Step 5: Commit**
Run: `git add app-instance/backend/beaver/engine/loop.py app-instance/backend/beaver/tools/runtime/executor.py app-instance/backend/beaver/coordinator/local.py app-instance/backend/tests/unit/test_agent_loop.py app-instance/backend/tests/unit/test_team_node_tool_policy.py && git commit -m "feat(team): enforce node tool scopes"`
### Task 5: Gate Node Success on Required Evidence
**Files:**
- Modify: `app-instance/backend/beaver/tasks/evidence.py`
- Modify: `app-instance/backend/beaver/coordinator/local.py`
- Modify: `app-instance/backend/beaver/coordinator/execution/scheduler.py`
- Modify: `app-instance/backend/tests/unit/test_agent_team_v1.py`
- Modify: `app-instance/backend/tests/unit/test_task_evidence.py`
- [ ] **Step 1: Write failing evidence-completion tests**
```python
def test_node_without_required_tool_result_is_partial() -> None:
    result = asyncio.run(runner.run(_envelope(required_evidence=["tool_result"])))
    assert result.success is False
    assert result.completion_status == "partial"
    assert result.evidence_gaps == ["missing required evidence: tool_result"]


def test_dag_blocks_dependency_of_partial_required_node() -> None:
    outcome = asyncio.run(scheduler.run(_graph_with_partial_collect_node(), parent_task_id=None, parent_session_id="s"))
    assert outcome.node_results[1].finish_reason == "blocked"
```
- [ ] **Step 2: Run it to verify failure**
Run: `cd app-instance/backend && uv run pytest tests/unit/test_agent_team_v1.py tests/unit/test_task_evidence.py -q`
Expected: FAIL because evidence requirements do not affect node success.
- [ ] **Step 3: Implement deterministic evidence checks**
Add `evaluate_node_evidence(evidence, required_evidence, output_text) -> list[str]`. `tool_result` requires at least one successful tool result, `url` requires at least one tool result that carries a URL, and `output` requires non-empty node output; any other requirement produces `unsupported evidence requirement: <name>`. After `LocalAgentRunner` builds `RunEvidence`, set `completion_status="partial"`, `success=False`, and the evidence gaps when required evidence is absent. Scheduler-created error/blocked results set status to `failed`/`blocked` while retaining partial evidence.
- [ ] **Step 4: Run coordinator and evidence regression tests**
Run: `cd app-instance/backend && uv run pytest tests/unit/test_agent_team_v1.py tests/unit/test_task_evidence.py -q`
Expected: PASS.
- [ ] **Step 5: Commit**
Run: `git add app-instance/backend/beaver/tasks/evidence.py app-instance/backend/beaver/coordinator/local.py app-instance/backend/beaver/coordinator/execution/scheduler.py app-instance/backend/tests/unit/test_agent_team_v1.py app-instance/backend/tests/unit/test_task_evidence.py && git commit -m "feat(team): require declared node evidence"`
### Task 6: Gate Final Synthesis and Verify Finance Planning
**Files:**
- Modify: `app-instance/backend/beaver/tasks/attempt_orchestrator.py`
- Modify: `app-instance/backend/tests/unit/test_task_mode_feedback.py`
- Create: `app-instance/backend/tests/unit/test_task_team_synthesis_outcome.py`
- Modify: `app-instance/backend/tests/unit/test_task_execution_planner.py`
- Modify: `app-instance/backend/tests/unit/test_task_skill_resolver.py`
- [ ] **Step 1: Write failing outcome and finance tests**
```python
def test_required_partial_node_marks_synthesis_incomplete() -> None:
    context, prefix = orchestrator._team_synthesis_outcome(_plan(), _team_result(partial_required=True))
    assert "Task outcome: incomplete" in context
    # User-facing notice; the literal means "Task incomplete:".
    assert prefix.startswith("任务未完成:")


def test_finance_template_adapts_to_read_only_task_graph() -> None:
    plan = planner.from_json(_finance_plan_json())
    assert [node.node_id for node in plan.graph.nodes] == [
        "collect_official_sources", "extract_financial_metrics", "validate_metrics", "generate_chart_report"
    ]
    assert all(node.agent.role == "" for node in plan.graph.nodes)
    assert plan.graph.nodes[0].allowed_tool_names == ["web_search", "web_fetch"]
    assert plan.graph.nodes[-1].allowed_tool_names == []
```
- [ ] **Step 2: Run the targeted test to verify failure**
Run: `cd app-instance/backend && uv run pytest tests/unit/test_task_team_synthesis_outcome.py tests/unit/test_task_execution_planner.py -q`
Expected: FAIL because outcome gate is absent.
- [ ] **Step 3: Add deterministic incomplete output**
Add `_team_synthesis_outcome(plan, result) -> tuple[str, str]`. Any `required_for_completion=True` node whose `completion_status` is not `succeeded` makes the task outcome `incomplete`. The synthesis context includes each such node's id, status, error, and evidence gaps. Keep Team synthesis at `include_tools=False` and `max_tool_iterations=0`; prefix the final output only when the incomplete notice is missing. Write `task_outcome` and `incomplete_node_ids` to `task_synthesis_completed`.
Add `_finance_plan_json()` fixture with four task-oriented nodes and dependencies `collect -> extract -> validate -> report`. Only source/extraction nodes request `web_search`/`web_fetch`; report node uses upstream evidence and produces Markdown/table/chart data, never an unregistered chart renderer. Assert no node is named `researcher`, `writer`, or `reviewer`.
- [ ] **Step 4: Run complete backend unit suite**
Run: `cd app-instance/backend && uv run pytest tests/unit -q`
Expected: PASS. Fix only compatibility defects in this plan; do not change router, persistent agent registry, frontend, nested-team behavior, or Skill-learning eval semantics.
- [ ] **Step 5: Commit**
Run: `git add app-instance/backend/beaver/tasks/attempt_orchestrator.py app-instance/backend/tests/unit/test_task_mode_feedback.py app-instance/backend/tests/unit/test_task_team_synthesis_outcome.py app-instance/backend/tests/unit/test_task_execution_planner.py app-instance/backend/tests/unit/test_task_skill_resolver.py && git commit -m "test(team): cover skill-templated finance planning"`
## Plan Self-Review
- Coverage: parser compatibility, existing graph contracts, template adaptation/repair, tool enforcement, evidence completion, deterministic synthesis, and finance acceptance all have explicit tasks.
- Exclusions: no fixed role Agents, parallel Team model, nested graph execution, chart renderer, high-risk approval UI, frontend work, or Skill-eval redesign appears in the implementation scope.
- Compatibility: all new graph fields are defaults-only; `None` tool scope preserves single-agent behavior, while `[]` gives a Team node no tools.

@ -0,0 +1,176 @@
# Skill-Templated Task Graph Design
## Status
Approved for implementation planning on 2026-06-22. This document records the design only; it does not change runtime code.
## Decision
Beaver Agent Team remains a temporary, task-oriented execution graph. It is not a collection of persistent specialist roles. The implementation extends the existing `ExecutionGraph`, `ExecutionNode`, `LocalAgentRunner`, and `TeamGraphScheduler`; it does not add a parallel Team model or fixed Researcher/Writer/Reviewer classes.
```text
Task + activated Skills + runtime tool policy
-> Planner adapts an optional Skill template
-> validated ExecutionGraph
-> generic workers execute nodes under node constraints
-> evidence-aware completion gate
-> tool-free final synthesis from node evidence
```
## Existing Baseline
`TaskAttemptOrchestrator` currently preselects Skills, invokes `TaskExecutionPlanner`, executes an optional graph through `TeamService`, and runs the main agent as final synthesis. `ExecutionGraph` already validates sequence, parallel, and DAG dependencies. Each node is a generic `LocalAgentRunner` invocation of the shared `AgentLoop`; planner-created nodes have an empty role.
`RunEvidence` and `ToolEvidence` already capture transcripts and tool results. The gap is semantic: a node is currently successful when its `finish_reason` is `stop`, even if its task contract requires evidence and none was produced.
Skills currently have simple Markdown frontmatter plus body text and optional tool hints. The catalog parser deliberately has no general YAML dependency. Tool assembly currently selects always-on tools, Skill hints, and semantic-retrieval matches; it is not an execution-time node allowlist. Skill safety checks protect draft publication, not task execution.
## Scope
In scope:
- optional Skill planning templates;
- adaptive minimal graph planning and planner repair;
- node contracts, node tool scopes, evidence requirements, and completion states;
- deterministic handling of unknown and high-risk tool hints;
- grounded synthesis status and audit events;
- unit and integration coverage.
Out of scope:
- persistent role Agent classes or a role marketplace;
- a second graph/model hierarchy beside `ExecutionGraph` and `ExecutionNode`;
- recursive or unlimited nested teams;
- a distributed worker system;
- a high-risk approval UI or new approval API;
- chart-image rendering.
The current runtime registers `web_search` and `web_fetch` but no chart renderer. The finance acceptance case therefore produces evidence-backed comparison data and a textual/Markdown report, not a fabricated chart artifact.
## Data Model Evolution
`ExecutionGraph` remains the graph model. `ExecutionNode` gains optional defaults-only fields:
```python
input_contract: dict[str, object] = field(default_factory=dict)
output_contract: dict[str, object] = field(default_factory=dict)
allowed_tool_names: list[str] = field(default_factory=list)
required_evidence: list[str] = field(default_factory=list)
validation_rules: list[str] = field(default_factory=list)
required_for_completion: bool = True
max_tool_iterations: int | None = None
```
Existing callers retain their behavior because empty lists and `None` impose no new node requirement.
`NodeRunResult` remains the node-output container. It gains `completion_status` (`succeeded`, `partial`, `failed`, or `blocked`) and `evidence_gaps`. `success` remains for scheduler compatibility and is true only for `succeeded`. A completed run with missing required evidence is therefore `partial`, and downstream dependencies block exactly as they do for failed nodes.
`TaskExecutionPlan` gains a planner-adaptation payload rather than a duplicate graph object. The payload records template source/version, whether it was used, added/removed/merged node ids, removed tool names, warnings, and fallback reason. It is written into the existing `task_execution_planned` event.
No database migration is required in v1: graph and reports are transient execution state, while task/session events already persist plan metadata.
## Skill Template Format
An optional `beaver-team-template` JSON fenced block is added to the Skill body:
````md
## Team Planning Template
```beaver-team-template
{
  "version": 1,
  "team_when": ["multiple official sources require comparison"],
  "default_strategy": "dag",
  "nodes": [
    {
      "node_id": "collect_official_sources",
      "task": "Collect official primary sources for the requested entities.",
      "allowed_tools": ["web_search", "web_fetch"],
      "required_evidence": ["tool_result"],
      "required_for_completion": true
    }
  ]
}
```
````
The parser uses `json.loads` only. It returns an absent-template result when the block is missing and a warning result when JSON is malformed, duplicated, or structurally invalid. A parsing warning never prevents normal Skill activation. Existing `SKILL.md` files remain valid without migration.
The template is an LLM input, not an executable workflow. It supplies candidate nodes and constraints. The Planner may remove unnecessary nodes, merge trivial nodes, add essential validation, or choose single mode. It may not add an unregistered tool, bypass node tool policy, exceed graph limits, or convert a node into a role Agent.
## Planner Design
`TaskAttemptOrchestrator` passes activated `SkillContext` objects to the planner rather than only truncated summaries. The planner chooses at most one applicable template for the first implementation; multiple activated Skills remain ordinary guidance. This avoids composing incompatible templates before there is evidence for a composition model.
Planner output uses a task-only JSON schema. It contains `mode`, `reason`, `strategy`, `nodes`, `final_synthesis_instruction`, and `adaptation`. Nodes contain task, dependencies, contracts, requested tools, evidence requirements, validation rules, and completion importance. `agent` and `role` are not accepted as planner schema fields; the adapter creates the existing empty-role `AgentDescriptor` itself.
Validation is layered:
1. extract a JSON object from the LLM response;
2. validate scalar/list/object shapes and allowed keys;
3. resolve requested tools against registry and policy;
4. construct `ExecutionGraph` and validate node count, depth, dependencies, and cycles;
5. if invalid, make one no-tools repair request containing the validation errors;
6. if repair fails, use the existing safe single-mode fallback.
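The control flow of steps 1, 5, and 6 can be sketched without the provider or graph types. This is an illustration under stated assumptions: `ask` stands in for the planner's provider call and receives the previous validation errors, `validate` stands in for steps 2 to 4, and the returned dictionary is a placeholder for the existing single-mode fallback plan. All three shapes are hypothetical.

```python
import json
from typing import Callable


def extract_json_object(text: str) -> dict:
    """Pull the outermost {...} span out of an LLM response and parse it."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object in planner response")
    # json.JSONDecodeError is a ValueError subclass, so callers catch one type.
    return json.loads(text[start : end + 1])


def plan_with_one_repair(
    ask: Callable[[list[str]], str],
    validate: Callable[[dict], list[str]],
) -> dict:
    errors: list[str] = []
    for _attempt in range(2):  # the first request plus exactly one repair
        try:
            candidate = extract_json_object(ask(errors))
            errors = validate(candidate)
        except ValueError as exc:
            errors = [str(exc)]
        if not errors:
            return candidate
    # Both attempts failed: fall back to single mode and keep the reason.
    return {
        "mode": "single",
        "reason": "planner_fallback_single",
        "fallback_error": "; ".join(errors),
    }
```

Bounding the loop at two iterations keeps planning cost predictable; a planner that cannot produce a valid graph after seeing its own errors is unlikely to benefit from further retries.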
Single mode remains the default for an obvious one-step request. Template presence is a reason to ask the planner, not a reason to force team mode. Existing environment disablement (`BEAVER_AGENT_TEAM_ENABLED`) remains authoritative.
## Tool Policy and Safety
For a Team node, the final allowlist is:
```text
template/node requested names
∩ registered tools
∩ node runtime policy
```
Skill hints are suggestions, not authority. The current code has no populated task-time user/workspace permission model, so v1 must not claim that it enforces one. It uses a conservative node runtime policy:
- unknown names are removed and reported as planner warnings;
- read-only tools may remain available when the node requests them;
- high-risk/mutating names are removed by default and recorded as `requires_high_risk_review`;
- no node receives a broad tool set merely because a Skill hinted it.
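The intersection and the removal rules above can be sketched as a pure function. It assumes the registry and the read-only policy are available as plain sets of names, which is a simplification; `ToolResolution` and `resolve_node_tools` are hypothetical names, and the `requires_high_risk_review: <name>` warning format is an assumption built on the label in the list above.

```python
from dataclasses import dataclass, field


@dataclass
class ToolResolution:
    allowed: list[str] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)


def resolve_node_tools(
    requested: list[str],
    registered: set[str],
    read_only: set[str],
) -> ToolResolution:
    """Keep only requested tools that are registered and read-only."""
    resolution = ToolResolution()
    for name in requested:
        if name not in registered:
            resolution.warnings.append(f"unknown tool removed: {name}")
        elif name not in read_only:
            # Registered but mutating or high-risk: removed by default in v1.
            resolution.warnings.append(f"requires_high_risk_review: {name}")
        elif name not in resolution.allowed:
            resolution.allowed.append(name)  # preserve request order, drop duplicates
    return resolution
```

Because the result starts empty and only grows from `requested`, a node can never end up with a tool it did not ask for.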
Provider schemas are filtered to the allowlist, and `ToolExecutor` performs a second allowlist check through `ToolContext.metadata`. This prevents a model-originated call to a registered but unexposed tool from executing.
A real high-risk approval flow requires a task lifecycle state and UI/API confirmation. It is deferred; v1 blocks and explains rather than auto-approving.
## Runtime and Evidence Semantics
`DelegationEnvelope` receives node contracts, allowed tools, evidence requirements, and per-node tool budget. `LocalAgentRunner` passes the allowed tools and budget into the current `AgentLoop`, builds existing `RunEvidence`, and classifies completion.
Evidence requirements have deterministic meanings in v1:
- `tool_result`: at least one successful tool result;
- `url`: at least one tool result with a URL;
- `output`: non-empty node output;
- any other declared value: explicit evidence gap.
The scheduler keeps sequence/parallel/DAG semantics. Dependencies only receive succeeded upstream results. It does not retry, recursively expand Skills, or create another Team graph.
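The blocking rule can be illustrated on plain status strings. This sketch is not the `TeamGraphScheduler` implementation: it assumes nodes are visited in an order that lists dependencies before dependents, and `apply_dependency_blocking` is a hypothetical name.

```python
def apply_dependency_blocking(
    order: list[str],
    depends_on: dict[str, list[str]],
    run_status: dict[str, str],
) -> dict[str, str]:
    """Return each node's final status, blocking nodes whose upstream did not succeed.

    `order` must list dependencies before dependents (a topological order).
    `run_status` holds the status a node would report if it were allowed to run.
    """
    final: dict[str, str] = {}
    for node_id in order:
        upstream = depends_on.get(node_id, [])
        if any(final.get(dep) != "succeeded" for dep in upstream):
            final[node_id] = "blocked"  # the node never runs
        else:
            final[node_id] = run_status.get(node_id, "succeeded")
    return final
```

A `partial` node therefore blocks its whole downstream chain, while independent branches still run.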
Before final synthesis, `TaskAttemptOrchestrator` derives a task outcome:
- `complete`: every required-for-completion node succeeded;
- `incomplete`: any required node is partial, failed, or blocked;
- `single`: no Team graph ran.
Team synthesis continues to run with no tools. For `incomplete`, the synthesis context lists completed work, node failures, evidence gaps, and the deterministic task outcome. The returned user-facing answer is prefixed with an incomplete notice if the model omits it, so the runtime, not prompt compliance alone, prevents a false completion claim.
## Nested Teams, Observability, and Compatibility
Nested graph execution is deferred. A node can resolve another Skill as guidance through the existing resolver, but cannot create a child `ExecutionGraph`. The runtime has no recursive budget ledger, tree-shaped evidence model, or UI fault navigator. Limiting v1 to a single graph keeps node failures attributable and cost bounded by existing node, parallel, and tool limits.
Existing task events receive the adaptation report, resolved tools, policy removals, completion status, and evidence gaps. Existing Skill learning/replay remains unchanged in v1. Template-specific scoring waits until execution semantics are stable.
Compatibility guarantees:
- Skills without templates activate and execute unchanged.
- Existing direct `ExecutionGraph` callers work because new fields have defaults.
- Single-agent runs do not receive node tool policies or outcome prefixes.
- Existing external registry descriptors are not removed; planner-created Team nodes stay generic and role-empty.
- `TaskSkillResolver` remains the per-node published-Skill/ephemeral-guidance fallback.
## Verification Criteria
Implementation is accepted when tests prove that old Skills and single-agent tasks retain behavior; templates parse and degrade with warnings; planner emits only task-oriented generic workers; malformed output is repaired once or falls back; unknown/high-risk tools cannot execute; declared evidence controls node success; required-node failure forces incomplete synthesis; and the MGM/Galaxy case uses read-only web tools to produce evidence-backed comparison/report output without claiming a chart renderer.