beaver_project/docs/superpowers/specs/2026-05-22-task-evidence-validation-design.md

Task Evidence and Validation Redesign

Date: 2026-05-22

Context

Two recent task runs exposed the same underlying weakness from different angles:

  • The agent can use complete tool results, but validation only receives truncated excerpts. A key fact can be present in the run log yet absent from validation context, causing a false rejection.
  • Team execution can gather useful evidence in sub-agent runs, but that evidence is not reliably carried into final synthesis or validation. Failed team nodes are especially lossy.
  • Team graphs marked parallel are currently scheduled through a shared single-consumer AgentLoop, so production execution can be effectively serial.
  • Final synthesis after a team run still has full tools available, so it can repeat searches instead of synthesizing from team evidence.
  • max_tool_iterations stops the tool loop with a placeholder message instead of forcing a final answer from already gathered evidence.
  • A validation failure leaves the task in a status that still looks open, so the UI suggests the task never completed.

The selected approach is a medium refactor: keep the existing AgentService, TeamService, and AgentLoop structure, but add a structured evidence pipeline, clearer validation semantics, finite team concurrency, no-tools synthesis after team runs, and explicit task states.

Goals

  • Preserve complete run evidence for synthesis and validation.
  • Stop using fixed truncation for validation inputs.
  • Distinguish "answer is contradicted" from "validator lacks enough evidence".
  • Let user feedback be the final business judgment after an answer is shown.
  • Make parallel team execution actually concurrent within a bounded limit.
  • Prevent final synthesis from repeating team tool work by default.
  • Produce a useful final answer when tool iteration limits are reached.
  • Add enough debug metadata to diagnose validation decisions without reconstructing SQLite logs by hand.

Non-Goals

  • Rewriting the whole execution runtime.
  • Introducing a distributed worker pool.
  • Building a generic evidence bus for every future subsystem.
  • Solving all provider rate-limit and storage concurrency concerns beyond the bounded local concurrency needed for team parallel nodes.

Validation Semantics and Task States

Automatic validation becomes an advisory assessment of the evidence, not the final signal of user satisfaction.

Validation results should include:

status: Literal["accepted", "rejected", "insufficient_evidence", "validator_error"]
passed: bool
score: float
issues: list[str]
missing_requirements: list[str]
evidence_gaps: list[str]
recommended_revision_prompt: str

Rules:

  • accepted: the final answer is supported by available evidence and satisfies the task. The task enters awaiting_feedback.
  • insufficient_evidence: the validator cannot confirm the answer from available evidence. It must not claim fabrication or contradiction. The task enters needs_review.
  • validator_error: the validator failed to produce a reliable decision. The task enters needs_review.
  • rejected: the evidence clearly contradicts the answer, or the answer clearly misses the task. A rejected attempt that can still be retried moves the task to needs_revision and triggers a retry. A rejected last attempt enters failed only when there is no usable answer; otherwise it enters needs_review.
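
The rules above amount to a small decision table. The following is a minimal sketch, not the project's implementation: the helper name `next_task_status` and the two boolean inputs `can_retry` and `has_usable_answer` are assumptions introduced here for illustration.

```python
from typing import Literal

ValidationStatus = Literal[
    "accepted", "rejected", "insufficient_evidence", "validator_error"
]


def next_task_status(
    status: ValidationStatus,
    *,
    can_retry: bool,
    has_usable_answer: bool,
) -> str:
    """Map a validation status to the next task status (hypothetical helper)."""
    if status == "accepted":
        return "awaiting_feedback"
    if status in ("insufficient_evidence", "validator_error"):
        # The validator could not decide, so the answer goes to review.
        return "needs_review"
    if status == "rejected":
        if can_retry:
            # Earlier attempts may be retried.
            return "needs_revision"
        # Last attempt: fail only when nothing usable was produced.
        return "needs_review" if has_usable_answer else "failed"
    raise ValueError(f"unknown validation status: {status!r}")
```

Keeping the mapping in one pure function would make the status transitions easy to unit test in isolation.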

Task statuses:

  • open: task exists but has not started.
  • running: execution is active.
  • validating: final answer exists and automatic validation is running.
  • awaiting_feedback: answer is available and automatic validation accepted it.
  • needs_review: answer is available, but automatic validation could not confirm it or hit a validator error.
  • needs_revision: user requested revision, or automatic validation rejected an attempt that can still be retried.
  • failed: execution ended without a usable answer.
  • closed: user marked the answer satisfied.
  • abandoned: user abandoned the task.

needs_review remains an open status for the active task API, but the UI should distinguish it from running. failed, closed, and abandoned are terminal.

User feedback is authoritative:

  • satisfied closes the task.
  • revise moves the task to needs_revision.
  • abandon moves the task to abandoned.
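
The feedback rules can be expressed as a lookup table. This is a sketch under stated assumptions: the names `FEEDBACK_TRANSITIONS`, `TERMINAL_STATUSES`, and `apply_user_feedback` are hypothetical, and rejecting feedback on a terminal task is an assumption the spec does not state explicitly.

```python
# Terminal statuses as listed in the spec.
TERMINAL_STATUSES = frozenset({"failed", "closed", "abandoned"})

# User feedback value -> resulting task status.
FEEDBACK_TRANSITIONS = {
    "satisfied": "closed",
    "revise": "needs_revision",
    "abandon": "abandoned",
}


def apply_user_feedback(current_status: str, feedback: str) -> str:
    """Return the task status after user feedback (hypothetical helper)."""
    if current_status in TERMINAL_STATUSES:
        # Assumption: terminal tasks do not accept further feedback.
        raise ValueError(f"task is already terminal: {current_status}")
    if feedback not in FEEDBACK_TRANSITIONS:
        raise ValueError(f"unknown feedback: {feedback!r}")
    return FEEDBACK_TRANSITIONS[feedback]
```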

Evidence Models

Add structured evidence models in the task or coordinator layer.

from dataclasses import dataclass
from typing import Any


@dataclass(slots=True)
class ToolEvidence:
    tool_name: str
    tool_call_id: str | None
    content: str
    event_payload: dict[str, Any]
    url: str | None = None
    title: str | None = None
    created_at: str | None = None


@dataclass(slots=True)
class RunEvidence:
    run_id: str
    session_id: str
    output_text: str
    finish_reason: str
    transcript: list[dict[str, Any]]
    tool_results: list[ToolEvidence]
    warnings: list[str]


@dataclass(slots=True)
class TaskEvidencePacket:
    task_id: str
    attempt_index: int
    main_run: RunEvidence | None
    team_runs: list[RunEvidence]
    team_node_results: list[NodeRunResult]
    final_output: str

llm_request_snapshotted events are debug material, not task evidence. They may be referenced in validation debug metadata, but validation should primarily consume transcript, tool results, team node outputs, and final output.

Evidence Data Flow

  1. AgentLoop continues to write session events as it does now.
  2. After a run completes, an evidence builder reads session_manager.get_run_event_records(session_id, run_id) and creates RunEvidence.
  3. LocalAgentRunner.run() attaches RunEvidence to NodeRunResult.
  4. NodeRunResult gains evidence: RunEvidence | None.
  5. TeamRunResult carries node evidence through node_results; it may also expose a convenience run_evidence list.
  6. AgentService._run_task_mode() builds a TaskEvidencePacket after team execution and final synthesis.
  7. Final synthesis receives a rendered evidence context built from the same packet.
  8. ValidationService.validate_task_result() receives the same packet and renders it into the validation prompt without fixed truncation.

Failed or partial nodes must still preserve evidence. A node with finish_reason="max_tool_iterations" can be unsuccessful while still carrying useful tool results.

Final Synthesis Behavior

For team-backed task plans, final synthesis defaults to no tools:

include_tools = False
max_tool_iterations = 0

The synthesis prompt should instruct the main agent to:

  • use team evidence as the source of truth;
  • avoid repeating failed or completed tool calls;
  • answer with available evidence;
  • clearly state missing or uncertain information.

The planner may explicitly allow a small synthesis tool budget, but the default is no-tools synthesis. If allowed, the budget should be small, such as max_tool_iterations=1.
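
A minimal sketch of the default and the optional planner budget, assuming hypothetical names (`SynthesisOptions`, `synthesis_options`, `build_synthesis_prompt`) and a hard cap of one tool iteration. The prompt wording is illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynthesisOptions:
    include_tools: bool
    max_tool_iterations: int


def synthesis_options(planner_tool_budget: int = 0, cap: int = 1) -> SynthesisOptions:
    """Default to no tools; allow a small, capped budget if the planner asks."""
    if planner_tool_budget <= 0:
        return SynthesisOptions(include_tools=False, max_tool_iterations=0)
    return SynthesisOptions(
        include_tools=True,
        max_tool_iterations=min(planner_tool_budget, cap),
    )


# Illustrative wording for the four instructions listed in the spec.
SYNTHESIS_RULES = (
    "Use the team evidence below as the source of truth.",
    "Do not repeat tool calls that already completed or failed.",
    "Answer with the evidence that is available.",
    "State clearly what is missing or uncertain.",
)


def build_synthesis_prompt(task: str, rendered_evidence: str) -> str:
    """Assemble a synthesis prompt from the task and rendered evidence."""
    rules = "\n".join(f"- {rule}" for rule in SYNTHESIS_RULES)
    return f"Task:\n{task}\n\nRules:\n{rules}\n\nTeam evidence:\n{rendered_evidence}"
```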

Tool Iteration Finalization

When a run reaches max_tool_iterations and the model still requests tools, the loop should not return Tool loop stopped... as the final user-visible answer.

Instead, the loop performs one no-tools finalization call:

  • use the accumulated messages and tool results;
  • call the provider with tools=None;
  • add an instruction that the tool budget is exhausted and the model must answer from existing evidence;
  • mark the finish reason as max_tool_iterations_finalized, or another explicit value that cannot be mistaken for a normal stop;
  • return the finalization text as output_text.

If finalization itself fails or returns empty content, only then use a clear fallback message explaining that the run could not produce a usable answer.

Limited Parallel Team Execution

Team nodes in graphs marked parallel should run concurrently, without rewriting the runtime.

Design:

  • Keep sequence and DAG behavior on the shared loop where appropriate.
  • For parallel graph batches, run nodes through isolated AgentLoop instances.
  • Each isolated loop uses the same workspace and service configuration so session and run records remain queryable from the same stores.
  • Add max_parallel_team_nodes, default 3.
  • Use an asyncio.Semaphore in the scheduler to bound concurrent nodes.
  • Return TeamRunResult.node_results in graph node order, not completion order.

The implementation should check shared store concurrency. If the current store is not safe for concurrent async writes, add a narrow lock around session/task/run store writes used by these parallel runs.
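
A minimal sketch of the bounded scheduling described above. `run_parallel_batch` and the `run_node` callable are hypothetical stand-ins for the scheduler and for running one node on an isolated AgentLoop. `asyncio.gather` returns results in argument order, which gives graph node order regardless of completion order.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def run_parallel_batch(
    node_ids: Sequence[str],
    run_node: Callable[[str], Awaitable[str]],
    max_parallel_team_nodes: int = 3,
) -> list[str]:
    """Run nodes concurrently, at most max_parallel_team_nodes at a time."""
    semaphore = asyncio.Semaphore(max_parallel_team_nodes)

    async def run_bounded(node_id: str) -> str:
        async with semaphore:  # blocks while the limit is reached
            return await run_node(node_id)

    # gather() preserves the order of its arguments in the result list.
    results = await asyncio.gather(*(run_bounded(n) for n in node_ids))
    return list(results)
```

If one node raises, `gather` propagates the first exception by default; a scheduler that must preserve evidence from failed nodes would more likely catch errors inside `run_node` and return a failed result instead.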

Validation Prompt

The validation prompt should consume the full rendered evidence packet, without [:2500], [:500], or [:12] fixed caps.

Required validator instructions:

  • Return only JSON with the validation fields.
  • If evidence is incomplete, return insufficient_evidence.
  • Only return rejected for clear contradiction or clear task failure.
  • Do not infer fabrication from missing evidence.
  • Do not claim a source lacks a fact unless the rendered evidence proves that absence.
  • Treat user feedback as the final business judgment outside automatic validation.

The validator should still be strict about answer quality when evidence is sufficient.
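
On the receiving side, the validator's JSON reply has to be parsed defensively. The sketch below assumes a hypothetical `parse_validator_response` helper; mapping unparseable output to validator_error follows the status definitions in this spec, while deriving `passed` from `status` is an assumption.

```python
import json
from typing import Any

VALID_STATUSES = frozenset(
    {"accepted", "rejected", "insufficient_evidence", "validator_error"}
)


def parse_validator_response(raw: str) -> dict[str, Any]:
    """Parse the validator's JSON reply into the validation result fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = None
    if not isinstance(data, dict) or data.get("status") not in VALID_STATUSES:
        # No reliable decision: report a validator error, never a rejection.
        return {
            "status": "validator_error",
            "passed": False,
            "score": 0.0,
            "issues": ["validator response was not valid JSON with a known status"],
            "missing_requirements": [],
            "evidence_gaps": [],
            "recommended_revision_prompt": "",
        }
    # Assumption: passed is derived from status so the two cannot disagree.
    data["passed"] = data["status"] == "accepted"
    data.setdefault("score", 0.0)
    for key in ("issues", "missing_requirements", "evidence_gaps"):
        data.setdefault(key, [])
    data.setdefault("recommended_revision_prompt", "")
    return data
```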

Validation Debug Metadata

Each task_validation_snapshotted event should record:

  • validation result;
  • validation status;
  • attempt index;
  • evidence run ids;
  • evidence session ids;
  • tool result count;
  • evidence character length;
  • validator raw response;
  • rendered validation input or prompt, unless a debug setting disables full prompt storage.

This makes future investigations direct: inspect the exact input the validator saw before interpreting its decision.
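
One possible shape for the event payload, as a sketch. The function name, the key names, and the `store_full_prompt` flag are assumptions; each entry in `runs` is assumed to be a dict carrying `run_id`, `session_id`, and `tool_result_count`.

```python
from typing import Any


def build_validation_debug_payload(
    *,
    result: dict[str, Any],
    attempt_index: int,
    runs: list[dict[str, Any]],
    raw_response: str,
    rendered_input: str,
    store_full_prompt: bool = True,
) -> dict[str, Any]:
    """Assemble the payload for a task_validation_snapshotted event."""
    payload: dict[str, Any] = {
        "validation_result": result,
        "validation_status": result.get("status"),
        "attempt_index": attempt_index,
        "evidence_run_ids": [run["run_id"] for run in runs],
        "evidence_session_ids": [run["session_id"] for run in runs],
        "tool_result_count": sum(run.get("tool_result_count", 0) for run in runs),
        "evidence_chars": len(rendered_input),
        "validator_raw_response": raw_response,
    }
    if store_full_prompt:
        # Skipped when a debug setting disables full prompt storage.
        payload["rendered_input"] = rendered_input
    return payload
```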

Log Snapshot Size

llm_request_snapshotted currently stores complete messages and complete tool schemas in both payload and content. That makes logs large and slows inspection.

Default behavior should change to store a compact payload:

  • iteration;
  • provider name and model;
  • message count;
  • tool names;
  • message character length;
  • tool schema character length;
  • max tokens, temperature, thinking flag.

Full request snapshots should be controlled by a debug config flag. This does not reduce validation evidence because evidence comes from transcript and tool result events.

Testing Plan

Add or update focused unit tests:

  1. Validation evidence is not fixed-truncated. A fact after the first 500 characters of a tool result still appears in the validator input.
  2. Missing evidence returns insufficient_evidence and moves the task to needs_review, not failed.
  3. A team node that ends with max_tool_iterations preserves tool evidence in NodeRunResult.evidence.
  4. Team final synthesis defaults to tools=None and receives rendered team evidence.
  5. Parallel team nodes start concurrently under a bounded semaphore and results remain in graph order.
  6. Tool loop finalization produces a user-visible answer instead of the placeholder stop message.
  7. Status transitions cover accepted -> awaiting_feedback, insufficient_evidence -> needs_review, validator_error -> needs_review, and terminal failed.
  8. Validation debug events include evidence metadata and validator raw response.

Migration Notes

To reduce risk, implement in layers:

  1. Add evidence models and builders without changing behavior.
  2. Attach evidence to team node results.
  3. Switch final synthesis for team plans to no-tools evidence synthesis.
  4. Switch validation to evidence packets and new statuses.
  5. Add no-tools finalization for tool iteration limits.
  6. Add limited isolated-loop parallel execution.
  7. Slim llm_request_snapshotted behind a debug flag.

This order keeps each change testable and lets the old transcript-summary path remain as a temporary fallback while evidence packets are introduced.