13 KiB
Task Evidence and Validation Redesign
Date: 2026-05-22
Context
Two recent task runs exposed the same underlying weakness from different angles:
- The agent can use complete tool results, but validation only receives truncated excerpts. A key fact can be present in the run log yet absent from validation context, causing a false rejection.
- Team execution can gather useful evidence in sub-agent runs, but that evidence is not reliably carried into final synthesis or validation. Failed team nodes are especially lossy.
- Team graphs marked
parallelare currently scheduled through a shared single-consumerAgentLoop, so production execution can be effectively serial. - Final synthesis after a team run still has full tools available, so it can repeat searches instead of synthesizing from team evidence.
max_tool_iterationsstops the tool loop with a placeholder message instead of forcing a final answer from already gathered evidence.- Validation failures enter an open-looking state, which makes the UI feel like the task never completed.
The selected approach is a medium refactor: keep the existing AgentService, TeamService, and AgentLoop structure, but add a structured evidence pipeline, clearer validation semantics, finite team concurrency, no-tools synthesis after team runs, and explicit task states.
Goals
- Preserve complete run evidence for synthesis and validation.
- Stop using fixed truncation for validation inputs.
- Distinguish "answer is contradicted" from "validator lacks enough evidence".
- Let user feedback be the final business judgment after an answer is shown.
- Make
parallelteam execution actually concurrent within a bounded limit. - Prevent final synthesis from repeating team tool work by default.
- Produce a useful final answer when tool iteration limits are reached.
- Add enough debug metadata to diagnose validation decisions without reconstructing SQLite logs by hand.
Non-Goals
- Rewriting the whole execution runtime.
- Introducing a distributed worker pool.
- Building a generic evidence bus for every future subsystem.
- Solving all provider rate-limit and storage concurrency concerns beyond the bounded local concurrency needed for team parallel nodes.
Validation Semantics and Task States
Automatic validation becomes advisory evidence assessment, not the final user satisfaction signal.
Validation results should include:
status: Literal["accepted", "rejected", "insufficient_evidence", "validator_error"]
passed: bool
score: float
issues: list[str]
missing_requirements: list[str]
evidence_gaps: list[str]
recommended_revision_prompt: str
status is the business decision field. passed is a compatibility boolean derived from status, not an independent source of truth. The mapping is:
status == "accepted"->passed=Truestatus in {"rejected", "insufficient_evidence", "validator_error"}->passed=False
Task mode, retry, and status transition logic must branch on status. New code treats status == "accepted" as the acceptance condition. Existing compatibility paths may continue to interpret acceptance as passed and score >= 0.75 until they are migrated, but new logic should not derive status from passed or infer failure from passed=False alone.
Rules:
accepted: the final answer is supported by available evidence and satisfies the task. The task entersawaiting_feedback.insufficient_evidence: the validator cannot confirm the answer from available evidence. It must not claim fabrication or contradiction. The task entersneeds_review.validator_error: the validator failed to produce a reliable decision. The task entersneeds_review.rejected: the evidence clearly contradicts the answer, or the answer clearly misses the task. The first attempt can trigger retry. The last attempt entersfailedonly when there is no usable answer; otherwise it entersneeds_review.
Task statuses:
open: task exists but has not started.running: execution is active.validating: final answer exists and automatic validation is running.awaiting_feedback: answer is available and automatic validation accepted it.needs_review: answer is available, but automatic validation could not confirm it or hit a validator error.needs_revision: user requested revision, or automatic validation rejected an attempt that can still be retried.failed: execution ended without a usable answer.closed: user marked the answer satisfied.abandoned: user abandoned the task.
needs_review remains an open status for the active task API, but the UI should distinguish it from running. failed, closed, and abandoned are terminal.
Open status does not mean auto-runnable. The backend should split status semantics:
is_open: the task can still receive user feedback or revision.is_execution_active: the backend is currently running or validating work.requires_user_action: the task has stopped automatic execution and needs user input.
needs_review should have is_open=True, is_execution_active=False, and requires_user_action=True. Schedulers, automatic retry loops, and active-task polling must not treat needs_review as a reason to continue execution. It should appear in the active task API only so the user can review, mark satisfied, revise, or abandon.
User feedback is authoritative:
satisfiedcloses the task.revisemoves the task toneeds_revision.abandonmoves the task toabandoned.
Evidence Models
Add structured evidence models in the task or coordinator layer.
@dataclass(slots=True)
class ToolEvidence:
tool_name: str
tool_call_id: str | None
content: str
event_payload: dict[str, Any]
url: str | None = None
title: str | None = None
created_at: str | None = None
@dataclass(slots=True)
class RunEvidence:
run_id: str
session_id: str
output_text: str
finish_reason: str
transcript: list[dict[str, Any]]
tool_results: list[ToolEvidence]
warnings: list[str]
@dataclass(slots=True)
class TaskEvidencePacket:
task_id: str
attempt_index: int
main_run: RunEvidence | None
team_runs: list[RunEvidence]
team_node_results: list[NodeRunResult]
final_output: str
llm_request_snapshotted events are debug material, not task evidence. They may be referenced in validation debug metadata, but validation should primarily consume transcript, tool results, team node outputs, and final output.
Evidence Data Flow
AgentLoopcontinues to write session events as it does now.- After a run completes, an evidence builder reads
session_manager.get_run_event_records(session_id, run_id)and createsRunEvidence. LocalAgentRunner.run()attachesRunEvidencetoNodeRunResult.NodeRunResultgainsevidence: RunEvidence | None.TeamRunResultcarries node evidence throughnode_results; it may also expose a conveniencerun_evidencelist.AgentService._run_task_mode()builds aTaskEvidencePacketafter team execution and final synthesis.- Final synthesis receives a rendered evidence context built from the same packet.
ValidationService.validate_task_result()receives the same packet and renders it into the validation prompt without fixed truncation.
Failed or partial nodes must still preserve evidence. A node with finish_reason="max_tool_iterations" can be unsuccessful while still carrying useful tool results.
Final Synthesis Behavior
For team-backed task plans, final synthesis defaults to no tools:
include_tools = False
max_tool_iterations = 0
The synthesis prompt should instruct the main agent to:
- use team evidence as the source of truth;
- avoid repeating failed or completed tool calls;
- answer with available evidence;
- clearly state missing or uncertain information.
The planner may explicitly allow a small synthesis tool budget, but the default is no-tools synthesis. If allowed, the budget should be small, such as max_tool_iterations=1.
Tool Iteration Finalization
When a run reaches max_tool_iterations and the model still requests tools, the loop should not return Tool loop stopped... as the final user-visible answer.
Instead, the loop performs one no-tools finalization call:
- use the accumulated messages and tool results;
- call the provider with
tools=None; - add an instruction that the tool budget is exhausted and the model must answer from existing evidence;
- mark the finish reason as
max_tool_iterations_finalizedor another explicit non-stop value; - return the finalization text as
output_text.
If finalization itself fails or returns empty content, only then use a clear fallback message explaining that the run could not produce a usable answer.
Limited Parallel Team Execution
parallel team nodes should run concurrently without rewriting the runtime.
Design:
- Keep sequence and DAG behavior on the shared loop where appropriate.
- For
parallelgraph batches, run nodes through isolatedAgentLoopinstances. - Each isolated loop uses the same workspace and service configuration so session and run records remain queryable from the same stores.
- Add
max_parallel_team_nodes, default3. - Use an
asyncio.Semaphorein the scheduler to bound concurrent nodes. - Return
TeamRunResult.node_resultsin graph node order, not completion order.
The implementation should check shared store concurrency. If the current store is not safe for concurrent async writes, add a narrow lock around session/task/run store writes used by these parallel runs.
Validation Prompt
The validation prompt should consume the full rendered evidence packet, without [:2500], [:500], or [:12] fixed caps.
Required validator instructions:
- Return only JSON with the validation fields.
- If evidence is incomplete, return
insufficient_evidence. - Only return
rejectedfor clear contradiction or clear task failure. - Do not infer fabrication from missing evidence.
- Do not claim a source lacks a fact unless the rendered evidence proves that absence.
- Treat user feedback as the final business judgment outside automatic validation.
The validator should still be strict about answer quality when evidence is sufficient.
Validation Debug Metadata
Each task_validation_snapshotted event should record:
- validation result;
- validation status;
- attempt index;
- evidence run ids;
- evidence session ids;
- tool result count;
- evidence character length;
- validator raw response;
- rendered validation input or prompt, unless a debug setting disables full prompt storage.
This makes future investigations direct: inspect the exact input the validator saw before interpreting its decision.
Log Snapshot Size
llm_request_snapshotted currently stores complete messages and complete tool schemas in both payload and content. That makes logs large and slows inspection.
Default behavior should change to store a compact payload:
- iteration;
- provider name and model;
- message count;
- tool names;
- message character length;
- tool schema character length;
- max tokens, temperature, thinking flag.
Full request snapshots should be controlled by a debug config flag. This does not reduce validation evidence because evidence comes from transcript and tool result events.
Testing Plan
Add or update focused unit tests:
- Validation evidence is not fixed-truncated. A fact after the first 500 characters of a tool result still appears in the validator input.
- Missing evidence returns
insufficient_evidenceand moves the task toneeds_review, notfailed. - A team node that ends with
max_tool_iterationspreserves tool evidence inNodeRunResult.evidence. - Team final synthesis defaults to
tools=Noneand receives rendered team evidence. - Parallel team nodes start concurrently under a bounded semaphore and results remain in graph order.
- Tool loop finalization produces a user-visible answer instead of the placeholder stop message.
- Status transitions cover
accepted -> awaiting_feedback,insufficient_evidence -> needs_review,validator_error -> needs_review, and terminalfailed. - Validation debug events include evidence metadata and validator raw response.
Migration Notes
To reduce risk, implement in layers:
- Add evidence models and builders without changing behavior.
- Attach evidence to team node results.
- Switch final synthesis for team plans to no-tools evidence synthesis.
- Switch validation to evidence packets and new statuses.
- Add no-tools finalization for tool iteration limits.
- Add limited isolated-loop parallel execution.
- Slim
llm_request_snapshottedbehind a debug flag.
This order keeps each change testable and lets the old transcript-summary path remain as a temporary fallback while evidence packets are introduced.