diff --git a/docs/superpowers/specs/2026-05-22-task-evidence-validation-design.md b/docs/superpowers/specs/2026-05-22-task-evidence-validation-design.md new file mode 100644 index 0000000..e3d260e --- /dev/null +++ b/docs/superpowers/specs/2026-05-22-task-evidence-validation-design.md @@ -0,0 +1,250 @@ +# Task Evidence and Validation Redesign + +Date: 2026-05-22 + +## Context + +Two recent task runs exposed the same underlying weakness from different angles: + +- The agent can use complete tool results, but validation only receives truncated excerpts. A key fact can be present in the run log yet absent from validation context, causing a false rejection. +- Team execution can gather useful evidence in sub-agent runs, but that evidence is not reliably carried into final synthesis or validation. Failed team nodes are especially lossy. +- Team graphs marked `parallel` are currently scheduled through a shared single-consumer `AgentLoop`, so production execution can be effectively serial. +- Final synthesis after a team run still has full tools available, so it can repeat searches instead of synthesizing from team evidence. +- `max_tool_iterations` stops the tool loop with a placeholder message instead of forcing a final answer from already gathered evidence. +- Validation failures enter an open-looking state, which makes the UI feel like the task never completed. + +The selected approach is a medium refactor: keep the existing `AgentService`, `TeamService`, and `AgentLoop` structure, but add a structured evidence pipeline, clearer validation semantics, finite team concurrency, no-tools synthesis after team runs, and explicit task states. + +## Goals + +- Preserve complete run evidence for synthesis and validation. +- Stop using fixed truncation for validation inputs. +- Distinguish "answer is contradicted" from "validator lacks enough evidence". +- Let user feedback be the final business judgment after an answer is shown. +- Make `parallel` team execution actually concurrent within a bounded limit. +- Prevent final synthesis from repeating team tool work by default. +- Produce a useful final answer when tool iteration limits are reached. +- Add enough debug metadata to diagnose validation decisions without reconstructing SQLite logs by hand. + +## Non-Goals + +- Rewriting the whole execution runtime. +- Introducing a distributed worker pool. +- Building a generic evidence bus for every future subsystem. +- Solving all provider rate-limit and storage concurrency concerns beyond the bounded local concurrency needed for team parallel nodes. + +## Validation Semantics and Task States + +Automatic validation becomes advisory evidence assessment, not the final user satisfaction signal. + +Validation results should include: + +```python +status: Literal["accepted", "rejected", "insufficient_evidence", "validator_error"] +passed: bool +score: float +issues: list[str] +missing_requirements: list[str] +evidence_gaps: list[str] +recommended_revision_prompt: str +``` + +Rules: + +- `accepted`: the final answer is supported by available evidence and satisfies the task. The task enters `awaiting_feedback`. +- `insufficient_evidence`: the validator cannot confirm the answer from available evidence. It must not claim fabrication or contradiction. The task enters `needs_review`. +- `validator_error`: the validator failed to produce a reliable decision. The task enters `needs_review`. +- `rejected`: the evidence clearly contradicts the answer, or the answer clearly misses the task. The first attempt can trigger retry. The last attempt enters `failed` only when there is no usable answer; otherwise it enters `needs_review`. + +Task statuses: + +- `open`: task exists but has not started. +- `running`: execution is active. +- `validating`: final answer exists and automatic validation is running. +- `awaiting_feedback`: answer is available and automatic validation accepted it. +- `needs_review`: answer is available, but automatic validation could not confirm it or hit a validator error. +- `needs_revision`: user requested revision, or automatic validation rejected an attempt that can still be retried. +- `failed`: execution ended without a usable answer. +- `closed`: user marked the answer satisfied. +- `abandoned`: user abandoned the task. + +`needs_review` remains an open status for the active task API, but the UI should distinguish it from `running`. `failed`, `closed`, and `abandoned` are terminal. + +User feedback is authoritative: + +- `satisfied` closes the task. +- `revise` moves the task to `needs_revision`. +- `abandon` moves the task to `abandoned`. + +## Evidence Models + +Add structured evidence models in the task or coordinator layer. + +```python +@dataclass(slots=True) +class ToolEvidence: + tool_name: str + tool_call_id: str | None + content: str + event_payload: dict[str, Any] + url: str | None = None + title: str | None = None + created_at: str | None = None + + +@dataclass(slots=True) +class RunEvidence: + run_id: str + session_id: str + output_text: str + finish_reason: str + transcript: list[dict[str, Any]] + tool_results: list[ToolEvidence] + warnings: list[str] + + +@dataclass(slots=True) +class TaskEvidencePacket: + task_id: str + attempt_index: int + main_run: RunEvidence | None + team_runs: list[RunEvidence] + team_node_results: list[NodeRunResult] + final_output: str +``` + +`llm_request_snapshotted` events are debug material, not task evidence. They may be referenced in validation debug metadata, but validation should primarily consume transcript, tool results, team node outputs, and final output. + +## Evidence Data Flow + +1. `AgentLoop` continues to write session events as it does now. +2. After a run completes, an evidence builder reads `session_manager.get_run_event_records(session_id, run_id)` and creates `RunEvidence`. +3. `LocalAgentRunner.run()` attaches `RunEvidence` to `NodeRunResult`. +4. `NodeRunResult` gains `evidence: RunEvidence | None`. +5. `TeamRunResult` carries node evidence through `node_results`; it may also expose a convenience `run_evidence` list. +6. `AgentService._run_task_mode()` builds a `TaskEvidencePacket` after team execution and final synthesis. +7. Final synthesis receives a rendered evidence context built from the same packet. +8. `ValidationService.validate_task_result()` receives the same packet and renders it into the validation prompt without fixed truncation. + +Failed or partial nodes must still preserve evidence. A node with `finish_reason="max_tool_iterations"` can be unsuccessful while still carrying useful tool results. + +## Final Synthesis Behavior + +For team-backed task plans, final synthesis defaults to no tools: + +```python +include_tools = False +max_tool_iterations = 0 +``` + +The synthesis prompt should instruct the main agent to: + +- use team evidence as the source of truth; +- avoid repeating failed or completed tool calls; +- answer with available evidence; +- clearly state missing or uncertain information. + +The planner may explicitly allow a small synthesis tool budget, but the default is no-tools synthesis. If allowed, the budget should be small, such as `max_tool_iterations=1`. + +## Tool Iteration Finalization + +When a run reaches `max_tool_iterations` and the model still requests tools, the loop should not return `Tool loop stopped...` as the final user-visible answer. + +Instead, the loop performs one no-tools finalization call: + +- use the accumulated messages and tool results; +- call the provider with `tools=None`; +- add an instruction that the tool budget is exhausted and the model must answer from existing evidence; +- mark the finish reason as `max_tool_iterations_finalized` or another explicit non-stop value; +- return the finalization text as `output_text`. + +If finalization itself fails or returns empty content, only then use a clear fallback message explaining that the run could not produce a usable answer. + +## Limited Parallel Team Execution + +`parallel` team nodes should run concurrently without rewriting the runtime. + +Design: + +- Keep sequence and DAG behavior on the shared loop where appropriate. +- For `parallel` graph batches, run nodes through isolated `AgentLoop` instances. +- Each isolated loop uses the same workspace and service configuration so session and run records remain queryable from the same stores. +- Add `max_parallel_team_nodes`, default `3`. +- Use an `asyncio.Semaphore` in the scheduler to bound concurrent nodes. +- Return `TeamRunResult.node_results` in graph node order, not completion order. + +The implementation should check shared store concurrency. If the current store is not safe for concurrent async writes, add a narrow lock around session/task/run store writes used by these parallel runs. + +## Validation Prompt + +The validation prompt should consume the full rendered evidence packet, without `[:2500]`, `[:500]`, or `[:12]` fixed caps. + +Required validator instructions: + +- Return only JSON with the validation fields. +- If evidence is incomplete, return `insufficient_evidence`. +- Only return `rejected` for clear contradiction or clear task failure. +- Do not infer fabrication from missing evidence. +- Do not claim a source lacks a fact unless the rendered evidence proves that absence. +- Treat user feedback as the final business judgment outside automatic validation. + +The validator should still be strict about answer quality when evidence is sufficient. + +## Validation Debug Metadata + +Each `task_validation_snapshotted` event should record: + +- validation result; +- validation status; +- attempt index; +- evidence run ids; +- evidence session ids; +- tool result count; +- evidence character length; +- validator raw response; +- rendered validation input or prompt, unless a debug setting disables full prompt storage. + +This makes future investigations direct: inspect the exact input the validator saw before interpreting its decision. + +## Log Snapshot Size + +`llm_request_snapshotted` currently stores complete messages and complete tool schemas in both payload and content. That makes logs large and slows inspection. + +Default behavior should change to store a compact payload: + +- iteration; +- provider name and model; +- message count; +- tool names; +- message character length; +- tool schema character length; +- max tokens, temperature, thinking flag. + +Full request snapshots should be controlled by a debug config flag. This does not reduce validation evidence because evidence comes from transcript and tool result events. + +## Testing Plan + +Add or update focused unit tests: + +1. Validation evidence is not fixed-truncated. A fact after the first 500 characters of a tool result still appears in the validator input. +2. Missing evidence returns `insufficient_evidence` and moves the task to `needs_review`, not `failed`. +3. A team node that ends with `max_tool_iterations` preserves tool evidence in `NodeRunResult.evidence`. +4. Team final synthesis defaults to `tools=None` and receives rendered team evidence. +5. Parallel team nodes start concurrently under a bounded semaphore and results remain in graph order. +6. Tool loop finalization produces a user-visible answer instead of the placeholder stop message. +7. Status transitions cover `accepted -> awaiting_feedback`, `insufficient_evidence -> needs_review`, `validator_error -> needs_review`, and terminal `failed`. +8. Validation debug events include evidence metadata and validator raw response. + +## Migration Notes + +To reduce risk, implement in layers: + +1. Add evidence models and builders without changing behavior. +2. Attach evidence to team node results. +3. Switch final synthesis for team plans to no-tools evidence synthesis. +4. Switch validation to evidence packets and new statuses. +5. Add no-tools finalization for tool iteration limits. +6. Add limited isolated-loop parallel execution. +7. Slim `llm_request_snapshotted` behind a debug flag. + +This order keeps each change testable and lets the old transcript-summary path remain as a temporary fallback while evidence packets are introduced.