docs: design task evidence validation refactor
@ -0,0 +1,250 @@
# Task Evidence and Validation Redesign

Date: 2026-05-22

## Context

Two recent task runs exposed the same underlying weakness from different angles:

- The agent can use complete tool results, but validation only receives truncated excerpts. A key fact can be present in the run log yet absent from validation context, causing a false rejection.
- Team execution can gather useful evidence in sub-agent runs, but that evidence is not reliably carried into final synthesis or validation. Failed team nodes are especially lossy.
- Team graphs marked `parallel` are currently scheduled through a shared single-consumer `AgentLoop`, so production execution can be effectively serial.
- Final synthesis after a team run still has full tools available, so it can repeat searches instead of synthesizing from team evidence.
- `max_tool_iterations` stops the tool loop with a placeholder message instead of forcing a final answer from already gathered evidence.
- Validation failures enter an open-looking state, which makes the UI feel like the task never completed.

The selected approach is a medium refactor: keep the existing `AgentService`, `TeamService`, and `AgentLoop` structure, but add a structured evidence pipeline, clearer validation semantics, finite team concurrency, no-tools synthesis after team runs, and explicit task states.

## Goals

- Preserve complete run evidence for synthesis and validation.
- Stop using fixed truncation for validation inputs.
- Distinguish "answer is contradicted" from "validator lacks enough evidence".
- Let user feedback be the final business judgment after an answer is shown.
- Make `parallel` team execution actually concurrent within a bounded limit.
- Prevent final synthesis from repeating team tool work by default.
- Produce a useful final answer when tool iteration limits are reached.
- Add enough debug metadata to diagnose validation decisions without reconstructing SQLite logs by hand.

## Non-Goals

- Rewriting the whole execution runtime.
- Introducing a distributed worker pool.
- Building a generic evidence bus for every future subsystem.
- Solving all provider rate-limit and storage concurrency concerns beyond the bounded local concurrency needed for team parallel nodes.

## Validation Semantics and Task States

Automatic validation becomes advisory evidence assessment, not the final user satisfaction signal.

Validation results should include:

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(slots=True)
class ValidationResult:  # class name is illustrative; the fields are the required ones
    status: Literal["accepted", "rejected", "insufficient_evidence", "validator_error"]
    passed: bool
    score: float
    issues: list[str]
    missing_requirements: list[str]
    evidence_gaps: list[str]
    recommended_revision_prompt: str
```

Rules:

- `accepted`: the final answer is supported by available evidence and satisfies the task. The task enters `awaiting_feedback`.
- `insufficient_evidence`: the validator cannot confirm the answer from available evidence. It must not claim fabrication or contradiction. The task enters `needs_review`.
- `validator_error`: the validator failed to produce a reliable decision. The task enters `needs_review`.
- `rejected`: the evidence clearly contradicts the answer, or the answer clearly misses the task. The first attempt can trigger a retry. The last attempt enters `failed` only when there is no usable answer; otherwise it enters `needs_review`.

Task statuses:

- `open`: task exists but has not started.
- `running`: execution is active.
- `validating`: final answer exists and automatic validation is running.
- `awaiting_feedback`: answer is available and automatic validation accepted it.
- `needs_review`: answer is available, but automatic validation could not confirm it or hit a validator error.
- `needs_revision`: user requested revision, or automatic validation rejected an attempt that can still be retried.
- `failed`: execution ended without a usable answer.
- `closed`: user marked the answer satisfied.
- `abandoned`: user abandoned the task.

`needs_review` remains an open status for the active task API, but the UI should distinguish it from `running`. `failed`, `closed`, and `abandoned` are terminal.

User feedback is authoritative:

- `satisfied` closes the task.
- `revise` moves the task to `needs_revision`.
- `abandon` moves the task to `abandoned`.
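
The validation and feedback rules in this section can be sketched as two small transition functions. This is a minimal sketch under stated assumptions, not the actual implementation: the function names and the `is_last_attempt` and `has_usable_answer` parameters are hypothetical, and only the status strings come from this document.

```python
def next_status_after_validation(
    status: str, *, is_last_attempt: bool, has_usable_answer: bool
) -> str:
    """Map a validation status to the task status described in the rules."""
    if status == "accepted":
        return "awaiting_feedback"
    if status in ("insufficient_evidence", "validator_error"):
        # The validator could not decide; a human should look at the answer.
        return "needs_review"
    if status == "rejected":
        if not is_last_attempt:
            # A rejected attempt that can still be retried.
            return "needs_revision"
        # Last attempt: fail only when there is nothing usable to show.
        return "needs_review" if has_usable_answer else "failed"
    raise ValueError(f"unknown validation status: {status!r}")


def next_status_after_feedback(feedback: str) -> str:
    """User feedback is authoritative and overrides automatic validation."""
    transitions = {
        "satisfied": "closed",
        "revise": "needs_revision",
        "abandon": "abandoned",
    }
    if feedback not in transitions:
        raise ValueError(f"unknown feedback: {feedback!r}")
    return transitions[feedback]
```

Keeping both mappings as pure functions would let the status-transition tests run without a store or a provider.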

## Evidence Models

Add structured evidence models in the task or coordinator layer.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any

# NodeRunResult is the existing team node result type; its import is omitted here.


@dataclass(slots=True)
class ToolEvidence:
    tool_name: str
    tool_call_id: str | None
    content: str
    event_payload: dict[str, Any]
    url: str | None = None
    title: str | None = None
    created_at: str | None = None


@dataclass(slots=True)
class RunEvidence:
    run_id: str
    session_id: str
    output_text: str
    finish_reason: str
    transcript: list[dict[str, Any]]
    tool_results: list[ToolEvidence]
    warnings: list[str]


@dataclass(slots=True)
class TaskEvidencePacket:
    task_id: str
    attempt_index: int
    main_run: RunEvidence | None
    team_runs: list[RunEvidence]
    team_node_results: list[NodeRunResult]
    final_output: str
```

`llm_request_snapshotted` events are debug material, not task evidence. They may be referenced in validation debug metadata, but validation should primarily consume transcript, tool results, team node outputs, and final output.

## Evidence Data Flow

1. `AgentLoop` continues to write session events as it does now.
2. After a run completes, an evidence builder reads `session_manager.get_run_event_records(session_id, run_id)` and creates `RunEvidence`.
3. `LocalAgentRunner.run()` attaches `RunEvidence` to `NodeRunResult`.
4. `NodeRunResult` gains `evidence: RunEvidence | None`.
5. `TeamRunResult` carries node evidence through `node_results`; it may also expose a convenience `run_evidence` list.
6. `AgentService._run_task_mode()` builds a `TaskEvidencePacket` after team execution and final synthesis.
7. Final synthesis receives a rendered evidence context built from the same packet.
8. `ValidationService.validate_task_result()` receives the same packet and renders it into the validation prompt without fixed truncation.

Failed or partial nodes must still preserve evidence. A node with `finish_reason="max_tool_iterations"` can be unsuccessful while still carrying useful tool results.

## Final Synthesis Behavior

For team-backed task plans, final synthesis defaults to no tools:

```python
include_tools = False
max_tool_iterations = 0
```

The synthesis prompt should instruct the main agent to:

- use team evidence as the source of truth;
- avoid repeating failed or completed tool calls;
- answer with available evidence;
- clearly state missing or uncertain information.

The planner may explicitly allow a small synthesis tool budget, but the default is no-tools synthesis. If allowed, the budget should be small, such as `max_tool_iterations=1`.

## Tool Iteration Finalization

When a run reaches `max_tool_iterations` and the model still requests tools, the loop should not return `Tool loop stopped...` as the final user-visible answer.

Instead, the loop performs one no-tools finalization call:

- use the accumulated messages and tool results;
- call the provider with `tools=None`;
- add an instruction that the tool budget is exhausted and the model must answer from existing evidence;
- mark the finish reason as `max_tool_iterations_finalized` or another explicit value other than `stop`;
- return the finalization text as `output_text`.

If finalization itself fails or returns empty content, only then use a clear fallback message explaining that the run could not produce a usable answer.

## Limited Parallel Team Execution

`parallel` team nodes should run concurrently without rewriting the runtime.

Design:

- Keep sequence and DAG behavior on the shared loop where appropriate.
- For `parallel` graph batches, run nodes through isolated `AgentLoop` instances.
- Each isolated loop uses the same workspace and service configuration so session and run records remain queryable from the same stores.
- Add `max_parallel_team_nodes`, default `3`.
- Use an `asyncio.Semaphore` in the scheduler to bound concurrent nodes.
- Return `TeamRunResult.node_results` in graph node order, not completion order.

The implementation should check shared store concurrency. If the current store is not safe for concurrent async writes, add a narrow lock around session/task/run store writes used by these parallel runs.
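
The scheduling part of this design can be illustrated with a short sketch. `run_parallel_batch` and `run_node` are hypothetical names; `run_node` stands in for whatever coroutine runs one node on its own isolated `AgentLoop`. The sketch relies on `asyncio.gather` returning results in the order its awaitables were passed, which gives graph order rather than completion order.

```python
import asyncio
from typing import Awaitable, Callable, Sequence, TypeVar

NodeT = TypeVar("NodeT")
ResultT = TypeVar("ResultT")


async def run_parallel_batch(
    nodes: Sequence[NodeT],
    run_node: Callable[[NodeT], Awaitable[ResultT]],
    max_parallel_team_nodes: int = 3,
) -> list[ResultT]:
    """Run nodes concurrently, at most max_parallel_team_nodes at a time."""
    semaphore = asyncio.Semaphore(max_parallel_team_nodes)

    async def run_bounded(node: NodeT) -> ResultT:
        async with semaphore:  # wait for a free slot before starting the node
            return await run_node(node)

    # gather preserves input order, so results follow graph node order.
    return list(await asyncio.gather(*(run_bounded(node) for node in nodes)))
```

If one node's failure should not cancel the rest of the batch, `run_node` would need to catch its own errors and return a failed result, or the call could pass `return_exceptions=True` to `asyncio.gather`.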

## Validation Prompt

The validation prompt should consume the full rendered evidence packet, without `[:2500]`, `[:500]`, or `[:12]` fixed caps.

Required validator instructions:

- Return only JSON with the validation fields.
- If evidence is incomplete, return `insufficient_evidence`.
- Only return `rejected` for clear contradiction or clear task failure.
- Do not infer fabrication from missing evidence.
- Do not claim a source lacks a fact unless the rendered evidence proves that absence.
- Treat user feedback as the final business judgment outside automatic validation.

The validator should still be strict about answer quality when evidence is sufficient.

## Validation Debug Metadata

Each `task_validation_snapshotted` event should record:

- validation result;
- validation status;
- attempt index;
- evidence run ids;
- evidence session ids;
- tool result count;
- evidence character length;
- validator raw response;
- rendered validation input or prompt, unless a debug setting disables full prompt storage.

This makes future investigations direct: inspect the exact input the validator saw before interpreting its decision.
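
As an illustration, the event payload could be assembled by a helper along these lines. All key names, the `store_full_prompt` flag, and the shape of the `runs` argument (dicts with `run_id`, `session_id`, and `tool_results`) are hypothetical. Measuring evidence length as the length of the rendered input is also an assumption.

```python
from typing import Any


def build_validation_debug_payload(
    *,
    validation_result: dict[str, Any],
    attempt_index: int,
    runs: list[dict[str, Any]],
    rendered_input: str,
    raw_response: str,
    store_full_prompt: bool = True,
) -> dict[str, Any]:
    """Assemble the metadata recorded with a validation snapshot event."""
    payload: dict[str, Any] = {
        "validation_result": validation_result,
        "validation_status": validation_result.get("status"),
        "attempt_index": attempt_index,
        "evidence_run_ids": [run["run_id"] for run in runs],
        "evidence_session_ids": [run["session_id"] for run in runs],
        "tool_result_count": sum(len(run.get("tool_results", [])) for run in runs),
        "evidence_char_length": len(rendered_input),
        "validator_raw_response": raw_response,
    }
    if store_full_prompt:
        # Skipped when the debug setting disables full prompt storage.
        payload["rendered_validation_input"] = rendered_input
    return payload
```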

## Log Snapshot Size

`llm_request_snapshotted` currently stores complete messages and complete tool schemas in both payload and content. That makes logs large and slows inspection.

Default behavior should change to store a compact payload:

- iteration;
- provider name and model;
- message count;
- tool names;
- message character length;
- tool schema character length;
- max tokens, temperature, thinking flag.

Full request snapshots should be controlled by a debug config flag. This does not reduce validation evidence because evidence comes from transcript and tool result events.

## Testing Plan

Add or update focused unit tests:

1. Validation evidence is not fixed-truncated. A fact after the first 500 characters of a tool result still appears in the validator input.
2. Missing evidence returns `insufficient_evidence` and moves the task to `needs_review`, not `failed`.
3. A team node that ends with `max_tool_iterations` preserves tool evidence in `NodeRunResult.evidence`.
4. Team final synthesis defaults to `tools=None` and receives rendered team evidence.
5. Parallel team nodes start concurrently under a bounded semaphore and results remain in graph order.
6. Tool loop finalization produces a user-visible answer instead of the placeholder stop message.
7. Status transitions cover `accepted -> awaiting_feedback`, `insufficient_evidence -> needs_review`, `validator_error -> needs_review`, and terminal `failed`.
8. Validation debug events include evidence metadata and validator raw response.
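
Test 1 in the list above could take a shape like the following. The renderer is a stand-in: `render_tool_results` and its input format are hypothetical, and the tool name `web_search` is only an example. The real test would call whatever function renders the evidence packet into the validation prompt.

```python
def render_tool_results(tool_results: list[dict[str, str]]) -> str:
    """Hypothetical renderer: emit every tool result in full, with no slicing."""
    sections = [
        f"### {result['tool_name']}\n{result['content']}" for result in tool_results
    ]
    return "\n\n".join(sections)


def test_fact_after_first_500_characters_is_rendered() -> None:
    late_fact = "The release date is 2024-03-01."
    content = "x" * 600 + late_fact

    rendered = render_tool_results([{"tool_name": "web_search", "content": content}])

    # The fact sits past the old [:500] cap, so a truncating renderer would drop it.
    assert late_fact not in content[:500]
    assert late_fact in rendered
```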

## Migration Notes

To reduce risk, implement in layers:

1. Add evidence models and builders without changing behavior.
2. Attach evidence to team node results.
3. Switch final synthesis for team plans to no-tools evidence synthesis.
4. Switch validation to evidence packets and new statuses.
5. Add no-tools finalization for tool iteration limits.
6. Add limited isolated-loop parallel execution.
7. Slim `llm_request_snapshotted` behind a debug flag.

This order keeps each change testable and lets the old transcript-summary path remain as a temporary fallback while evidence packets are introduced.