<!DOCTYPE html>
<html lang="zh-CN" class="replay-root">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Beaver Skill Replay Eval · Technical Design Overview</title>
<link rel="stylesheet" href="assets/fonts.css">
<link rel="stylesheet" href="assets/base.css">
<link rel="stylesheet" href="assets/animations/animations.css">
<link rel="stylesheet" href="style.css">
</head>
<body class="tpl-beaver-replay">
<div class="deck">

<section class="slide" data-title="Cover">
<p class="kicker">Beaver Project / Skill Learning</p>
<h1 class="h1">Skill Replay Eval<br><span class="gradient-text">From text scoring to behavioral evidence</span></h1>
<p class="lede">Make skill draft evaluation cover task execution, tool calls, side effects, and draft fidelity, rather than only checking whether the generated text "looks like a good skill".</p>
<div class="speaker">
<div class="av"></div>
<div><b>Beaver skill learning evaluation design</b><span>Tech talk · Architecture · UI and testing roadmap</span></div>
</div>
<div class="deck-footer"><span>skill-replay-eval-design.md</span><span class="slide-number" data-current="1" data-total="13"></span></div>
<aside class="notes">
This slide sets the theme. We are not building a more elaborate "scorer"; we are moving skill evaluation from static text judgment to judgment of real task behavior. The core message has three layers. First, skill drafts are replayed against historical tasks. Second, every tool is included in coverage: safe tools are actually executed, while dangerous tools are judged through a surrogate. Third, a revision to an existing skill must preserve its key content, so that regeneration does not drop important instructions.
</aside>
</section>

<section class="slide" data-title="Project Context">
<p class="kicker">system context</p>
<h2 class="h2">Beaver is a multi-instance AI workbench; Skill Replay Eval sits in the single-instance skill learning chain.</h2>
<div class="flow mt-l">
<div class="flow-step"><span class="n">01</span><h4>auth-portal</h4><p>Entry point for user registration, login, and guided model configuration.</p></div>
<div class="flow-step"><span class="n">02</span><h4>authz-service</h4><p>Accounts, backend identity, and internal authorization orchestration.</p></div>
<div class="flow-step"><span class="n">03</span><h4>deploy-control</h4><p>Creates, configures, and manages isolated app-instance containers.</p></div>
<div class="flow-step"><span class="n">04</span><h4>router-proxy</h4><p>Forwards external traffic to the matching container by instance domain.</p></div>
<div class="flow-step card-accent"><span class="n">05</span><h4>app-instance</h4><p>Single-user workbench containing the frontend, backend, workspace, and skills.</p></div>
</div>
<div class="grid g3 mt-l">
<div class="card card-accent"><h4>Skills page</h4><p class="dim">Handles published skills, candidates, and draft review.</p></div>
<div class="card card-accent"><h4>Learning pipeline</h4><p class="dim">Discovers candidates from historical tasks, synthesizes drafts, then applies safety and evaluation gates.</p></div>
<div class="card card-accent"><h4>This design</h4><p class="dim">Strengthens draft evaluation and gives pre-publish review traceable evidence.</p></div>
</div>
<div class="deck-footer"><span>project boundary: app-instance/backend + app-instance/frontend</span><span class="slide-number" data-current="2" data-total="13"></span></div>
<aside class="notes">
Start by placing the project in the overall architecture. Beaver's outer layer is a multi-instance deployment system, and users ultimately land in their own app-instance. Skill learning, draft review, and publish gating all happen inside the app-instance. In other words, Replay Eval is not a feature of the login system or the deployment control plane; it serves the skill lifecycle inside a single-user instance: discover candidates, generate drafts, evaluate drafts, review manually, then publish as a new skill version.
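</aside>
</section>

<section class="slide" data-title="Current Gap">
<p class="kicker">current_state.py</p>
<h2 class="h2">The problem with the old evaluation: it scores the draft text, not the task outcome.</h2>
<div class="compare mt-l">
<div class="side">
<span class="tag bad">heuristic-only</span>
<h3 class="mt-m">Current state</h3>
<ul class="clean mt-m">
<li>Builds only a lightweight report from candidate.source_run_ids.</li>
<li>Historical runs use validation_result.score, or a success fallback.</li>
<li>The candidate score is estimated mainly from the draft text.</li>
<li>No task replay, no tool execution, and no behavioral comparison between the old skill and the new draft.</li>
</ul>
</div>
<div class="side candidate">
<span class="tag good">behavior evidence</span>
<h3 class="mt-m">Target</h3>
<ul class="clean mt-m">
<li>Use historical tasks as eval cases and run two arms: baseline and candidate.</li>
<li>Safe tools are actually executed; dangerous or unavailable tools go to surrogate judgment.</li>
<li>The report discloses execution coverage, surrogate coverage, blocked coverage, and confidence separately.</li>
<li>Revised and merged drafts must show that no original skill content was dropped without justification.</li>
</ul>
</div>
</div>
<div class="deck-footer"><span>risk: high-looking score can still hide tool regressions</span><span class="slide-number" data-current="3" data-total="13"></span></div>
<aside class="notes">
This slide explains why the change is needed. The old logic is not worthless: it can quickly give a draft a rough score. But it cannot answer the most critical pre-publish questions. Does this skill actually make the task go better? Does it miss a tool call? Does it perform an extra dangerous operation? Does it delete safety instructions from the original skill? So the goal of the new design is not to replace every legacy field, but to add replay evidence while staying compatible with the old UI.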
</aside>
</section>

<section class="slide" data-title="Eval Model">
<p class="kicker">evaluation_model</p>
<h2 class="h2">Core model: historical case selection + two-arm replay.</h2>
<div class="split mt-l">
<div class="card">
<h3>Case selection</h3>
<div class="matrix mt-m">
<div class="head">Candidate type</div><div class="head">Selection source</div><div class="head">Priority</div><div class="head">Size</div>
<div class="rowhead">new_skill</div><div>source runs + accepted runs on similar topics</div><div>Similar task topic</div><div>Up to 10</div>
<div class="rowhead">revise_skill</div><div>accepted runs that activated the target skill/version</div><div>Recent first, then spread across tasks</div><div>Up to 10</div>
<div class="rowhead">merge_skills</div><div>accepted runs where the related skills were activated together</div><div>Co-occurrence evidence</div><div>Up to 10</div>
</div>
</div>
<div class="card">
<h3>Two arms</h3>
<div class="pipeline mt-m">
<div class="phase"><span class="tag">same task</span><h3>Identical inputs</h3><p class="dim">Same task text, same historical context, same model settings, same maximum number of tool iterations.</p></div>
<div class="phase"><span class="tag warn">baseline</span><h3>Old behavior</h3><p class="dim">new_skill: no skill; revise_skill: the old skill; merge_skills: the old related skills.</p></div>
<div class="phase"><span class="tag good">candidate</span><h3>New draft</h3><p class="dim">Inject the draft skill as pinned draft guidance and record the behavioral differences.</p></div>
</div>
</div>
</div>
<div class="deck-footer"><span>same inputs, different skill guidance, comparable outputs</span><span class="slide-number" data-current="4" data-total="13"></span></div>
<aside class="notes">
The basic unit of the new evaluation is the historical case. Each case runs two arms: baseline represents the existing capability, and candidate represents the capability with the draft injected. To make the comparison meaningful, the two arms are kept as identical as possible apart from the skill guidance. What we evaluate is then the effect of the draft skill itself on task behavior, not model randomness or differences caused by context.
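
Sketch for discussion, not the actual case_selection.py: one possible reading of the revise_skill row is "sort accepted runs by recency, then round-robin across task topics until the cap of 10 is reached". The Run record, its fields, and both function names are hypothetical; only the cap of 10 and the baseline rules per candidate type come from the slide.

```python
from dataclasses import dataclass

MAX_CASES = 10  # per-candidate cap taken from the slide


@dataclass(frozen=True)
class Run:
    """Hypothetical minimal view of an accepted historical run."""
    run_id: str
    task_topic: str
    finished_at: int  # larger means more recent


def select_revise_cases(runs: list[Run], limit: int = MAX_CASES) -> list[Run]:
    """Recent first, then spread across tasks via round-robin over topics."""
    by_topic: dict[str, list[Run]] = {}
    for run in sorted(runs, key=lambda r: r.finished_at, reverse=True):
        by_topic.setdefault(run.task_topic, []).append(run)

    selected: list[Run] = []
    queues = list(by_topic.values())
    while queues and len(selected) != limit:
        next_round = []
        for queue in queues:
            if len(selected) == limit:
                break
            selected.append(queue.pop(0))  # most recent run left in this topic
            if queue:
                next_round.append(queue)
        queues = next_round
    return selected


def baseline_skills(candidate_type: str, old_skills: list[str]) -> list[str]:
    """Skill guidance for the baseline arm, following the 'Two arms' card."""
    if candidate_type == "new_skill":
        return []  # baseline runs without any skill
    if candidate_type in ("revise_skill", "merge_skills"):
        return list(old_skills)  # old skill, or old related skills
    raise ValueError(f"unknown candidate type: {candidate_type}")
```

Round-robin keeps one busy topic from filling every slot, which is the intent behind "spread across tasks"; a real implementation may weight topics differently.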
</aside>
</section>

<section class="slide" data-title="Tool Modes">
<p class="kicker">tool_policy</p>
<h2 class="h2">Every tool is covered; it just runs in a different execution mode.</h2>
<div class="metric-grid mt-l">
<div class="metric"><span>mode</span><b>executed</b><p class="dim">Tools that can be safely isolated are actually executed in the replay environment.</p></div>
<div class="metric"><span>mode</span><b>surrogate</b><p class="dim">Cannot be executed safely, but the intended call is recorded and an LLM surrogate judges its effect.</p></div>
<div class="metric"><span>mode</span><b>blocked</b><p class="dim">Can neither be executed nor reliably judged; explicitly lowers confidence.</p></div>
<div class="metric"><span>report</span><b>coverage</b><p class="dim">Execution, surrogate, and blocked coverage are disclosed separately.</p></div>
</div>
<div class="matrix mt-l">
<div class="head">Tool type</div><div class="head">Default mode</div><div class="head">Reason</div><div class="head">Examples</div>
<div class="rowhead">workspace read/write</div><div><span class="tag good">executed</span></div><div>Can run inside a temporary workspace clone</div><div>Reading and writing code, generating files, test artifacts</div>
<div class="rowhead">web/search read</div><div><span class="tag good">executed</span></div><div>Read-only, and output can be cached</div><div>Searching, opening web pages, reading public docs</div>
<div class="rowhead">external write</div><div><span class="tag warn">surrogate</span></div><div>Must not write to production third-party systems by default</div><div>Email, calendar, message sending</div>
<div class="rowhead">destructive action</div><div><span class="tag bad">surrogate / blocked</span></div><div>Deletion, payment, and permission changes cannot be run automatically</div><div>Irreversible external writes</div>
</div>
<div class="deck-footer"><span>principle: include third-party tools without production side effects</span><span class="slide-number" data-current="5" data-total="13"></span></div>
<aside class="notes">
An important principle is "do not ignore tools". If we only evaluated safe tools, third-party connectors, send-type tools, and destructive tools would vanish from the report, and the risk would be hidden rather than reduced. The approach here: execute for real whatever can be isolated; for what cannot be isolated, record the intent and have a surrogate assess whether the call is correct, complete, necessary, and risky; if even the surrogate cannot judge, mark the call blocked explicitly so that the report and the publish gate know confidence is insufficient.
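
Sketch for discussion: the table and the fallback rule above could be expressed as a small classifier. The category keys, the ToolMode enum, and the surrogate_can_judge flag are hypothetical names; treating an unknown category as surrogate is also an assumption, not something the design states.

```python
from enum import Enum


class ToolMode(str, Enum):
    EXECUTED = "executed"
    SURROGATE = "surrogate"
    BLOCKED = "blocked"


# Hypothetical category keys mirroring the rows of the slide's table.
DEFAULT_MODE = {
    "workspace_read_write": ToolMode.EXECUTED,
    "web_search_read": ToolMode.EXECUTED,
    "external_write": ToolMode.SURROGATE,
    "destructive_action": ToolMode.SURROGATE,
}


def classify_tool(category: str, surrogate_can_judge: bool = True) -> ToolMode:
    """Pick the replay mode for one tool call.

    Isolatable tools are executed. Everything else is judged by the
    surrogate, and falls through to blocked when the surrogate cannot
    judge it reliably.
    """
    # Assumption: a category we do not know is never executed for real.
    mode = DEFAULT_MODE.get(category, ToolMode.SURROGATE)
    if mode is ToolMode.SURROGATE and not surrogate_can_judge:
        return ToolMode.BLOCKED
    return mode
```

Because no branch returns "ignored", every call lands in exactly one of the three modes, which is what lets the report disclose coverage per mode.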
</aside>
</section>

<section class="slide" data-title="Replay Environment">
<p class="kicker">replay_environment</p>
<h2 class="h2">Isolated replay environment: replays produce evidence without polluting the real world.</h2>
<div class="split mt-l">
<div>
<p class="lede">Every case and every arm gets its own isolated state. The runner executes, intercepts, and records; the evaluator reads artifacts and side effects from outside the runner.</p>
<div class="grid g2 mt-l">
<div class="panel"><span class="tag">session</span><h4 class="mt-s">Temporary session id</h4><p class="dim">Avoids polluting real session state.</p></div>
<div class="panel"><span class="tag">workspace</span><h4 class="mt-s">Temporary workspace root</h4><p class="dim">File reads and writes land in a temporary clone.</p></div>
<div class="panel"><span class="tag">trace</span><h4 class="mt-s">Tool call trace</h4><p class="dim">Records calls, arguments, modes, and reasons.</p></div>
<div class="panel"><span class="tag">journal</span><h4 class="mt-s">Side-effect journal</h4><p class="dim">External writes never reach production; only evidence is kept.</p></div>
</div>
</div>
<div class="terminal">
<div class="bar"><span class="dot"></span><span class="dot"></span><span class="dot"></span><span>replay_case.json</span></div>
<pre>{
  <span class="str">"case_id"</span>: <span class="str">"run-accepted-042"</span>,
  <span class="str">"arm"</span>: <span class="str">"candidate"</span>,
  <span class="str">"workspace_root"</span>: <span class="str">"/tmp/beaver-replay/..."</span>,
  <span class="str">"tool_calls"</span>: [
    { <span class="str">"tool"</span>: <span class="str">"filesystem.write"</span>,
      <span class="str">"mode"</span>: <span class="str">"executed"</span> },
    { <span class="str">"tool"</span>: <span class="str">"outlook.send_mail"</span>,
      <span class="str">"mode"</span>: <span class="str">"surrogate"</span> }
  ],
  <span class="str">"artifacts"</span>: [<span class="str">"draft.md"</span>],
  <span class="str">"side_effects"</span>: [<span class="str">"intended_email"</span>]
}</pre>
</div>
</div>
<div class="deck-footer"><span>pattern: OfficeBench-style isolated testbed, adapted to Beaver skills</span><span class="slide-number" data-current="6" data-total="13"></span></div>
<aside class="notes">
This can be summed up in one sentence: the runner is the sandboxed executor, and the evaluator is the judge of evidence. Replay is never allowed to write directly to the user's workspace, user files, long-term memory, third-party accounts, or external systems. Each arm has its own temporary workspace, tool trace, output artifacts, and side-effect journal. The shape borrows from the OfficeBench MCP approach, but it is not tied to fixed benchmark functions; it serves Beaver's historical tasks.
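
Sketch for discussion, not the actual replay.py: per-arm isolation can be approximated with a temporary clone of the workspace plus two in-memory logs. ReplayArmEnv and its methods are hypothetical names, and storing side effects as dictionaries is an assumption; only the separation into workspace, tool trace, and side-effect journal comes from the slide.

```python
import shutil
import tempfile
from pathlib import Path


class ReplayArmEnv:
    """Isolated state for one case and one arm (hypothetical helper)."""

    def __init__(self, case_id: str, arm: str, source_workspace: Path) -> None:
        self.case_id = case_id
        self.arm = arm
        self.root = Path(tempfile.mkdtemp(prefix="beaver-replay-"))
        self.workspace_root = self.root / "workspace"
        # Clone the workspace so file writes never touch the user's copy.
        shutil.copytree(source_workspace, self.workspace_root)
        self.tool_calls: list[dict] = []
        self.side_effects: list[dict] = []

    def write_file(self, relative_path: str, content: str) -> None:
        """Executed mode: really write, but only inside the clone."""
        base = self.workspace_root.resolve()
        target = (base / relative_path).resolve()
        if not target.is_relative_to(base):  # Python 3.9+
            raise ValueError("write escapes the replay workspace")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
        self.tool_calls.append({"tool": "filesystem.write", "mode": "executed"})

    def intend_external_write(self, tool: str, arguments: dict) -> None:
        """Surrogate mode: record the intended call, perform nothing."""
        self.tool_calls.append({"tool": tool, "mode": "surrogate"})
        self.side_effects.append({"tool": tool, "arguments": arguments})

    def cleanup(self) -> None:
        shutil.rmtree(self.root, ignore_errors=True)
```

The evaluator would read tool_calls, side_effects, and the files under workspace_root after the run, which keeps judging outside the runner.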
</aside>
</section>

<section class="slide" data-title="Surrogate Evaluation">
<p class="kicker">surrogate_evaluator</p>
<h2 class="h2">Surrogate evaluation does not skip the tool; it evaluates the intended effect.</h2>
<div class="split mt-l">
<div class="card">
<h3>Input payload</h3>
<ul class="clean mt-m">
<li>Tool name and schema.</li>
<li>The arguments the candidate arm planned to call with.</li>
<li>Why the tool was classified as surrogate.</li>
<li>Historical accepted evidence and the task's expected side effects.</li>
<li>The assistant's rationale before and after the call.</li>
</ul>
</div>
<div class="card card-accent">
<h3>Judgment dimensions</h3>
<ul class="clean mt-m">
<li>Whether the call satisfies the task goal.</li>
<li>Whether the arguments are complete, correct, and executable.</li>
<li>Whether any call is missing, duplicated, or unnecessary.</li>
<li>Whether there is overreach, danger, or irreversible risk.</li>
<li>Whether the candidate genuinely improves on the baseline.</li>
</ul>
</div>
</div>
<div class="terminal mt-l">
<div class="bar"><span class="dot"></span><span class="dot"></span><span class="dot"></span><span>surrogate_score.py</span></div>
<pre>candidate_score = quality(intended_effect, arguments, evidence)
risk_penalty = risk(missing_args, duplicated_calls, unsafe_scope)
confidence = lower_than_real_execution

case_score = candidate_score - risk_penalty</pre>
</div>
<div class="deck-footer"><span>surrogate contributes to score, but lowers confidence</span><span class="slide-number" data-current="7" data-total="13"></span></div>
<aside class="notes">
Surrogate evaluation has to avoid two misunderstandings. First, it does not treat a dangerous tool as a success; it takes "how the model intended to call the tool" and puts that under review. Second, its confidence is inherently lower than real execution. For example, we cannot actually send an email, but we can judge whether the recipients, subject, body, attachments, and send timing fit the task, and whether the email is sent twice, contains sensitive information, or oversteps authorization. These judgments feed into the score, but they lower confidence.
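
Sketch for discussion: a runnable reading of the surrogate_score.py pseudocode on the slide. The SurrogateJudgment fields, the penalty weights, the clamp at zero, and the "one level lower" confidence rule are all assumptions made for illustration; the slide only fixes the shape case_score = candidate_score - risk_penalty and the fact that surrogate confidence sits below real execution.

```python
from dataclasses import dataclass

LEVELS = ("low", "medium", "high")
PENALTY_PER_ISSUE = 0.1  # assumption: weight per missing argument or duplicate call
UNSAFE_PENALTY = 0.5     # assumption: weight for an out-of-scope or unsafe call


@dataclass(frozen=True)
class SurrogateJudgment:
    """What an LLM surrogate might report for one intended call (hypothetical)."""
    quality: float         # 0..1, how well the intended effect fits the task
    missing_args: int
    duplicated_calls: int
    unsafe_scope: bool


def risk_penalty(judgment: SurrogateJudgment) -> float:
    penalty = PENALTY_PER_ISSUE * (judgment.missing_args + judgment.duplicated_calls)
    if judgment.unsafe_scope:
        penalty += UNSAFE_PENALTY
    return penalty


def case_score(judgment: SurrogateJudgment) -> float:
    """candidate_score - risk_penalty, clamped at zero (the clamp is an assumption)."""
    return max(0.0, judgment.quality - risk_penalty(judgment))


def surrogate_confidence(real_execution_confidence: str) -> str:
    """One level below what real execution would give; 'low' stays 'low'."""
    index = LEVELS.index(real_execution_confidence)
    return LEVELS[max(0, index - 1)]
```

Keeping score and confidence as separate outputs mirrors the slide footer: the surrogate contributes to the score, but it lowers confidence.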
</aside>
</section>

<section class="slide" data-title="Scoring And Gates">
<p class="kicker">report_and_gates</p>
<h2 class="h2">The report stays compatible with the old UI while adding replay evidence and publish gate signals.</h2>
<div class="metric-grid mt-l">
<div class="metric"><span>legacy</span><b>passed</b><p class="dim">Keeps legacy fields such as passed, score_delta, cases, and status.</p></div>
<div class="metric"><span>coverage</span><b>3 modes</b><p class="dim">execution, surrogate, and blocked coverage.</p></div>
<div class="metric"><span>quality</span><b>delta</b><p class="dim">baseline mean, candidate mean, improved/regression count.</p></div>
<div class="metric"><span>trust</span><b>confidence</b><p class="dim">Low confidence never automatically counts as publishable.</p></div>
</div>
<div class="split mt-l">
<div class="terminal">
<div class="bar"><span class="dot"></span><span class="dot"></span><span class="dot"></span><span>SkillDraftEvalReport</span></div>
<pre>{
  <span class="str">"eval_version"</span>: <span class="str">"replay-v1"</span>,
  <span class="str">"mode"</span>: <span class="str">"replay"</span>,
  <span class="str">"execution_coverage"</span>: <span class="num">0.58</span>,
  <span class="str">"surrogate_coverage"</span>: <span class="num">0.33</span>,
  <span class="str">"blocked_coverage"</span>: <span class="num">0.09</span>,
  <span class="str">"confidence"</span>: <span class="str">"medium"</span>,
  <span class="str">"case_reports"</span>: [...],
  <span class="str">"tool_mode_summary"</span>: {...}
}</pre>
</div>
<div class="card card-accent">
<h3>Publish gate logic</h3>
<ul class="clean mt-m">
<li>High score but low confidence: goes to stronger human review.</li>
<li>All important tools blocked: automatic pass is not allowed.</li>
<li>Some cases fail: status = partial, and confidence is lowered.</li>
<li>Replay infrastructure fails: status = replay_error.</li>
<li>When the provider is unavailable, keep the skipped-provider behavior but state clearly that there is no replay evidence.</li>
</ul>
</div>
</div>
<div class="deck-footer"><span>score answers quality, confidence answers how much we can trust it</span><span class="slide-number" data-current="8" data-total="13"></span></div>
<aside class="notes">
The key to the report design is compatibility, not starting over. Legacy fields are kept, so the UI and storage are not forced into a one-shot migration. The new fields answer three questions: how many tool calls did we execute, how many could only be judged by the surrogate, and how many were blocked? Finally, confidence takes part in the publish gate. That is, even if a draft's score looks good, it should not pass automatically when the evidence comes mainly from the surrogate or when the key tools are all blocked.
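
Sketch for discussion, not the actual pipeline.py: the three coverage numbers and the gate bullets could be combined roughly as below. The function names, the "completed" and "skipped_provider" status strings, the positive-delta threshold, and the one-level confidence drop for partial runs are assumptions; the statuses partial and replay_error and the gate rules themselves come from the slide.

```python
from collections import Counter

LEVELS = ("low", "medium", "high")


def coverage(tool_modes: list[str]) -> dict[str, float]:
    """Share of tool calls per mode; all zeros when there were no calls."""
    total = len(tool_modes)
    counts = Counter(tool_modes)

    def share(mode: str) -> float:
        return counts[mode] / total if total else 0.0

    return {
        "execution_coverage": share("executed"),
        "surrogate_coverage": share("surrogate"),
        "blocked_coverage": share("blocked"),
    }


def can_auto_pass(
    score_delta: float,
    confidence: str,
    status: str,
    important_tools_all_blocked: bool,
) -> bool:
    """True only when the draft may pass without stronger human review."""
    if status in ("replay_error", "skipped_provider"):
        return False  # no replay evidence at all
    if important_tools_all_blocked:
        return False  # key tools were never executed or judged
    if status == "partial":
        # Some cases failed: lower confidence by one level (assumption).
        confidence = LEVELS[max(0, LEVELS.index(confidence) - 1)]
    if confidence == "low":
        return False  # a high score with low confidence still needs review
    return score_delta > 0  # assumption: any positive delta counts as better
```

Returning a boolean keeps the sketch short; a real gate would more likely return a reason alongside the decision so the review page can explain it.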
</aside>
</section>

<section class="slide" data-title="Draft Preservation">
<p class="kicker">preservation_contract</p>
<h2 class="h2">A revised draft must preserve the original skill content unless it states an explicit reason for the change.</h2>
<div class="split mt-l">
<div class="card">
<h3>Why it is needed</h3>
<p class="lede mt-m">For revision and merge, the current synthesizer receives only the candidate rationale, related skill names, tool names, task summaries, and session excerpts, not the full base skill body.</p>
<div class="panel mt-m"><span class="tag bad">failure mode</span><p class="dim mt-s">A regenerated skill can look cleaner yet drop the original skill's safety boundaries, tool constraints, workflow ordering, and error handling.</p></div>
</div>
<div class="card card-accent">
<h3>New contract</h3>
<ul class="clean mt-m">
<li>Pass the base skill name, version, full frontmatter, and full content into the synthesis prompt.</li>
<li>Require the model to preserve existing instructions by default and change them only with an explicit reason.</li>
<li>Output preserved_sections, changed_sections, dropped_sections, and change_reason.</li>
<li>The preservation checker compares base and draft; unexplained loss of key content is flagged as a risk.</li>
</ul>
</div>
</div>
<div class="deck-footer"><span>revision is an edit contract, not a fresh rewrite</span><span class="slide-number" data-current="9" data-total="13"></span></div>
<aside class="notes">
This slide is the second main thread: draft fidelity. Replay evaluation answers "is the behavior better?", and preservation answers "was anything important lost?". For revised and merged skills, the model must see the full base skill, not just a summary. The output is also more than a new body: it has to explain what was preserved, what changed, what was removed, and why. That lets reviewers tell a reasonable revision from an accidental loss.
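
Sketch for discussion, not the actual preservation.py: a checker could compare base and draft section by section. This assumes skill bodies are Markdown with "#" headings and that change_reason maps a section title to its justification; both are assumptions, as are the function names and the unexplained_drops key. The preserved_sections, changed_sections, and dropped_sections keys come from the slide.

```python
import re


def split_sections(markdown: str) -> dict[str, str]:
    """Map each Markdown heading to the body text beneath it."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            current = match.group(1).strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {title: "\n".join(body).strip() for title, body in sections.items()}


def preservation_report(base: str, draft: str, change_reason: dict[str, str]) -> dict:
    """Classify every base section as preserved, changed, or dropped."""
    base_sections = split_sections(base)
    draft_sections = split_sections(draft)
    report = {
        "preserved_sections": [],
        "changed_sections": [],
        "dropped_sections": [],
        "unexplained_drops": [],  # hypothetical key: drops with no stated reason
    }
    for title, body in base_sections.items():
        if title not in draft_sections:
            report["dropped_sections"].append(title)
            if not change_reason.get(title):
                report["unexplained_drops"].append(title)
        elif draft_sections[title] == body:
            report["preserved_sections"].append(title)
        else:
            report["changed_sections"].append(title)
    return report
```

Exact string comparison is deliberately strict; a real checker might normalize whitespace or weigh safety-related sections more heavily before flagging a risk.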
</aside>
</section>

<section class="slide" data-title="Implementation">
<p class="kicker">implementation_plan</p>
<h2 class="h2">Implementation path: extend the existing skill learning pipeline with small new modules instead of rewriting the system.</h2>
<div class="pipeline mt-l">
<div class="phase card-accent">
<span class="tag">backend model</span>
<h3>Compatible data layer</h3>
<ul class="clean">
<li>Extend SkillDraftEvalReport.</li>
<li>Add default values for the new replay fields.</li>
<li>Keep the ability to read old payloads.</li>
</ul>
</div>
<div class="phase card-accent">
<span class="tag">learning helpers</span>
<h3>Split evaluation capabilities</h3>
<ul class="clean">
<li>case_selection.py</li>
<li>preservation.py</li>
<li>replay.py</li>
<li>surrogate.py</li>
</ul>
</div>
<div class="phase card-accent">
<span class="tag">pipeline + ui</span>
<h3>Integration and display</h3>
<ul class="clean">
<li>eval.py orchestrates replay.</li>
<li>pipeline.py updates the publish gate.</li>
<li>loop.py supports a replay executor override.</li>
<li>The Skills UI shows coverage, cases, and preservation.</li>
</ul>
</div>
</div>
<div class="terminal mt-l">
<div class="bar"><span class="dot"></span><span class="dot"></span><span class="dot"></span><span>target files</span></div>
<pre>app-instance/backend/beaver/skills/learning/
  case_selection.py
  preservation.py
  replay.py
  surrogate.py
  eval.py
  synthesizer.py
  pipeline.py

app-instance/frontend/app/(app)/skills/page.tsx
app-instance/frontend/types/index.ts</pre>
</div>
<div class="deck-footer"><span>architecture: focused helpers under existing SkillLearningPipelineService.evaluate_draft()</span><span class="slide-number" data-current="10" data-total="13"></span></div>
<aside class="notes">
The implementation strategy is conservative: extend the existing skill learning pipeline rather than introduce a parallel system. The data model is extended compatibly first; then case selection, preservation, replay, and surrogate are split into small modules. eval.py handles orchestration, pipeline.py handles the publish gate, and loop.py gives the replay executor an injection point. The frontend only adds evidence display to the draft review page and keeps the existing summary entry point.
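
Sketch for discussion: "new replay fields with defaults, old payloads still readable" can be illustrated with a plain dataclass. ReplayEvalReportSketch is a hypothetical stand-in, not the real SkillDraftEvalReport, whose base class and field types are not shown in this deck; the "heuristic" defaults are also assumptions. Field names are the ones listed on the Scoring And Gates slide.

```python
from dataclasses import dataclass, field, fields
from typing import Any


@dataclass
class ReplayEvalReportSketch:
    # Legacy fields named on the slides.
    passed: bool = False
    score_delta: float = 0.0
    cases: list[Any] = field(default_factory=list)
    status: str = ""
    # Replay fields; the defaults let payloads written before replay load cleanly.
    eval_version: str = "heuristic"  # assumption: label for pre-replay reports
    mode: str = "heuristic"          # assumption
    execution_coverage: float = 0.0
    surrogate_coverage: float = 0.0
    blocked_coverage: float = 0.0
    confidence: str = "low"
    case_reports: list[Any] = field(default_factory=list)
    tool_mode_summary: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_payload(cls, payload: dict[str, Any]) -> "ReplayEvalReportSketch":
        """Load old or new payloads; keys this class does not know are ignored."""
        known = {f.name for f in fields(cls)}
        return cls(**{key: value for key, value in payload.items() if key in known})
```

Defaulting confidence to "low" means a legacy report can never look more trustworthy than one backed by replay evidence.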
</aside>
</section>

<section class="slide" data-title="UI">
<p class="kicker">skills_review_ui</p>
<h2 class="h2">The review page leads with the verdict, then lets reviewers expand the evidence.</h2>
<div class="split mt-l">
<div class="card card-accent">
<h3>Concise summary</h3>
<div class="metric-grid mt-m" style="grid-template-columns: repeat(2, 1fr);">
<div class="metric"><span>pass</span><b>failed</b><p class="dim">Publish gate verdict.</p></div>
<div class="metric"><span>delta</span><b>+0.18</b><p class="dim">candidate - baseline.</p></div>
<div class="metric"><span>exec</span><b>58%</b><p class="dim">Real execution coverage.</p></div>
<div class="metric"><span>conf</span><b>medium</b><p class="dim">Evidence trustworthiness.</p></div>
</div>
</div>
<div class="card">
<h3>Detailed evidence</h3>
<ul class="clean mt-m">
<li>Replay cases: baseline/candidate scores for each historical task.</li>
<li>Tool calls by mode: executed, surrogate, and blocked classification with reasons.</li>
<li>Artifacts and side effects: outputs and the side-effect journal.</li>
<li>Preservation report: preserved, changed, and dropped-content risks for revised drafts.</li>
<li>Raw eval payload: the full JSON is kept for deep debugging.</li>
</ul>
</div>
</div>
<div class="panel mt-l">
<span class="tag good">UX principle</span>
<p class="lede mt-s">Users do not need to configure a policy for every tool in advance. The report explains coverage, blocking, and uncertainty so reviewers know how much to trust the result.</p>
</div>
<div class="deck-footer"><span>page: app-instance/frontend/app/(app)/skills/page.tsx</span><span class="slide-number" data-current="11" data-total="13"></span></div>
<aside class="notes">
The goal of the UI is not to make the review page more complex but to give complex evidence a hierarchy. The top still shows pass/fail, baseline mean, candidate mean, delta, coverage, and confidence. Only when reviewers need detail do they expand replay cases, tool modes, blocking reasons, artifacts, side effects, and the preservation report. Ordinary users do not need to understand each tool policy; the system explains the uncertainty after evaluation.
</aside>
</section>

<section class="slide" data-title="Testing And Next Steps">
<p class="kicker">testing_strategy</p>
<h2 class="h2">Acceptance criteria: cover the decision logic and realistic tool shapes.</h2>
<div class="roadmap mt-l">
<div class="item"><span>01</span><b>Model compatibility</b><p>Old payloads stay readable, new replay fields are writable, and legacy data does not break the UI.</p></div>
<div class="item"><span>02</span><b>Core logic</b><p>case selection, arm construction, tool mode aggregation, surrogate payload.</p></div>
<div class="item"><span>03</span><b>Preservation checks</b><p>Base section preservation, deletion risk, and how the publish gate handles preservation risk.</p></div>
<div class="item"><span>04</span><b>Mixed tools</b><p>A safe file write is actually executed, an external write is intercepted as surrogate, and an auditable report is produced.</p></div>
</div>
<div class="split mt-l">
<div class="card card-accent"><h3>Unit tests</h3><p class="dim">Historical case selection, two-arm construction, tool mode classification, surrogate scoring payload, preservation checks, and the low-confidence publish gate.</p></div>
<div class="card card-accent"><h3>Integration-style tests</h3><p class="dim">Stub filesystem write, stub external write, and a candidate that improves both the real artifact and the surrogate side effect.</p></div>
</div>
<div class="deck-footer"><span>next: implement tasks 1-12 from the current plan</span><span class="slide-number" data-current="12" data-total="13"></span></div>
<aside class="notes">
This slide wraps up with the acceptance criteria. The tests go beyond the happy path and focus on risk boundaries: model compatibility, case selection, two-arm construction, tool mode aggregation, surrogate payload, preservation checks, the publish gate, and mixed-tool scenarios. We need to show that this evaluation not only produces a report but also handles safe tools, external writes, and low-confidence boundaries correctly.
</aside>
</section>

<section class="slide center tc" data-title="Questions">
<div>
<div class="center-mark">Q</div>
<h2 class="h2 mt-m">Questions</h2>
<p class="lede" style="margin-left:auto;margin-right:auto;">Before publishing a skill, it is not enough to ask whether the draft is well written; we also need to see what it did in historical tasks, what it did not do, and how far the result can be trusted.</p>
<div class="row mt-l" style="justify-content:center">
<span class="tag good">behavior evidence</span>
<span class="tag">tool coverage</span>
<span class="tag warn">confidence gates</span>
<span class="tag">draft preservation</span>
</div>
</div>
<div class="deck-footer"><span>Beaver Skill Replay Eval</span><span class="slide-number" data-current="13" data-total="13"></span></div>
<aside class="notes">
The last slide keeps only the core judgment, to close the talk and move into Q&amp;A. One sentence to end on: Replay Eval turns skill publishing from "trusting the generated result" into "reviewing behavioral evidence". Then invite questions on tool policy, the isolated environment, surrogate judgment, the publish gate, or the UI.
</aside>
</section>

</div>
<script src="assets/runtime.js"></script>
</body>
</html>