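feat: support multilingual prompt localization and UI polish

- Add a prompt_locale parameter that localizes prompts into Simplified Chinese, Traditional Chinese, and English
- Remove the built-in agents configuration to simplify the system architecture
- Update ContextBuilder to use dynamic prompt templates instead of hard-coded content
- Pass the locale parameter through AgentLoop, the web interface, and AgentService
- Add an output-language instruction so user-facing content is generated in the requested language
- Extend the frontend LanguageSwitcher component to offer three language options
- Improve responsive layout and text truncation in the Header and sidebar components
- Update test cases to verify that prompts are correct under each locale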
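
The prompt_locale fallback behaviour can be sketched roughly as follows. This is a minimal, standalone sketch that mirrors the alias table in beaver/prompts/main_agent.py from this diff; the names `normalize_locale`, `DEFAULT_LOCALE`, `SUPPORTED_LOCALES`, and `LOCALE_ALIASES` are shortened for illustration, and loading the prompt files is left out.

```python
"""Sketch of prompt-locale normalization, mirroring main_agent.py in this change."""

from __future__ import annotations

DEFAULT_LOCALE = "zh-Hans"
SUPPORTED_LOCALES = ("zh-Hans", "zh-Hant", "en")

# Lower-cased aliases mapped to canonical locales (values taken from the diff).
LOCALE_ALIASES = {
    "zh": "zh-Hans",
    "zh-cn": "zh-Hans",
    "zh-hans": "zh-Hans",
    "zh-sg": "zh-Hans",
    "zh-hant": "zh-Hant",
    "zh-tw": "zh-Hant",
    "zh-hk": "zh-Hant",
    "zh-mo": "zh-Hant",
    "en": "en",
    "en-us": "en",
    "en-gb": "en",
}


def normalize_locale(locale: str | None = None) -> str:
    """Map a raw locale string to one of the supported prompt locales."""
    cleaned = (locale or DEFAULT_LOCALE).strip()
    if not cleaned:
        # Whitespace-only input falls back to the default.
        return DEFAULT_LOCALE
    canonical = LOCALE_ALIASES.get(cleaned.lower())
    if canonical:
        return canonical
    # Unknown values fall back to the default rather than raising.
    return cleaned if cleaned in SUPPORTED_LOCALES else DEFAULT_LOCALE


if __name__ == "__main__":
    for raw in (None, "EN-us", "zh-TW", "   ", "fr", "en_US"):
        print(repr(raw), "->", normalize_locale(raw))
```

Under this logic an underscore form such as en_US is not a registered alias, so it resolves to the default locale rather than to en; callers that receive such values would likely need to convert them to the hyphenated form first.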
```
2026-06-10 16:11:05 +08:00 · 2026-06-09 14:23:37 +08:00 · 2026-06-09 13:19:09 +08:00 · 2026-06-08 13:38:10 +08:00 · 2026-06-08 13:36:55 +08:00 · 2026-06-08 13:35:58 +08:00
85 changed files with 13190 additions and 285 deletions
--- a/app-instance/backend/agents/registry.json
+++ b/app-instance/backend/agents/registry.json
@@ -1,145 +1,4 @@
 {
-  "agents": [
-    {
-      "agent_id": "researcher",
-      "capabilities": [
-        "research",
-        "analysis",
-        "source review",
-        "requirements"
-      ],
-      "created_at": "2026-05-27T05:25:11.756341+00:00",
-      "description": "Finds facts, references, constraints, and implementation options.",
-      "display_name": "Researcher",
-      "metadata": {},
-      "model": null,
-      "name": "researcher",
-      "priority": 50,
-      "provider_name": null,
-      "role": "research",
-      "skill_names": [],
-      "source": "builtin",
-      "status": "active",
-      "system_prompt": "You are a research specialist. Gather concise evidence and tradeoffs for the parent task.",
-      "tags": [
-        "planning",
-        "research"
-      ],
-      "tool_hints": [],
-      "updated_at": "2026-05-27T05:25:11.756349+00:00"
-    },
-    {
-      "agent_id": "implementer",
-      "capabilities": [
-        "implementation",
-        "coding",
-        "refactor",
-        "integration"
-      ],
-      "created_at": "2026-05-27T05:25:11.756351+00:00",
-      "description": "Builds scoped implementation slices and proposes concrete changes.",
-      "display_name": "Implementer",
-      "metadata": {},
-      "model": null,
-      "name": "implementer",
-      "priority": 45,
-      "provider_name": null,
-      "role": "implementation",
-      "skill_names": [],
-      "source": "builtin",
-      "status": "active",
-      "system_prompt": "You are an implementation specialist. Produce practical, scoped implementation output.",
-      "tags": [
-        "coding",
-        "build"
-      ],
-      "tool_hints": [],
-      "updated_at": "2026-05-27T05:25:11.756353+00:00"
-    },
-    {
-      "agent_id": "reviewer",
-      "capabilities": [
-        "review",
-        "quality",
-        "risk",
-        "verification"
-      ],
-      "created_at": "2026-05-27T05:25:11.756355+00:00",
-      "description": "Reviews plans, code, outputs, and risks before final synthesis.",
-      "display_name": "Reviewer",
-      "metadata": {},
-      "model": null,
-      "name": "reviewer",
-      "priority": 45,
-      "provider_name": null,
-      "role": "review",
-      "skill_names": [],
-      "source": "builtin",
-      "status": "active",
-      "system_prompt": "You are a review specialist. Focus on defects, missing requirements, and risks.",
-      "tags": [
-        "review",
-        "quality"
-      ],
-      "tool_hints": [],
-      "updated_at": "2026-05-27T05:25:11.756356+00:00"
-    },
-    {
-      "agent_id": "tester",
-      "capabilities": [
-        "testing",
-        "verification",
-        "regression",
-        "qa"
-      ],
-      "created_at": "2026-05-27T05:25:11.756358+00:00",
-      "description": "Designs and executes verification checks for task outputs.",
-      "display_name": "Tester",
-      "metadata": {},
-      "model": null,
-      "name": "tester",
-      "priority": 40,
-      "provider_name": null,
-      "role": "testing",
-      "skill_names": [],
-      "source": "builtin",
-      "status": "active",
-      "system_prompt": "You are a testing specialist. Identify focused checks and report pass/fail evidence.",
-      "tags": [
-        "test",
-        "quality"
-      ],
-      "tool_hints": [],
-      "updated_at": "2026-05-27T05:25:11.756358+00:00"
-    },
-    {
-      "agent_id": "documenter",
-      "capabilities": [
-        "documentation",
-        "explanation",
-        "migration notes",
-        "release notes"
-      ],
-      "created_at": "2026-05-27T05:25:11.756360+00:00",
-      "description": "Writes and reconciles user-facing and internal documentation updates.",
-      "display_name": "Documenter",
-      "metadata": {},
-      "model": null,
-      "name": "documenter",
-      "priority": 35,
-      "provider_name": null,
-      "role": "documentation",
-      "skill_names": [],
-      "source": "builtin",
-      "status": "active",
-      "system_prompt": "You are a documentation specialist. Produce concise docs aligned with the implementation.",
-      "tags": [
-        "docs",
-        "communication"
-      ],
-      "tool_hints": [],
-      "updated_at": "2026-05-27T05:25:11.756360+00:00"
-    }
-  ],
+  "agents": [],
  "version": 1
 }
--- a/app-instance/backend/beaver/engine/context/builder.py
+++ b/app-instance/backend/beaver/engine/context/builder.py
@@ -27,13 +27,7 @@ from dataclasses import dataclass, field
 from typing import Any

 from beaver.memory.curated.snapshot import MemorySnapshot
-
-
-BEAVER_USER_ASSISTANT_IDENTITY_PROMPT = (
-    "You are 海狸 (Beaver), an AI assistant developed by 博维资讯系统有限公司. "
-    "When communicating with users, keep this identity consistent. "
-    "If users ask who you are, say that you are 海狸 (Beaver), 博维资讯系统有限公司研发的 AI 助手."
-)
+from beaver.prompts import get_main_agent_prompt


 @dataclass(slots=True)
@@ -113,6 +107,7 @@ class ContextBuildInput:
    """

    base_system_prompt: str = ""
+    prompt_locale: str | None = None
    history: list[dict[str, Any]] = field(default_factory=list)
    current_user_input: str | list[dict[str, Any]] | None = None
    memory_snapshot: MemorySnapshot | None = None
@@ -171,7 +166,7 @@ class ContextBuilder:
        - activated skill 正文放到显式消息里，避免 system prompt 持续膨胀
        """

-        sections: list[str] = [BEAVER_USER_ASSISTANT_IDENTITY_PROMPT]
+        sections: list[str] = [get_main_agent_prompt(build_input.prompt_locale)]

        base_system_prompt = (build_input.base_system_prompt or "").strip()
        if base_system_prompt:
--- a/app-instance/backend/beaver/engine/loop.py
+++ b/app-instance/backend/beaver/engine/loop.py
@@ -224,6 +224,7 @@ class AgentLoop:
        title: str | None = None,
        execution_context: str | None = None,
        skill_selection_context: str | None = None,
+        prompt_locale: str | None = None,
        model: str | None = None,
        provider_name: str | None = None,
        api_key: str | None = None,
@@ -247,6 +248,7 @@ class AgentLoop:
        attempt_index: int | None = None,
        pinned_skill_names: list[str] | None = None,
        pinned_skill_contexts: list[SkillContext] | None = None,
+        tool_executor_override: Any = None,
        allow_candidate_generation: bool = False,
        intent_agent_decision: dict[str, Any] | None = None,
        channel_identity: ChannelIdentity | None = None,
@@ -274,6 +276,7 @@ class AgentLoop:
            title=title,
            execution_context=execution_context,
            skill_selection_context=skill_selection_context,
+            prompt_locale=prompt_locale,
            model=model,
            provider_name=provider_name,
            api_key=api_key,
@@ -297,6 +300,7 @@ class AgentLoop:
            attempt_index=attempt_index,
            pinned_skill_names=pinned_skill_names,
            pinned_skill_contexts=pinned_skill_contexts,
+            tool_executor_override=tool_executor_override,
            allow_candidate_generation=allow_candidate_generation,
            intent_agent_decision=intent_agent_decision,
            channel_identity=channel_identity,
@@ -312,6 +316,7 @@ class AgentLoop:
        title: str | None = None,
        execution_context: str | None = None,
        skill_selection_context: str | None = None,
+        prompt_locale: str | None = None,
        model: str | None = None,
        provider_name: str | None = None,
        api_key: str | None = None,
@@ -335,6 +340,7 @@ class AgentLoop:
        attempt_index: int | None = None,
        pinned_skill_names: list[str] | None = None,
        pinned_skill_contexts: list[SkillContext] | None = None,
+        tool_executor_override: Any = None,
        allow_candidate_generation: bool = False,
        intent_agent_decision: dict[str, Any] | None = None,
        channel_identity: ChannelIdentity | None = None,
@@ -354,6 +360,7 @@ class AgentLoop:
        tool_registry = self._require_loaded("tool_registry")
        tool_assembler = self._require_loaded("tool_assembler")
        tool_executor = self._require_loaded("tool_executor")
+        effective_tool_executor = tool_executor_override or tool_executor
        skills_loader = self._require_loaded("skills_loader")
        skill_assembler = self._require_loaded("skill_assembler")
        skill_learning_service = self._require_loaded("skill_learning_service")
@@ -568,6 +575,7 @@ class AgentLoop:

            build_input = ContextBuildInput(
                base_system_prompt=self.profile.system_prompt,
+                prompt_locale=prompt_locale,
                history=session_manager.get_history(
                    resolved_session_id,
                    max_messages=max(1, self.profile.max_context_messages),
@@ -789,7 +797,7 @@ class AgentLoop:

                iterations += 1
                for tool_call in response.tool_calls:
-                    result = await tool_executor.execute_tool_call(tool_call, context=tool_context)
+                    result = await effective_tool_executor.execute_tool_call(tool_call, context=tool_context)
                    session_manager.append_message(
                        resolved_session_id,
                        run_id=resolved_run_id,
--- a/app-instance/backend/beaver/engine/providers/litellm.py
+++ b/app-instance/backend/beaver/engine/providers/litellm.py
@@ -3,9 +3,11 @@
 from __future__ import annotations

 from contextlib import contextmanager
+from ipaddress import ip_address
 import json
 import os
 from typing import Any
+from urllib.parse import urlsplit

 from .base import LLMProvider, LLMResponse, ToolCallRequest
 from .registry import find_by_model, find_by_name, find_gateway
@@ -26,6 +28,23 @@ except ModuleNotFoundError:  # pragma: no cover
 _ALLOWED_MSG_KEYS = frozenset({"role", "content", "tool_calls", "tool_call_id", "name", "reasoning_content"})


+def _looks_like_local_vllm_api_base(api_base: str | None) -> bool:
+    if not api_base:
+        return False
+    lowered = api_base.lower()
+    if "vllm" in lowered or "localhost" in lowered:
+        return True
+
+    host = urlsplit(lowered).hostname or ""
+    if host in {"127.0.0.1", "::1", "0.0.0.0"}:
+        return True
+    try:
+        parsed_host = ip_address(host)
+    except ValueError:
+        return False
+    return parsed_host.is_private or parsed_host.is_loopback
+
+
 class LiteLLMProvider(LLMProvider):
    """通过 LiteLLM 统一访问大多数 provider。"""

@@ -200,10 +219,12 @@ class LiteLLMProvider(LLMProvider):
        kwargs["extra_body"] = extra_body

    def _uses_mistral_reasoning_parser(self, original_model: str, resolved_model: str) -> bool:
-        if self.provider_name != "vllm":
-            return False
        model_names = f"{original_model} {resolved_model}".lower()
-        return "mistral" in model_names
+        if "mistral" not in model_names:
+            return False
+        if self.provider_name == "vllm":
+            return True
+        return self.provider_name in {"openai", "custom"} and _looks_like_local_vllm_api_base(self.api_base)

    async def chat(
        self,
--- a/app-instance/backend/beaver/interfaces/web/app.py
+++ b/app-instance/backend/beaver/interfaces/web/app.py
@@ -50,6 +50,7 @@ from beaver.services.user_file_resolver import (
    build_file_auth_context,
 )
 from beaver.skills.learning import SkillLearningWorker, SkillLearningWorkerConfig
+from beaver.skills.learning.replay import ReplayRunner
 from beaver.skills.catalog.utils import parse_frontmatter

 from .deps import get_agent_service
@@ -2080,7 +2081,8 @@ def create_app(
    @app.post("/api/skills/candidates/{candidate_id}/draft")
    async def synthesize_skill_draft(candidate_id: str, request: Request) -> dict[str, Any]:
        agent_service = get_agent_service(request)
-        loaded = agent_service.create_loop().boot()
+        loop = agent_service.create_loop()
+        loaded = loop.boot()
        try:
            candidate = loaded.skill_learning_pipeline.get_candidate(candidate_id)  # type: ignore[union-attr]
            if candidate.draft_skill_name and candidate.draft_id:
@@ -2099,6 +2101,7 @@ def create_app(
                draft.skill_name,
                draft.draft_id,
                provider_bundle=provider_bundle,
+                replay_runner=ReplayRunner(agent_loop=loop),
            )
        except ValueError as exc:
            raise HTTPException(status_code=404, detail=str(exc)) from exc
@@ -2107,7 +2110,8 @@ def create_app(
    @app.post("/api/skills/candidates/{candidate_id}/regenerate")
    async def regenerate_skill_draft(candidate_id: str, request: Request) -> dict[str, Any]:
        agent_service = get_agent_service(request)
-        loaded = agent_service.create_loop().boot()
+        loop = agent_service.create_loop()
+        loaded = loop.boot()
        provider_bundle = agent_service._make_provider_bundle_for_task(loaded, {})  # noqa: SLF001
        try:
            draft = await loaded.skill_learning_pipeline.regenerate_draft(  # type: ignore[union-attr]
@@ -2120,6 +2124,7 @@ def create_app(
                draft.skill_name,
                draft.draft_id,
                provider_bundle=provider_bundle,
+                replay_runner=ReplayRunner(agent_loop=loop),
            )
        except ValueError as exc:
            raise HTTPException(status_code=404, detail=str(exc)) from exc
@@ -2458,6 +2463,7 @@ def create_app(
                "user_id": payload.user_id,
                "title": payload.title,
                "execution_context": payload.execution_context,
+                "prompt_locale": payload.prompt_locale,
                "model": payload.model,
                "provider_name": payload.provider_name,
                "embedding_model": payload.embedding_model,
@@ -2573,6 +2579,7 @@ def create_app(
                        "user_id": _clean_text(payload.get("user_id")) or None,
                        "title": _clean_text(payload.get("title")) or None,
                        "execution_context": _clean_text(payload.get("execution_context")) or None,
+                        "prompt_locale": _clean_text(payload.get("prompt_locale")) or None,
                        "model": _clean_text(payload.get("model")) or None,
                        "provider_name": _clean_text(payload.get("provider_name")) or None,
                        "embedding_model": _clean_text(payload.get("embedding_model")) or None,
--- a/app-instance/backend/beaver/interfaces/web/schemas/chat.py
+++ b/app-instance/backend/beaver/interfaces/web/schemas/chat.py
@@ -55,6 +55,7 @@ class WebChatRequest(BaseModel):
    user_id: str | None = None
    title: str | None = None
    execution_context: str | None = None
+    prompt_locale: str | None = None
    model: str | None = None
    provider_name: str | None = None
    embedding_model: str | None = None
--- a/app-instance/backend/beaver/memory/skills/models.py
+++ b/app-instance/backend/beaver/memory/skills/models.py
@@ -227,6 +227,15 @@ class SkillDraftEvalReport:
    cases: list[dict[str, Any]] = field(default_factory=list)
    status: str = "completed"
    created_at: str = ""
+    eval_version: str = "heuristic-v1"
+    mode: str = "heuristic"
+    execution_coverage: float = 0.0
+    surrogate_coverage: float = 0.0
+    blocked_coverage: float = 0.0
+    confidence: str = "low"
+    case_reports: list[dict[str, Any]] = field(default_factory=list)
+    tool_mode_summary: dict[str, Any] = field(default_factory=dict)
+    preservation_report: dict[str, Any] | None = None

    def to_dict(self) -> dict[str, Any]:
        return {
@@ -244,6 +253,17 @@ class SkillDraftEvalReport:
            "cases": [dict(item) for item in self.cases],
            "status": self.status,
            "created_at": self.created_at,
+            "eval_version": self.eval_version,
+            "mode": self.mode,
+            "execution_coverage": self.execution_coverage,
+            "surrogate_coverage": self.surrogate_coverage,
+            "blocked_coverage": self.blocked_coverage,
+            "confidence": self.confidence,
+            "case_reports": [dict(item) for item in self.case_reports],
+            "tool_mode_summary": dict(self.tool_mode_summary),
+            "preservation_report": (
+                dict(self.preservation_report) if self.preservation_report is not None else None
+            ),
        }

    @classmethod
@@ -263,6 +283,23 @@ class SkillDraftEvalReport:
            cases=[dict(item) for item in payload.get("cases") or [] if isinstance(item, dict)],
            status=str(payload.get("status") or "completed"),
            created_at=str(payload.get("created_at") or ""),
+            eval_version=str(payload.get("eval_version") or "heuristic-v1"),
+            mode=str(payload.get("mode") or "heuristic"),
+            execution_coverage=_bounded_float(payload.get("execution_coverage"), default=0.0),
+            surrogate_coverage=_bounded_float(payload.get("surrogate_coverage"), default=0.0),
+            blocked_coverage=_bounded_float(payload.get("blocked_coverage"), default=0.0),
+            confidence=str(payload.get("confidence") or "low"),
+            case_reports=[
+                dict(item)
+                for item in payload.get("case_reports") or []
+                if isinstance(item, dict)
+            ],
+            tool_mode_summary=dict(payload.get("tool_mode_summary") or {}),
+            preservation_report=(
+                dict(payload["preservation_report"])
+                if isinstance(payload.get("preservation_report"), dict)
+                else None
+            ),
        )


@@ -272,6 +309,15 @@ def _optional_str(value: Any) -> str | None:
    return str(value)


+def _bounded_float(value: Any, *, default: float = 0.0) -> float:
+    if value in (None, ""):
+        return default
+    try:
+        return max(0.0, min(1.0, float(value)))
+    except (TypeError, ValueError):
+        return default
+
+
 def _summarize_evidence(payload: dict[str, Any]) -> str:
    evidence = payload.get("evidence")
    if isinstance(evidence, dict):
--- a/app-instance/backend/beaver/prompts/__init__.py
+++ b/app-instance/backend/beaver/prompts/__init__.py
@@ -0,0 +1,5 @@
+"""Prompt templates used by Beaver runtime components."""
+
+from .main_agent import get_main_agent_prompt
+
+__all__ = ["get_main_agent_prompt"]
--- a/app-instance/backend/beaver/prompts/main_agent.py
+++ b/app-instance/backend/beaver/prompts/main_agent.py
@@ -0,0 +1,55 @@
+"""Locale-aware main agent prompt loading."""
+
+from __future__ import annotations
+
+from functools import lru_cache
+from pathlib import Path
+
+DEFAULT_MAIN_AGENT_PROMPT_LOCALE = "zh-Hans"
+
+_PROMPT_FILES = {
+    "zh-Hans": "zh-Hans.md",
+    "zh-Hant": "zh-Hant.md",
+    "en": "en.md",
+}
+
+_LOCALE_ALIASES = {
+    "zh": "zh-Hans",
+    "zh-cn": "zh-Hans",
+    "zh-hans": "zh-Hans",
+    "zh-sg": "zh-Hans",
+    "zh-hant": "zh-Hant",
+    "zh-tw": "zh-Hant",
+    "zh-hk": "zh-Hant",
+    "zh-mo": "zh-Hant",
+    "en": "en",
+    "en-us": "en",
+    "en-gb": "en",
+}
+
+
+def get_main_agent_prompt(locale: str | None = None) -> str:
+    """Return the main-agent identity prompt for a prompt locale."""
+
+    prompt_locale = normalize_main_agent_prompt_locale(locale)
+    return _load_main_agent_prompt(prompt_locale)
+
+
+def normalize_main_agent_prompt_locale(locale: str | None = None) -> str:
+    cleaned = (locale or DEFAULT_MAIN_AGENT_PROMPT_LOCALE).strip()
+    if not cleaned:
+        return DEFAULT_MAIN_AGENT_PROMPT_LOCALE
+    normalized = _LOCALE_ALIASES.get(cleaned.lower())
+    if normalized:
+        return normalized
+    return cleaned if cleaned in _PROMPT_FILES else DEFAULT_MAIN_AGENT_PROMPT_LOCALE
+
+
+@lru_cache(maxsize=len(_PROMPT_FILES))
+def _load_main_agent_prompt(locale: str) -> str:
+    filename = _PROMPT_FILES.get(locale, _PROMPT_FILES[DEFAULT_MAIN_AGENT_PROMPT_LOCALE])
+    path = Path(__file__).with_name("main_agent") / filename
+    if not path.exists():
+        fallback_path = Path(__file__).with_name("main_agent") / _PROMPT_FILES[DEFAULT_MAIN_AGENT_PROMPT_LOCALE]
+        return fallback_path.read_text(encoding="utf-8").strip()
+    return path.read_text(encoding="utf-8").strip()
--- a/app-instance/backend/beaver/prompts/main_agent/en.md
+++ b/app-instance/backend/beaver/prompts/main_agent/en.md
@@ -0,0 +1,7 @@
+You are Beaver, an AI assistant developed by Boway Information Systems Co., Ltd.
+
+When communicating with users, keep this identity consistent. If users ask who you are, say that you are Beaver, an AI assistant developed by Boway Information Systems Co., Ltd.
+
+# Language
+
+Use English for user-facing replies, task titles, summaries, plans, and final reports while this prompt is active. If the user explicitly asks for another language, follow that request.
--- a/app-instance/backend/beaver/prompts/main_agent/zh-Hans.md
+++ b/app-instance/backend/beaver/prompts/main_agent/zh-Hans.md
@@ -0,0 +1,7 @@
+你是海狸 (Beaver)，博维资讯系统有限公司研发的 AI 助手。
+
+与用户沟通时，保持这个身份一致。用户问你是谁时，说明你是海狸 (Beaver)，博维资讯系统有限公司研发的 AI 助手。
+
+# 语言
+
+使用简体中文进行面向用户的回复、任务标题、摘要、计划和最终报告。若用户明确要求其他语言，则按用户要求执行。
--- a/app-instance/backend/beaver/prompts/main_agent/zh-Hant.md
+++ b/app-instance/backend/beaver/prompts/main_agent/zh-Hant.md
@@ -0,0 +1,7 @@
+你是海狸 (Beaver)，博維資訊系統有限公司研發的 AI 助手。
+
+與使用者溝通時，保持這個身份一致。使用者問你是誰時，說明你是海狸 (Beaver)，博維資訊系統有限公司研發的 AI 助手。
+
+# 語言
+
+使用繁體中文進行面向使用者的回覆、任務標題、摘要、計劃和最終報告。若使用者明確要求其他語言，則按使用者要求執行。
--- a/app-instance/backend/beaver/services/agent_service.py
+++ b/app-instance/backend/beaver/services/agent_service.py
@@ -22,6 +22,7 @@ from beaver.engine import AgentLoop, AgentProfile, AgentRunResult, EngineLoader
 from beaver.engine.providers import make_provider_bundle
 from beaver.foundation.events import InboundMessage, OutboundMessage
 from beaver.foundation.models import CronJob, CronRunRecord
+from beaver.prompts.main_agent import normalize_main_agent_prompt_locale
 from beaver.tasks import (
    EvidenceBuilder,
    MainAgentRouter,
@@ -622,6 +623,7 @@ class AgentService:
                session_id=session_id,
                description=message,
                metadata={
+                    "prompt_locale": normalize_main_agent_prompt_locale(kwargs.get("prompt_locale")),
                    "router_reason": decision.reason,
                    **({"short_title": decision.short_title} if decision.short_title else {}),
                },
@@ -749,6 +751,8 @@ class AgentService:
        session_manager = self._require_loaded(loaded, "session_manager")

        base_execution_context = kwargs.get("execution_context")
+        prompt_locale = kwargs.get("prompt_locale") or task.metadata.get("prompt_locale")
+        output_language_instruction = self._output_language_instruction(prompt_locale)
        provider_bundle = kwargs.get("provider_bundle") or self._make_provider_bundle_for_task(loaded, kwargs)
        kwargs = dict(kwargs)
        team_provider_bundle_factory = kwargs.pop("team_provider_bundle_factory", None)
@@ -843,8 +847,11 @@ class AgentService:
                "allow_candidate_generation": False,
            }
        )
-        if team_execution_context:
-            attempt_kwargs["execution_context"] = self._join_context(base_execution_context, team_execution_context)
+        attempt_kwargs["execution_context"] = self._join_context(
+            base_execution_context,
+            output_language_instruction,
+            team_execution_context,
+        )
        if plan.is_team and team_execution_context:
            attempt_kwargs["include_tools"] = False
            attempt_kwargs["max_tool_iterations"] = 0
@@ -979,6 +986,24 @@ class AgentService:
            "short_title": decision.short_title,
        }

+    @staticmethod
+    def _output_language_instruction(prompt_locale: str | None) -> str:
+        locale = normalize_main_agent_prompt_locale(prompt_locale)
+        if locale == "en":
+            return (
+                "Output language: English. Use English for user-facing task titles, summaries, plans, "
+                "and final answers unless the user explicitly requests another language."
+            )
+        if locale == "zh-Hant":
+            return (
+                "輸出語言：繁體中文。除非使用者明確要求其他語言，所有面向使用者的任務標題、摘要、"
+                "計劃與最終回答都使用繁體中文。"
+            )
+        return (
+            "输出语言：简体中文。除非用户明确要求其他语言，所有面向用户的任务标题、摘要、"
+            "计划与最终回答都使用简体中文。"
+        )
+
    @staticmethod
    def _skill_names_for_run(loaded: Any, run_id: str) -> list[str]:
        store = getattr(loaded, "run_memory_store", None)
--- a/app-instance/backend/beaver/skills/learning/__init__.py
+++ b/app-instance/backend/beaver/skills/learning/__init__.py
@@ -1,5 +1,6 @@
 """Skill learning loop helpers."""

+from .case_selection import select_replay_cases
 from .evidence import EvidencePacket, EvidenceSelector
 from .eval import SkillDraftEvaluator
 from .missing_skill import (
@@ -9,11 +10,15 @@ from .missing_skill import (
    MissingSkillSynthesizer,
 )
 from .pipeline import SkillLearningPipelineService
+from .preservation import check_preservation
+from .replay import ReplayArmRequest, ReplayRunner, ReplayToolExecutor, ReplayToolPolicy, classify_tool_mode
 from .service import RunReceiptContext, SkillLearningService
+from .surrogate import SurrogateToolEvaluator
 from .synthesizer import SkillDraftSynthesizer
 from .worker import SkillLearningWorker, SkillLearningWorkerConfig, SkillLearningWorkerResult

 __all__ = [
+    "select_replay_cases",
    "EvidencePacket",
    "EvidenceSelector",
    "SkillDraftEvaluator",
@@ -23,6 +28,13 @@ __all__ = [
    "MissingSkillSynthesizer",
    "RunReceiptContext",
    "SkillLearningPipelineService",
+    "check_preservation",
+    "ReplayToolExecutor",
+    "ReplayToolPolicy",
+    "ReplayArmRequest",
+    "ReplayRunner",
+    "classify_tool_mode",
+    "SurrogateToolEvaluator",
    "SkillDraftSynthesizer",
    "SkillLearningService",
    "SkillLearningWorker",
--- a/app-instance/backend/beaver/skills/learning/case_selection.py
+++ b/app-instance/backend/beaver/skills/learning/case_selection.py
@@ -0,0 +1,109 @@
+"""Historical replay case selection for skill draft evaluation."""
+
+from __future__ import annotations
+
+from typing import Any
+
+from beaver.memory.runs import RunRecord
+from beaver.memory.skills import SkillLearningCandidate
+
+MAX_REPLAY_CASES = 10
+
+
+def select_replay_cases(candidate: SkillLearningCandidate, runs: list[RunRecord]) -> list[dict[str, Any]]:
+    accepted = [record for record in runs if _is_accepted(record)]
+    if candidate.kind == "revise_skill":
+        selected = _select_revise(candidate, accepted)
+    elif candidate.kind == "merge_skills":
+        selected = _select_merge(candidate, accepted)
+    else:
+        selected = _select_new(candidate, accepted)
+    return [_case_payload(candidate, record) for record in selected[:MAX_REPLAY_CASES]]
+
+
+def _select_revise(candidate: SkillLearningCandidate, runs: list[RunRecord]) -> list[RunRecord]:
+    target = candidate.related_skill_names[0] if candidate.related_skill_names else ""
+    version = str(candidate.evidence.get("skill_version") or "")
+    matches = [
+        record
+        for record in runs
+        if any(
+            receipt.skill_name == target and (not version or receipt.skill_version == version)
+            for receipt in record.activated_skills
+        )
+    ]
+    return _recent_diverse(matches)
+
+
+def _select_merge(candidate: SkillLearningCandidate, runs: list[RunRecord]) -> list[RunRecord]:
+    targets = set(candidate.related_skill_names)
+    matches = [
+        record
+        for record in runs
+        if targets and targets.issubset({receipt.skill_name for receipt in record.activated_skills})
+    ]
+    return _recent_diverse(matches)
+
+
+def _select_new(candidate: SkillLearningCandidate, runs: list[RunRecord]) -> list[RunRecord]:
+    source_ids = set(candidate.source_run_ids)
+    if source_ids:
+        matches = [record for record in runs if record.run_id in source_ids]
+    else:
+        theme = str(candidate.evidence.get("theme") or "").lower().strip()
+        matches = [record for record in runs if theme and theme in record.task_text.lower()]
+    return _recent_diverse(matches)
+
+
+def _case_payload(candidate: SkillLearningCandidate, record: RunRecord) -> dict[str, Any]:
+    baseline_skill_names = []
+    if candidate.kind == "revise_skill":
+        baseline_skill_names = list(candidate.related_skill_names[:1])
+    elif candidate.kind == "merge_skills":
+        baseline_skill_names = list(candidate.related_skill_names)
+    return {
+        "run_id": record.run_id,
+        "task_id": record.task_id,
+        "session_id": record.session_id,
+        "task_text": record.task_text,
+        "baseline_skill_names": baseline_skill_names,
+        "candidate_skill_name": candidate.draft_skill_name,
+        "accepted_score": _score(record),
+    }
+
+
+def _recent_diverse(runs: list[RunRecord]) -> list[RunRecord]:
+    sorted_runs = sorted(runs, key=lambda item: (item.started_at, item.run_id), reverse=True)
+    result: list[RunRecord] = []
+    seen_tasks: set[str] = set()
+    for record in sorted_runs:
+        task_key = record.task_id or record.task_text
+        if task_key in seen_tasks and len(sorted_runs) > MAX_REPLAY_CASES:
+            continue
+        seen_tasks.add(task_key)
+        result.append(record)
+        if len(result) >= MAX_REPLAY_CASES:
+            break
+    if len(result) < min(len(sorted_runs), MAX_REPLAY_CASES):
+        seen_run_ids = {record.run_id for record in result}
+        result.extend(record for record in sorted_runs if record.run_id not in seen_run_ids)
+    return result[:MAX_REPLAY_CASES]
+
+
+def _is_accepted(record: RunRecord) -> bool:
+    feedback = record.feedback or {}
+    acceptance = feedback.get("acceptance_type")
+    if acceptance is None and feedback.get("feedback_type") == "satisfied":
+        acceptance = "accept"
+    return bool(record.success) and acceptance == "accept"
+
+
+def _score(record: RunRecord) -> float:
+    validation = record.validation_result or {}
+    value = validation.get("score") if isinstance(validation, dict) else None
+    if value is not None:
+        try:
+            return max(0.0, min(1.0, float(value)))
+        except (TypeError, ValueError):
+            pass
+    return 0.8 if record.success else 0.4
--- a/app-instance/backend/beaver/skills/learning/eval.py
+++ b/app-instance/backend/beaver/skills/learning/eval.py
@@ -4,17 +4,28 @@ from __future__ import annotations

 from uuid import uuid4

+from beaver.engine.context import SkillContext
 from beaver.engine.providers import ProviderBundle
 from beaver.memory.runs import RunMemoryStore
 from beaver.memory.skills import SkillDraftEvalReport, SkillLearningCandidate
+from beaver.skills.learning.case_selection import select_replay_cases
+from beaver.skills.learning.preservation import check_preservation
+from beaver.skills.learning.replay import ReplayArmRequest, ReplayRunner
+from beaver.skills.learning.surrogate import SurrogateToolEvaluator
 from beaver.skills.specs import SkillDraft


 class SkillDraftEvaluator:
    """Builds a bounded eval report without writing user-visible sessions."""

-    def __init__(self, run_store: RunMemoryStore) -> None:
+    def __init__(
+        self,
+        run_store: RunMemoryStore,
+        *,
+        surrogate_evaluator: SurrogateToolEvaluator | None = None,
+    ) -> None:
        self.run_store = run_store
+        self.surrogate_evaluator = surrogate_evaluator or SurrogateToolEvaluator()

    async def evaluate(
        self,
@ -22,11 +33,30 @@ class SkillDraftEvaluator:
        candidate: SkillLearningCandidate,
        draft: SkillDraft,
        provider_bundle: ProviderBundle | None,
+        replay_runner: ReplayRunner | None = None,
    ) -> SkillDraftEvalReport:
        if provider_bundle is None or provider_bundle.main_provider is None:
            return self._skipped(candidate, draft)

-        runs_by_id = {record.run_id: record for record in self.run_store.list_runs()}
+        runs = self.run_store.list_runs()
+        replay_cases = select_replay_cases(candidate, runs)
+        if replay_runner is not None and replay_cases:
+            return await self._evaluate_replay(
+                candidate=candidate,
+                draft=draft,
+                replay_cases=replay_cases,
+                provider_bundle=provider_bundle,
+                replay_runner=replay_runner,
+            )
+        return self._evaluate_heuristic(candidate, draft, runs)
+
+    def _evaluate_heuristic(
+        self,
+        candidate: SkillLearningCandidate,
+        draft: SkillDraft,
+        runs: list,
+    ) -> SkillDraftEvalReport:
+        runs_by_id = {record.run_id: record for record in runs}
        cases: list[dict] = []
        for run_id in candidate.source_run_ids[:8]:
            record = runs_by_id.get(run_id)
@ -78,6 +108,78 @@ class SkillDraftEvaluator:
            created_at=_utc_now(),
        )

+    async def _evaluate_replay(
+        self,
+        *,
+        candidate: SkillLearningCandidate,
+        draft: SkillDraft,
+        replay_cases: list[dict],
+        provider_bundle: ProviderBundle,
+        replay_runner: ReplayRunner,
+    ) -> SkillDraftEvalReport:
+        case_reports: list[dict] = []
+        legacy_cases: list[dict] = []
+        for case in replay_cases:
+            baseline = await replay_runner.run_arm(
+                ReplayArmRequest(
+                    case_id=f"{case['run_id']}:baseline",
+                    arm="baseline",
+                    task_text=str(case["task_text"]),
+                    pinned_skill_names=list(case.get("baseline_skill_names") or []),
+                    pinned_skill_contexts=[],
+                    provider_bundle=provider_bundle,
+                    model_settings={"max_tool_iterations": 4, "temperature": 0.0},
+                )
+            )
+            candidate_arm = await replay_runner.run_arm(
+                ReplayArmRequest(
+                    case_id=f"{case['run_id']}:candidate",
+                    arm="candidate",
+                    task_text=str(case["task_text"]),
+                    pinned_skill_names=[],
+                    pinned_skill_contexts=[_draft_skill_context(draft)],
+                    provider_bundle=provider_bundle,
+                    model_settings={"max_tool_iterations": 4, "temperature": 0.0},
+                )
+            )
+            surrogate = await self.surrogate_evaluator.evaluate(
+                task_text=str(case["task_text"]),
+                baseline=baseline,
+                candidate=candidate_arm,
+            )
+            baseline_score = surrogate["baseline_score"]
+            candidate_score = surrogate["candidate_score"]
+            case_report = {
+                "run_id": case["run_id"],
+                "task_id": case.get("task_id"),
+                "session_id": case.get("session_id"),
+                "baseline": baseline,
+                "candidate": candidate_arm,
+                "baseline_score": baseline_score,
+                "candidate_score": candidate_score,
+                "delta": round(candidate_score - baseline_score, 4),
+                "execution_coverage": _arm_mode_coverage(baseline, candidate_arm, "executed"),
+                "surrogate_coverage": _arm_mode_coverage(baseline, candidate_arm, "surrogate"),
+                "blocked_tool_count": _arm_mode_count(baseline, candidate_arm, "blocked"),
+                "confidence": surrogate["confidence"],
+                "tool_calls": [*baseline.get("tool_calls", []), *candidate_arm.get("tool_calls", [])],
+                "artifacts": [*baseline.get("artifacts", []), *candidate_arm.get("artifacts", [])],
+                "side_effects": [*baseline.get("side_effects", []), *candidate_arm.get("side_effects", [])],
+                "validator_notes": list(surrogate.get("notes") or []),
+            }
+            case_reports.append(case_report)
+            legacy_cases.append(
+                {
+                    "run_id": case["run_id"],
+                    "session_id": case.get("session_id") or "",
+                    "baseline_score": baseline_score,
+                    "candidate_score": candidate_score,
+                    "delta": round(candidate_score - baseline_score, 4),
+                }
+            )
+        preservation_report = _preservation_report(candidate, draft)
+        return _report_from_case_reports(candidate, draft, case_reports, legacy_cases, preservation_report)
+
    def _skipped(self, candidate: SkillLearningCandidate, draft: SkillDraft) -> SkillDraftEvalReport:
        return SkillDraftEvalReport(
            report_id=uuid4().hex,
@ -115,6 +217,108 @@ def _candidate_score(baseline: float, draft: SkillDraft) -> float:
    return min(1.0, max(0.75, baseline + 0.05))


+def _draft_skill_context(draft: SkillDraft) -> SkillContext:
+    tool_hints = draft.proposed_frontmatter.get("tools")
+    return SkillContext(
+        name=f"draft:{draft.skill_name}",
+        content=draft.proposed_content,
+        version=draft.draft_id,
+        content_hash="draft",
+        activation_reason="skill_replay_eval_candidate",
+        tool_hints=[str(item) for item in tool_hints if str(item).strip()] if isinstance(tool_hints, list) else [],
+    )
+
+
+def _preservation_report(candidate: SkillLearningCandidate, draft: SkillDraft) -> dict | None:
+    if candidate.kind not in {"revise_skill", "merge_skills"}:
+        return None
+    base_content = str(candidate.evidence.get("base_content") or "") if isinstance(candidate.evidence, dict) else ""
+    if not base_content.strip():
+        return None
+    return check_preservation(base_content=base_content, draft_content=draft.proposed_content)
+
+
+def _report_from_case_reports(
+    candidate: SkillLearningCandidate,
+    draft: SkillDraft,
+    case_reports: list[dict],
+    legacy_cases: list[dict],
+    preservation_report: dict | None,
+) -> SkillDraftEvalReport:
+    baseline_avg = sum(item["baseline_score"] for item in legacy_cases) / len(legacy_cases)
+    candidate_avg = sum(item["candidate_score"] for item in legacy_cases) / len(legacy_cases)
+    regressions = [item for item in legacy_cases if item["candidate_score"] < item["baseline_score"]]
+    improved = [item for item in legacy_cases if item["candidate_score"] > item["baseline_score"]]
+    unchanged = len(legacy_cases) - len(regressions) - len(improved)
+    execution, surrogate, blocked = _coverage(case_reports)
+    confidence = _confidence(execution, surrogate, blocked, [item.get("confidence") for item in case_reports])
+    score_delta = candidate_avg - baseline_avg
+    passed = candidate_avg >= 0.75 and not (regressions and score_delta <= 0) and blocked < 1.0
+    return SkillDraftEvalReport(
+        report_id=uuid4().hex,
+        skill_name=draft.skill_name,
+        draft_id=draft.draft_id,
+        candidate_id=candidate.candidate_id,
+        passed=passed,
+        baseline_score_avg=round(baseline_avg, 4),
+        candidate_score_avg=round(candidate_avg, 4),
+        score_delta=round(score_delta, 4),
+        regression_count=len(regressions),
+        improved_count=len(improved),
+        unchanged_count=unchanged,
+        cases=legacy_cases,
+        status="completed",
+        created_at=_utc_now(),
+        eval_version="replay-v1",
+        mode="replay",
+        execution_coverage=execution,
+        surrogate_coverage=surrogate,
+        blocked_coverage=blocked,
+        confidence=confidence,
+        case_reports=case_reports,
+        tool_mode_summary={"executed": execution, "surrogate": surrogate, "blocked": blocked},
+        preservation_report=preservation_report,
+    )
+
+
+def _coverage(case_reports: list[dict]) -> tuple[float, float, float]:
+    counts = {"executed": 0, "surrogate": 0, "blocked": 0}
+    for report in case_reports:
+        for call in report.get("tool_calls") or []:
+            if isinstance(call, dict) and call.get("mode") in counts:
+                counts[str(call["mode"])] += 1
+    total = sum(counts.values())
+    if total == 0:
+        return 1.0, 0.0, 0.0  # no tool calls recorded: treated as fully executed
+    return (
+        round(counts["executed"] / total, 4),
+        round(counts["surrogate"] / total, 4),
+        round(counts["blocked"] / total, 4),
+    )
+
+
+def _confidence(execution: float, surrogate: float, blocked: float, case_confidences: list[object]) -> str:
+    if blocked > 0.0:
+        return "low"
+    if execution >= 0.75 and surrogate <= 0.25:
+        return "high"
+    if execution >= 0.25 or "medium" in case_confidences:
+        return "medium"
+    return "low"
+
+
+def _arm_mode_coverage(baseline: dict, candidate: dict, mode: str) -> float:
+    calls = [*baseline.get("tool_calls", []), *candidate.get("tool_calls", [])]
+    if not calls:
+        return 1.0 if mode == "executed" else 0.0
+    return round(sum(1 for call in calls if isinstance(call, dict) and call.get("mode") == mode) / len(calls), 4)
+
+
+def _arm_mode_count(baseline: dict, candidate: dict, mode: str) -> int:
+    calls = [*baseline.get("tool_calls", []), *candidate.get("tool_calls", [])]
+    return sum(1 for call in calls if isinstance(call, dict) and call.get("mode") == mode)
+
+
 def _utc_now() -> str:
    from datetime import datetime, timezone
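
Review note on the replay report helpers in `eval.py`. The following is a minimal standalone sketch of how tool-call modes roll up into coverage, a confidence label, and the pass decision. The function names drop the leading underscore and `passed` is a hypothetical helper that isolates the boolean from `_report_from_case_reports`; the thresholds are the ones in the diff.

```python
from __future__ import annotations


def coverage(case_reports: list[dict]) -> tuple[float, float, float]:
    """Share of executed / surrogate / blocked tool calls across all cases."""
    counts = {"executed": 0, "surrogate": 0, "blocked": 0}
    for report in case_reports:
        for call in report.get("tool_calls") or []:
            if isinstance(call, dict) and call.get("mode") in counts:
                counts[str(call["mode"])] += 1
    total = sum(counts.values())
    if total == 0:
        return 1.0, 0.0, 0.0  # no tool calls counts as fully executed
    return (
        round(counts["executed"] / total, 4),
        round(counts["surrogate"] / total, 4),
        round(counts["blocked"] / total, 4),
    )


def confidence(execution: float, surrogate: float, blocked: float, case_confidences: list[object]) -> str:
    if blocked > 0.0:
        return "low"  # any blocked call forces low confidence
    if execution >= 0.75 and surrogate <= 0.25:
        return "high"
    if execution >= 0.25 or "medium" in case_confidences:
        return "medium"
    return "low"


def passed(candidate_avg: float, baseline_avg: float, regression_count: int, blocked: float) -> bool:
    # Hypothetical helper: same rule as the `passed = ...` line in the diff.
    score_delta = candidate_avg - baseline_avg
    return candidate_avg >= 0.75 and not (regression_count > 0 and score_delta <= 0) and blocked < 1.0


reports = [{"tool_calls": [{"mode": "executed"}] * 3 + [{"mode": "surrogate"}]}]
execution, surrogate, blocked = coverage(reports)
label = confidence(execution, surrogate, blocked, [])
```

With three executed calls and one surrogate call the coverage is `(0.75, 0.25, 0.0)`, which sits exactly on the boundary for `"high"`. A single blocked call anywhere in the run drops the label to `"low"`, and the publish gate in `pipeline.py` then rejects the draft.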

--- a/app-instance/backend/beaver/skills/learning/pipeline.py
+++ b/app-instance/backend/beaver/skills/learning/pipeline.py
@ -8,6 +8,7 @@ from beaver.engine.providers import ProviderBundle
 from beaver.memory.skills import SkillDraftEvalReport, SkillDraftSafetyReport, SkillLearningCandidate, SkillLearningStore
 from beaver.skills.drafts import DraftService
 from beaver.skills.learning.eval import SkillDraftEvaluator
+from beaver.skills.learning.replay import ReplayRunner
 from beaver.skills.learning.service import SkillLearningService
 from beaver.skills.learning.safety import SkillDraftSafetyChecker
 from beaver.skills.publisher import SkillPublisher
@ -285,11 +286,17 @@ class SkillLearningPipelineService:
        draft_id: str,
        *,
        provider_bundle: ProviderBundle | None,
+        replay_runner: ReplayRunner | None = None,
    ) -> SkillDraftEvalReport:
        draft = self.get_draft(skill_name, draft_id)
        candidate = self.get_candidate(candidate_id)
        evaluator = self.evaluator or SkillDraftEvaluator(self.learning_service.run_store)
-        report = await evaluator.evaluate(candidate=candidate, draft=draft, provider_bundle=provider_bundle)
+        report = await evaluator.evaluate(
+            candidate=candidate,
+            draft=draft,
+            provider_bundle=provider_bundle,
+            replay_runner=replay_runner,
+        )
        self.learning_store.write_eval_report(report)
        if report.status == "skipped_provider_unavailable":
            status = "draft_ready"
@ -330,6 +337,14 @@ class SkillLearningPipelineService:
        eval_report = self.get_eval_report(draft.skill_name, draft.draft_id)
        if eval_report is not None and eval_report.status != "skipped_provider_unavailable" and not eval_report.passed:
            raise ValueError("Draft eval report did not pass")
+        if eval_report is not None and eval_report.mode == "replay":
+            if eval_report.confidence == "low":
+                raise ValueError("Draft replay eval has low confidence and requires revision before publish")
+            if eval_report.blocked_coverage >= 1.0:
+                raise ValueError("Draft replay eval blocked all important tool calls")
+            preservation = eval_report.preservation_report or {}
+            if preservation.get("passed") is False:
+                raise ValueError("Draft preservation check did not pass")

    def _mark_candidate_by_draft(
        self,
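
Review note on the publish gate added to `pipeline.py`. The sketch below mirrors the order of the checks with a hypothetical `FakeEvalReport` dataclass and a hypothetical `gate_reason` helper that returns the message instead of raising `ValueError` as the diff does. Both names are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class FakeEvalReport:
    """Hypothetical stand-in for SkillDraftEvalReport (only the gated fields)."""

    status: str = "completed"
    passed: bool = True
    mode: str = "replay"
    confidence: str = "high"
    blocked_coverage: float = 0.0
    preservation_report: dict[str, Any] | None = None


def gate_reason(report: FakeEvalReport | None) -> str | None:
    """Return the first blocking reason, or None when publishing may proceed."""
    if report is None:
        return None  # no eval report yet: this gate does not block
    if report.status != "skipped_provider_unavailable" and not report.passed:
        return "Draft eval report did not pass"
    if report.mode == "replay":
        if report.confidence == "low":
            return "Draft replay eval has low confidence and requires revision before publish"
        if report.blocked_coverage >= 1.0:
            return "Draft replay eval blocked all important tool calls"
        preservation = report.preservation_report or {}
        # Only an explicit False blocks; a missing report or missing key passes.
        if preservation.get("passed") is False:
            return "Draft preservation check did not pass"
    return None
```

Note that a missing `preservation_report` does not block publishing: only an explicit `"passed": False` does. Heuristic-mode reports skip the three replay checks entirely.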
--- a/app-instance/backend/beaver/skills/learning/preservation.py
+++ b/app-instance/backend/beaver/skills/learning/preservation.py
@ -0,0 +1,53 @@
+"""Preservation checks for skill revision drafts."""
+
+from __future__ import annotations
+
+import re
+from typing import Any
+
+
+def check_preservation(*, base_content: str, draft_content: str) -> dict[str, Any]:
+    base_sections = _sections(base_content)
+    draft_sections = _sections(draft_content)
+    preserved: list[str] = []
+    changed: list[str] = []
+    dropped: list[str] = []
+
+    for heading, body in base_sections.items():
+        draft_body = draft_sections.get(heading)
+        if draft_body is None:
+            dropped.append(heading)
+            continue
+        preserved.append(heading)  # heading survived; its body may still differ (see changed)
+        if _normalize(body) != _normalize(draft_body):
+            changed.append(heading)
+
+    risk_level = "high" if dropped else "low"
+    return {
+        "passed": not dropped,
+        "risk_level": risk_level,
+        "preserved_sections": preserved,
+        "changed_sections": changed,
+        "dropped_sections": dropped,
+    }
+
+
+def _sections(content: str) -> dict[str, str]:
+    current = "body"
+    sections: dict[str, list[str]] = {current: []}
+    for line in (content or "").splitlines():
+        match = re.match(r"^#{1,6}\s+(.+?)\s*$", line)
+        if match:
+            current = match.group(1).strip()
+            sections.setdefault(current, [])
+            continue
+        sections.setdefault(current, []).append(line)
+    return {
+        heading: "\n".join(lines).strip()
+        for heading, lines in sections.items()
+        if "\n".join(lines).strip()
+    }
+
+
+def _normalize(value: str) -> str:
+    return re.sub(r"\s+", " ", value or "").strip().lower()
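
Review note on `preservation.py`. The usage sketch below copies the three functions from the file above so that it runs standalone, then feeds them a small made-up base skill and draft. The sample Markdown is an assumption chosen to show one unchanged, one changed, and one dropped section.

```python
from __future__ import annotations

import re
from typing import Any


def _sections(content: str) -> dict[str, str]:
    # Split Markdown into {heading: body}; text before the first heading is "body".
    current = "body"
    sections: dict[str, list[str]] = {current: []}
    for line in (content or "").splitlines():
        match = re.match(r"^#{1,6}\s+(.+?)\s*$", line)
        if match:
            current = match.group(1).strip()
            sections.setdefault(current, [])
            continue
        sections.setdefault(current, []).append(line)
    # Sections whose body is empty are discarded.
    return {h: "\n".join(ls).strip() for h, ls in sections.items() if "\n".join(ls).strip()}


def _normalize(value: str) -> str:
    return re.sub(r"\s+", " ", value or "").strip().lower()


def check_preservation(*, base_content: str, draft_content: str) -> dict[str, Any]:
    base_sections = _sections(base_content)
    draft_sections = _sections(draft_content)
    preserved: list[str] = []
    changed: list[str] = []
    dropped: list[str] = []
    for heading, body in base_sections.items():
        draft_body = draft_sections.get(heading)
        if draft_body is None:
            dropped.append(heading)
            continue
        preserved.append(heading)
        if _normalize(body) != _normalize(draft_body):
            changed.append(heading)
    return {
        "passed": not dropped,
        "risk_level": "high" if dropped else "low",
        "preserved_sections": preserved,
        "changed_sections": changed,
        "dropped_sections": dropped,
    }


base = "# Intro\nhello\n## Steps\n1. a\n## Notes\nx"
draft = "# Intro\nHello\n## Steps\n1. a\n2. b"
report = check_preservation(base_content=base, draft_content=draft)
```

Two behaviours are worth keeping in mind when reading the report: a changed section is listed under both `preserved_sections` and `changed_sections`, and a draft that keeps a heading but empties its body is reported as having dropped that section, because empty sections are filtered out before the comparison.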
--- a/app-instance/backend/beaver/skills/learning/replay.py
+++ b/app-instance/backend/beaver/skills/learning/replay.py
@ -0,0 +1,203 @@
+"""Replay execution helpers for skill draft evaluation."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+from typing import Any, Literal
+from uuid import uuid4
+
+from beaver.tools.base import ToolContext, ToolResult, ToolSpec
+from beaver.tools.registry.tool_registry import ToolRegistry
+from beaver.tools.runtime.executor import ToolExecutor
+
+ToolExecutionMode = Literal["executed", "surrogate", "blocked"]
+
+
+@dataclass(slots=True)
+class ReplayToolPolicy:
+    safe_toolsets: set[str] = field(default_factory=lambda: {"filesystem", "user_files", "core", "web", "search"})
+    surrogate_transports: set[str] = field(default_factory=lambda: {"mcp", "connector"})
+    destructive_terms: tuple[str, ...] = (
+        "delete",
+        "remove",
+        "destroy",
+        "revoke",
+        "permission",
+        "credential",
+        "payment",
+        "pay",
+    )
+    external_write_terms: tuple[str, ...] = (
+        "send",
+        "post",
+        "publish",
+        "create",
+        "update",
+        "invite",
+        "reply",
+        "forward",
+    )
+
+
+class ReplayToolExecutor:
+    def __init__(
+        self,
+        inner: ToolExecutor,
+        *,
+        registry: ToolRegistry,
+        policy: ReplayToolPolicy | None = None,
+    ) -> None:
+        self.inner = inner
+        self.registry = registry
+        self.policy = policy or ReplayToolPolicy()
+        self.traces: list[dict[str, Any]] = []
+
+    async def execute(
+        self,
+        tool_name: str,
+        arguments: dict[str, Any] | None,
+        *,
+        context: ToolContext | None = None,
+    ) -> ToolResult:
+        tool = self.registry.get(tool_name)
+        spec = tool.spec if tool is not None else ToolSpec(
+            name=tool_name,
+            description="unregistered tool",
+            input_schema={"type": "object", "properties": {}},
+            toolset="unknown",
+        )
+        mode = classify_tool_mode(spec, self.policy)
+        trace = {
+            "trace_id": uuid4().hex,
+            "tool_name": tool_name,
+            "mode": mode,
+            "arguments": dict(arguments or {}),
+            "schema": dict(spec.input_schema),
+            "toolset": spec.toolset,
+            "metadata": dict(spec.metadata),
+            "classification_reason": _classification_reason(spec, mode),
+        }
+        if mode == "executed":
+            result = await self.inner.execute(tool_name, arguments or {}, context=context)
+            trace["result"] = {
+                "success": result.success,
+                "error": result.error,
+                "content": result.content[:2000],
+            }
+            self.traces.append(trace)
+            return result
+        if mode == "surrogate":
+            trace["result"] = {
+                "success": True,
+                "error": "replay_surrogate",
+                "content": "Tool call recorded for surrogate evaluation.",
+            }
+            self.traces.append(trace)
+            return ToolResult(
+                success=True,
+                content="Tool call recorded for surrogate evaluation.",
+                tool_name=tool_name,
+                error="replay_surrogate",
+                raw_output=trace,
+            )
+        trace["result"] = {
+            "success": False,
+            "error": "replay_blocked",
+            "content": "Tool call blocked by replay policy.",
+        }
+        self.traces.append(trace)
+        return ToolResult(
+            success=False,
+            content="Tool call blocked by replay policy.",
+            tool_name=tool_name,
+            error="replay_blocked",
+            raw_output=trace,
+        )
+
+    async def execute_tool_call(self, tool_call: Any, *, context: ToolContext | None = None) -> ToolResult:
+        tool_name, arguments = ToolExecutor._normalize_tool_call(tool_call)
+        return await self.execute(tool_name, arguments, context=context)
+
+
+def classify_tool_mode(spec: ToolSpec, policy: ReplayToolPolicy | None = None) -> ToolExecutionMode:
+    policy = policy or ReplayToolPolicy()
+    name = spec.name.lower()
+    toolset = spec.toolset.lower()
+    metadata = {str(key).lower(): str(value).lower() for key, value in spec.metadata.items()}
+    if any(term in name for term in policy.destructive_terms):  # substring match, e.g. "pay" also hits "payload"
+        return "blocked"
+    if toolset in policy.safe_toolsets:
+        return "executed"
+    if metadata.get("transport") in policy.surrogate_transports or toolset in {"mcp", "connector", "external"}:
+        if any(term in name for term in policy.external_write_terms):
+            return "surrogate"
+        return "executed"
+    return "surrogate"
+
+
+def _classification_reason(spec: ToolSpec, mode: ToolExecutionMode) -> str:
+    return f"{spec.name} classified as {mode} from toolset={spec.toolset} metadata={spec.metadata}"
+
+
+@dataclass(slots=True)
+class ReplayArmRequest:
+    case_id: str
+    arm: str
+    task_text: str
+    pinned_skill_names: list[str] = field(default_factory=list)
+    pinned_skill_contexts: list[Any] = field(default_factory=list)
+    provider_bundle: Any | None = None
+    model_settings: dict[str, Any] = field(default_factory=dict)
+
+
+class ReplayRunner:
+    def __init__(self, *, agent_loop: Any, policy: ReplayToolPolicy | None = None) -> None:
+        self.agent_loop = agent_loop
+        self.policy = policy or ReplayToolPolicy()
+
+    async def run_arm(self, request: ReplayArmRequest) -> dict[str, Any]:
+        loaded = self.agent_loop.boot()
+        replay_executor = ReplayToolExecutor(
+            loaded.tool_executor,
+            registry=loaded.tool_registry,
+            policy=self.policy,
+        )
+        result = await self.agent_loop.process_direct(
+            request.task_text,
+            provider_bundle=request.provider_bundle,
+            include_skill_assembly=False,
+            include_tools=True,
+            pinned_skill_names=request.pinned_skill_names,
+            pinned_skill_contexts=request.pinned_skill_contexts,
+            max_tool_iterations=int(request.model_settings.get("max_tool_iterations") or 4),
+            temperature=float(request.model_settings.get("temperature") or 0.0),
+            source="skill_replay_eval",
+            tool_executor_override=replay_executor,
+        )
+        return {
+            "case_id": request.case_id,
+            "arm": request.arm,
+            "session_id": result.session_id,
+            "run_id": result.run_id,
+            "task_text": request.task_text,
+            "finish_reason": result.finish_reason,
+            "final_answer": result.output_text,
+            "tool_calls": list(replay_executor.traces),
+            "artifacts": [],
+            "side_effects": _side_effects_from_traces(replay_executor.traces),
+        }
+
+
+def _side_effects_from_traces(traces: list[dict[str, Any]]) -> list[dict[str, Any]]:
+    effects: list[dict[str, Any]] = []
+    for trace in traces:
+        if trace.get("mode") in {"surrogate", "blocked"}:
+            effects.append(
+                {
+                    "tool_name": trace.get("tool_name"),
+                    "mode": trace.get("mode"),
+                    "arguments": trace.get("arguments"),
+                    "classification_reason": trace.get("classification_reason"),
+                }
+            )
+    return effects
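
Review note on `classify_tool_mode` in `replay.py`. The sketch below reproduces the decision order with a hypothetical `FakeToolSpec` dataclass (an assumption standing in for `beaver.tools.base.ToolSpec`) and module-level constants in place of `ReplayToolPolicy`. The tool names in the usage lines are invented examples.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any

SAFE_TOOLSETS = {"filesystem", "user_files", "core", "web", "search"}
SURROGATE_TRANSPORTS = {"mcp", "connector"}
DESTRUCTIVE_TERMS = ("delete", "remove", "destroy", "revoke", "permission", "credential", "payment", "pay")
EXTERNAL_WRITE_TERMS = ("send", "post", "publish", "create", "update", "invite", "reply", "forward")


@dataclass
class FakeToolSpec:
    """Hypothetical stand-in for ToolSpec with only the fields the classifier reads."""

    name: str
    toolset: str
    metadata: dict[str, Any] = field(default_factory=dict)


def classify(spec: FakeToolSpec) -> str:
    name = spec.name.lower()
    toolset = spec.toolset.lower()
    metadata = {str(k).lower(): str(v).lower() for k, v in spec.metadata.items()}
    # 1. Destructive names are blocked first, even inside a safe toolset.
    if any(term in name for term in DESTRUCTIVE_TERMS):
        return "blocked"
    # 2. Local, read-mostly toolsets run for real.
    if toolset in SAFE_TOOLSETS:
        return "executed"
    # 3. External transports: writes are recorded only, reads still execute.
    if metadata.get("transport") in SURROGATE_TRANSPORTS or toolset in {"mcp", "connector", "external"}:
        if any(term in name for term in EXTERNAL_WRITE_TERMS):
            return "surrogate"
        return "executed"
    # 4. Anything unknown is never executed.
    return "surrogate"


modes = {
    spec.name: classify(spec)
    for spec in [
        FakeToolSpec("read_file", "filesystem"),
        FakeToolSpec("delete_file", "filesystem"),
        FakeToolSpec("slack_send_message", "mcp"),
        FakeToolSpec("slack_list_channels", "mcp"),
        FakeToolSpec("custom_thing", "unknown"),
        FakeToolSpec("build_payload", "core"),
    ]
}
```

The last case shows a side effect of substring matching: `build_payload` is blocked because its name contains `pay`, although it sits in a safe toolset. If that is not intended, matching on whole name tokens would be one way to tighten the rule.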
--- a/app-instance/backend/beaver/skills/learning/service.py
+++ b/app-instance/backend/beaver/skills/learning/service.py
@ -205,7 +205,13 @@ class SkillLearningService:
            )
        if candidate.kind == "merge_skills":
            target_name = self._suggest_skill_name(candidate, packet)
-            payload = await self.synthesizer.synthesize_merge(candidate, packet, provider, model)
+            payload = await self.synthesizer.synthesize_merge(
+                candidate,
+                packet,
+                provider,
+                model,
+                base_skill=self._merged_base_skill_snapshot(candidate.related_skill_names),
+            )
            return self.draft_service.create_merge_draft(
                skill_name=target_name,
                base_version=None,
@ -217,7 +223,13 @@ class SkillLearningService:
            )
        target_skill = candidate.related_skill_names[0]
        base_version = candidate.evidence.get("skill_version")
-        payload = await self.synthesizer.synthesize_revision(candidate, packet, provider, model)
+        payload = await self.synthesizer.synthesize_revision(
+            candidate,
+            packet,
+            provider,
+            model,
+            base_skill=self._base_skill_snapshot(target_skill, base_version),
+        )
        return self.draft_service.create_revision_draft(
            skill_name=target_skill,
            base_version=base_version,
@ -228,6 +240,46 @@ class SkillLearningService:
            evidence_refs=[{"run_id": item} for item in candidate.source_run_ids],
        )

+    def _base_skill_snapshot(self, skill_name: str, version: str | None) -> dict[str, Any] | None:
+        loaded = self.draft_service.store.read_published_skill(skill_name, version)
+        if loaded is None:
+            return None
+        return {
+            "skill_name": loaded.version.skill_name,
+            "version": loaded.version.version,
+            "frontmatter": dict(loaded.version.frontmatter),
+            "content": loaded.content,
+            "summary": loaded.version.summary,
+            "tool_hints": list(loaded.version.tool_hints),
+        }
+
+    def _merged_base_skill_snapshot(self, skill_names: list[str]) -> dict[str, Any] | None:
+        snapshots = [
+            snapshot
+            for name in skill_names
+            if (snapshot := self._base_skill_snapshot(name, None)) is not None
+        ]
+        if not snapshots:
+            return None
+        return {
+            "skill_name": "merge:" + ",".join(str(item["skill_name"]) for item in snapshots),
+            "version": "mixed",
+            "frontmatter": {"merged_skills": [item["frontmatter"] for item in snapshots]},
+            "content": "\n\n".join(
+                f"<!-- base skill: {item['skill_name']} {item['version']} -->\n{item['content']}"
+                for item in snapshots
+            ),
+            "summary": "\n".join(str(item["summary"]) for item in snapshots if item.get("summary")),
+            "tool_hints": list(
+                dict.fromkeys(
+                    tool
+                    for item in snapshots
+                    for tool in item.get("tool_hints", [])
+                    if str(tool).strip()
+                )
+            ),
+        }
+
    def rescore_skill_versions(self) -> list[SkillPerformanceSnapshot]:
        snapshots: list[SkillPerformanceSnapshot] = []
        grouped: dict[tuple[str, str], list[SkillEffectRecord]] = {}
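
Review note on `_merged_base_skill_snapshot` in `service.py`. The tool hints of the merged snapshot are deduplicated with `dict.fromkeys`, which keeps first-seen order. A minimal sketch, with a hypothetical `merged_tool_hints` helper and invented snapshot data:

```python
from __future__ import annotations

from typing import Any


def merged_tool_hints(snapshots: list[dict[str, Any]]) -> list[str]:
    # Hypothetical helper isolating the tool_hints expression from the diff.
    # dict.fromkeys drops duplicates while preserving insertion order.
    return list(
        dict.fromkeys(
            tool
            for item in snapshots
            for tool in item.get("tool_hints", [])
            if str(tool).strip()  # skip blank entries
        )
    )


snapshots = [
    {"skill_name": "a", "tool_hints": ["read_file", "web_search"]},
    {"skill_name": "b", "tool_hints": ["web_search", " ", "write_file"]},
]
hints = merged_tool_hints(snapshots)
```

Using a `set` here would also remove duplicates but would lose the order in which the base skills listed their tools.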
--- a/app-instance/backend/beaver/skills/learning/surrogate.py
+++ b/app-instance/backend/beaver/skills/learning/surrogate.py
@ -0,0 +1,53 @@
+"""Surrogate evaluation for replay tool calls that cannot execute safely."""
+
+from __future__ import annotations
+
+from typing import Any
+
+
+class SurrogateToolEvaluator:
+    async def evaluate(self, *, task_text: str, baseline: dict[str, Any], candidate: dict[str, Any]) -> dict[str, Any]:
+        baseline_score = _score_arm(task_text, baseline)
+        candidate_score = _score_arm(task_text, candidate)
+        surrogate_count = _mode_count(baseline, "surrogate") + _mode_count(candidate, "surrogate")
+        blocked_count = _mode_count(baseline, "blocked") + _mode_count(candidate, "blocked")
+        confidence = "low" if blocked_count else ("medium" if surrogate_count <= 2 else "low")
+        return {
+            "baseline_score": baseline_score,
+            "candidate_score": candidate_score,
+            "delta": round(candidate_score - baseline_score, 4),
+            "surrogate_tool_count": surrogate_count,
+            "blocked_tool_count": blocked_count,
+            "confidence": confidence,
+            "notes": [
+                "Surrogate score is based on intended tool calls, schemas, arguments, and task relevance.",
+            ],
+        }
+
+
+def _score_arm(task_text: str, arm: dict[str, Any]) -> float:
+    calls = [item for item in arm.get("tool_calls") or [] if isinstance(item, dict)]
+    if not calls:
+        return 0.5
+    scores = [_score_call(task_text, call) for call in calls]
+    return round(sum(scores) / len(scores), 4)
+
+
+def _score_call(task_text: str, call: dict[str, Any]) -> float:
+    if call.get("mode") == "blocked":
+        return 0.2
+    if call.get("mode") == "executed":
+        result = call.get("result") if isinstance(call.get("result"), dict) else {}
+        return 0.85 if result.get("success") is not False else 0.35
+    arguments = dict(call.get("arguments") or {})
+    if not arguments:
+        return 0.45
+    non_empty = sum(1 for value in arguments.values() if str(value).strip())
+    completeness = non_empty / max(1, len(arguments))
+    argument_text = " ".join(str(value).lower() for value in arguments.values())
+    relevance = 0.15 if any(token and token in argument_text for token in task_text.lower().split()[:16]) else 0.0
+    return round(min(0.9, 0.5 + 0.3 * completeness + relevance), 4)
+
+
+def _mode_count(arm: dict[str, Any], mode: str) -> int:
+    return sum(1 for item in arm.get("tool_calls") or [] if isinstance(item, dict) and item.get("mode") == mode)
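
Review note on the scoring in `surrogate.py`. The sketch below copies `_score_call` and `_score_arm` (renamed without the underscore) and runs them on invented tool-call traces, so the per-mode values can be checked by hand. The task text and arguments are assumptions made for illustration.

```python
from __future__ import annotations

from typing import Any


def score_call(task_text: str, call: dict[str, Any]) -> float:
    if call.get("mode") == "blocked":
        return 0.2
    if call.get("mode") == "executed":
        result = call.get("result") if isinstance(call.get("result"), dict) else {}
        return 0.85 if result.get("success") is not False else 0.35
    # Surrogate calls: score argument completeness plus a flat relevance bonus.
    arguments = dict(call.get("arguments") or {})
    if not arguments:
        return 0.45
    non_empty = sum(1 for value in arguments.values() if str(value).strip())
    completeness = non_empty / max(1, len(arguments))
    argument_text = " ".join(str(value).lower() for value in arguments.values())
    # Bonus if any of the first 16 task tokens occurs as a substring of the arguments.
    relevance = 0.15 if any(token and token in argument_text for token in task_text.lower().split()[:16]) else 0.0
    return round(min(0.9, 0.5 + 0.3 * completeness + relevance), 4)


def score_arm(task_text: str, arm: dict[str, Any]) -> float:
    calls = [item for item in arm.get("tool_calls") or [] if isinstance(item, dict)]
    if not calls:
        return 0.5
    scores = [score_call(task_text, call) for call in calls]
    return round(sum(scores) / len(scores), 4)


task = "Send the quarterly report to Alice"
relevant = {"mode": "surrogate", "arguments": {"to": "alice@example.com", "body": "Quarterly report attached"}}
unrelated = {"mode": "surrogate", "arguments": {"channel": "#misc", "text": "lunch?"}}
arm = {"tool_calls": [{"mode": "executed", "result": {"success": True}}, {"mode": "blocked"}]}
```

A fully filled, relevant surrogate call reaches the `0.9` cap, above the `0.85` of a successfully executed call. Because relevance is a substring test, short task tokens such as `a` or `to` can match almost any argument text, so the bonus may be granted more often than the name suggests.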
--- a/app-instance/backend/beaver/skills/learning/synthesizer.py
+++ b/app-instance/backend/beaver/skills/learning/synthesizer.py
@ -17,8 +17,9 @@ class SkillDraftSynthesizer:
        evidence_packet: EvidencePacket,
        provider: LLMProvider,
        model: str,
+        base_skill: dict[str, Any] | None = None,
    ) -> dict[str, Any]:
-        return await self._synthesize(candidate, evidence_packet, provider, model, "revise")
+        return await self._synthesize(candidate, evidence_packet, provider, model, "revise", base_skill=base_skill)

    async def synthesize_new_skill(
        self,
@ -27,7 +28,7 @@ class SkillDraftSynthesizer:
        provider: LLMProvider,
        model: str,
    ) -> dict[str, Any]:
-        return await self._synthesize(candidate, evidence_packet, provider, model, "new")
+        return await self._synthesize(candidate, evidence_packet, provider, model, "new", base_skill=None)

    async def synthesize_merge(
        self,
@ -35,8 +36,9 @@ class SkillDraftSynthesizer:
        evidence_packet: EvidencePacket,
        provider: LLMProvider,
        model: str,
+        base_skill: dict[str, Any] | None = None,
    ) -> dict[str, Any]:
-        return await self._synthesize(candidate, evidence_packet, provider, model, "merge")
+        return await self._synthesize(candidate, evidence_packet, provider, model, "merge", base_skill=base_skill)

    async def _synthesize(
        self,
@ -45,15 +47,18 @@ class SkillDraftSynthesizer:
        provider: LLMProvider,
        model: str,
        action: str,
+        *,
+        base_skill: dict[str, Any] | None,
    ) -> dict[str, Any]:
-        prompt = self._build_prompt(candidate, evidence_packet, action)
+        prompt = self._build_prompt(candidate, evidence_packet, action, base_skill=base_skill)
        response = await provider.chat(
            messages=[
                {
                    "role": "system",
                    "content": (
                        "You synthesize Beaver skill drafts from execution evidence. "
-                        "Return only JSON with keys: frontmatter, content, change_reason."
+                        "Return only JSON with keys: frontmatter, content, change_reason, "
+                        "preserved_sections, changed_sections, dropped_sections."
                    ),
                },
                {"role": "user", "content": prompt},
@ -69,11 +74,30 @@ class SkillDraftSynthesizer:
        return self._fallback_payload(candidate, evidence_packet, action)

    @staticmethod
-    def _build_prompt(candidate: SkillLearningCandidate, evidence_packet: EvidencePacket, action: str) -> str:
+    def _build_prompt(
+        candidate: SkillLearningCandidate,
+        evidence_packet: EvidencePacket,
+        action: str,
+        base_skill: dict[str, Any] | None = None,
+    ) -> str:
        tool_names = _coerce_string_list(evidence_packet.metadata.get("tool_names"))
        tool_section = ", ".join(tool_names) if tool_names else "none observed"
        selected_tool_names = _coerce_string_list(evidence_packet.metadata.get("selected_tool_names"))
        selected_tool_section = ", ".join(selected_tool_names) if selected_tool_names else "none recorded"
+        base_section = ""
+        if base_skill:
+            base_section = (
+                "\n\nBase skill snapshot:\n"
+                f"- skill_name: {base_skill.get('skill_name')}\n"
+                f"- version: {base_skill.get('version')}\n"
+                f"- frontmatter: {json.dumps(base_skill.get('frontmatter') or {}, ensure_ascii=False, sort_keys=True)}\n"
+                f"- tool_hints: {base_skill.get('tool_hints') or []}\n"
+                f"- summary: {base_skill.get('summary') or ''}\n"
+                "Base skill content:\n"
+                f"{base_skill.get('content') or ''}\n"
+                "Preserve existing instructions unless the evidence requires a change. "
+                "If any section is changed or dropped, explain it in changed_sections or dropped_sections."
+            )
        return (
            f"Action: {action}\n"
            f"Candidate kind: {candidate.kind}\n"
@ -83,11 +107,13 @@ class SkillDraftSynthesizer:
            f"Run-selected tool names: {selected_tool_section}\n"
            f"Task summaries:\n- " + "\n- ".join(evidence_packet.task_summaries)
            + "\n\nSession excerpts:\n" + "\n\n".join(evidence_packet.session_excerpts)
+            + base_section
            + "\n\nReturn JSON only. The frontmatter object must include:"
            + "\n- description: a concise skill description"
            + "\n- tools: an explicit JSON array of exact tool names this skill needs. "
            + "Prefer called tool names when the workflow depends on them; use run-selected tool names only when clearly required. "
            + "Use [] only when no tool is required."
+            + "\nThe JSON may include preserved_sections, changed_sections, and dropped_sections arrays."
        )

    @staticmethod
@ -111,6 +137,9 @@ class SkillDraftSynthesizer:
            "frontmatter": frontmatter,
            "content": content_value.strip(),
            "change_reason": str(payload.get("change_reason") or ""),
+            "preserved_sections": _coerce_string_list(payload.get("preserved_sections")),
+            "changed_sections": _coerce_string_list(payload.get("changed_sections")),
+            "dropped_sections": _coerce_string_list(payload.get("dropped_sections")),
        }

    @staticmethod
@ -124,6 +153,9 @@ class SkillDraftSynthesizer:
            "frontmatter": frontmatter,
            "content": str(payload.get("content") or "").strip(),
            "change_reason": str(payload.get("change_reason") or ""),
+            "preserved_sections": _coerce_string_list(payload.get("preserved_sections")),
+            "changed_sections": _coerce_string_list(payload.get("changed_sections")),
+            "dropped_sections": _coerce_string_list(payload.get("dropped_sections")),
        }

    @staticmethod
@ -138,6 +170,9 @@ class SkillDraftSynthesizer:
            },
            "content": f"# {title}\n\n## Evidence\n\n{content}\n",
            "change_reason": candidate.reason or f"Fallback {action} synthesis.",
+            "preserved_sections": [],
+            "changed_sections": [],
+            "dropped_sections": [],
        }


--- a/app-instance/backend/tests/unit/test_agent_loop_replay_executor.py
+++ b/app-instance/backend/tests/unit/test_agent_loop_replay_executor.py
@ -0,0 +1,71 @@
+from __future__ import annotations
+
+from pathlib import Path
+from types import SimpleNamespace
+
+import pytest
+
+from beaver.engine.loader import EngineLoader
+from beaver.engine.loop import AgentLoop
+from beaver.engine.providers.base import LLMProvider, LLMResponse, ToolCallRequest
+from beaver.engine.providers.factory import ProviderBundle
+from beaver.skills.learning.replay import ReplayToolExecutor, ReplayToolPolicy
+
+
+class ToolCallingProvider(LLMProvider):
+    def __init__(self) -> None:
+        super().__init__()
+        self.calls = 0
+
+    async def chat(
+        self,
+        messages: list[dict],
+        tools: list[dict] | None = None,
+        model: str | None = None,
+        max_tokens: int | None = None,
+        temperature: float = 0.7,
+        thinking_enabled: bool | None = None,
+    ) -> LLMResponse:
+        self.calls += 1
+        if self.calls == 1:
+            return LLMResponse(
+                content="",
+                tool_calls=[
+                    ToolCallRequest(
+                        id="call-1",
+                        name="read_file",
+                        arguments={"path": "README.md"},
+                    )
+                ],
+            )
+        return LLMResponse(content="done")
+
+    def get_default_model(self) -> str:
+        return "stub"
+
+
+@pytest.mark.asyncio
+async def test_process_direct_uses_replay_tool_executor(tmp_path: Path) -> None:
+    loop = AgentLoop(loader=EngineLoader(workspace=tmp_path))
+    loaded = loop.boot()
+    provider = ToolCallingProvider()
+    runtime = SimpleNamespace(model="stub", provider_name="stub")
+    replay_executor = ReplayToolExecutor(
+        loaded.tool_executor,
+        registry=loaded.tool_registry,
+        policy=ReplayToolPolicy(),
+    )
+
+    result = await loop.process_direct(
+        "Read the README.",
+        provider_bundle=ProviderBundle(main_runtime=runtime, main_provider=provider),  # type: ignore[arg-type]
+        include_skill_assembly=False,
+        pinned_skill_names=[],
+        tool_executor_override=replay_executor,
+        max_tool_iterations=2,
+        source="skill_replay_eval",
+    )
+
+    assert result.output_text == "done"
+    assert replay_executor.traces
+    assert replay_executor.traces[0]["tool_name"] == "read_file"
--- a/app-instance/backend/tests/unit/test_context_builder.py
+++ b/app-instance/backend/tests/unit/test_context_builder.py
@ -26,3 +26,26 @@ def test_context_builder_injects_current_date_and_time() -> None:
    assert "Local UTC offset: +08:00" in system_prompt
    assert '"today", "tomorrow", "now", "this week", and "next month"' in system_prompt
    assert result.messages[-1] == {"role": "user", "content": "今天几号？"}
+
+
+def test_context_builder_uses_simplified_main_agent_prompt_by_default() -> None:
+    system_prompt = ContextBuilder().build_system_prompt(ContextBuildInput())
+
+    assert "你是海狸 (Beaver)" in system_prompt
+    assert "博维资讯系统有限公司研发的 AI 助手" in system_prompt
+    assert "使用简体中文进行面向用户的回复" in system_prompt
+
+
+def test_context_builder_uses_traditional_main_agent_prompt_for_zh_hant() -> None:
+    system_prompt = ContextBuilder().build_system_prompt(ContextBuildInput(prompt_locale="zh-Hant"))
+
+    assert "你是海狸 (Beaver)" in system_prompt
+    assert "博維資訊系統有限公司研發的 AI 助手" in system_prompt
+    assert "使用繁體中文進行面向使用者的回覆" in system_prompt
+
+
+def test_context_builder_uses_english_main_agent_prompt_for_en() -> None:
+    system_prompt = ContextBuilder().build_system_prompt(ContextBuildInput(prompt_locale="en"))
+
+    assert "You are Beaver, an AI assistant developed by Boway Information Systems Co., Ltd." in system_prompt
+    assert "Use English for user-facing replies" in system_prompt
--- a/app-instance/backend/tests/unit/test_litellm_thinking_mode.py
+++ b/app-instance/backend/tests/unit/test_litellm_thinking_mode.py
@ -253,6 +253,91 @@ def test_mistral_vllm_omits_reasoning_body_when_thinking_mode_is_unspecified(
    assert "extra_body" not in captured


+def test_mistral_openai_compatible_private_vllm_uses_reasoning_effort(
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    captured: dict = {}
+
+    class Message:
+        content = "ok"
+        reasoning_content = None
+        tool_calls = []
+
+    class Choice:
+        message = Message()
+        finish_reason = "stop"
+
+    class Response:
+        choices = [Choice()]
+        usage = None
+
+    async def fake_acompletion(**kwargs):
+        captured.update(kwargs)
+        return Response()
+
+    monkeypatch.setattr("beaver.engine.providers.litellm.acompletion", fake_acompletion)
+    monkeypatch.setattr("beaver.engine.providers.litellm.litellm", SimpleNamespace())
+
+    provider = LiteLLMProvider(
+        api_key="EMPTY",
+        api_base="http://172.19.207.103/v1",
+        default_model="Mistral-Medium-3.5-128B",
+        provider_name="openai",
+    )
+    asyncio.run(
+        provider.chat(
+            [{"role": "user", "content": "reply ok"}],
+            model="Mistral-Medium-3.5-128B",
+            thinking_enabled=False,
+        )
+    )
+
+    assert captured["extra_body"] == {"reasoning_effort": "none"}
+    assert "chat_template_kwargs" not in captured["extra_body"]
+    assert "thinking" not in captured["extra_body"]
+
+
+def test_mistral_openai_compatible_private_vllm_omits_body_when_unspecified(
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    captured: dict = {}
+
+    class Message:
+        content = "ok"
+        reasoning_content = None
+        tool_calls = []
+
+    class Choice:
+        message = Message()
+        finish_reason = "stop"
+
+    class Response:
+        choices = [Choice()]
+        usage = None
+
+    async def fake_acompletion(**kwargs):
+        captured.update(kwargs)
+        return Response()
+
+    monkeypatch.setattr("beaver.engine.providers.litellm.acompletion", fake_acompletion)
+    monkeypatch.setattr("beaver.engine.providers.litellm.litellm", SimpleNamespace())
+
+    provider = LiteLLMProvider(
+        api_key="EMPTY",
+        api_base="http://172.19.207.103/v1",
+        default_model="Mistral-Medium-3.5-128B",
+        provider_name="openai",
+    )
+    asyncio.run(
+        provider.chat(
+            [{"role": "user", "content": "reply ok"}],
+            model="Mistral-Medium-3.5-128B",
+        )
+    )
+
+    assert "extra_body" not in captured
+
+
 def test_litellm_provider_sanitizes_tool_call_arguments(monkeypatch: pytest.MonkeyPatch) -> None:
    captured: dict = {}

--- a/app-instance/backend/tests/unit/test_skill_learning_case_selection.py
+++ b/app-instance/backend/tests/unit/test_skill_learning_case_selection.py
@ -0,0 +1,82 @@
+from __future__ import annotations
+
+from beaver.memory.runs import RunRecord
+from beaver.memory.skills import SkillLearningCandidate
+from beaver.skills.learning.case_selection import select_replay_cases
+from beaver.skills.specs import SkillActivationReceipt
+
+
+def _run(
+    run_id: str,
+    *,
+    task_id: str = "task",
+    session_id: str = "session",
+    task_text: str = "debug task",
+    skill_name: str | None = None,
+    skill_version: str = "v0001",
+) -> RunRecord:
+    receipts = []
+    if skill_name:
+        receipts.append(
+            SkillActivationReceipt(
+                run_id=run_id,
+                session_id=session_id,
+                skill_name=skill_name,
+                skill_version=skill_version,
+                content_hash="hash",
+                activated_at="now",
+                activation_reason="selected",
+            )
+        )
+    return RunRecord(
+        run_id=run_id,
+        session_id=session_id,
+        task_id=task_id,
+        attempt_index=1,
+        task_text=task_text,
+        started_at=f"2026-06-08T00:00:{run_id[-2:]}+00:00",
+        ended_at="end",
+        success=True,
+        finish_reason="stop",
+        feedback={"acceptance_type": "accept"},
+        activated_skills=receipts,
+    )
+
+
+def test_select_revise_cases_caps_at_ten_and_prefers_related_skill() -> None:
+    runs = [
+        _run(f"run-{index:02d}", task_id=f"task-{index}", skill_name="debug", skill_version="v0001")
+        for index in range(12)
+    ]
+    candidate = SkillLearningCandidate(
+        candidate_id="candidate-1",
+        kind="revise_skill",
+        source_run_ids=[],
+        source_session_ids=[],
+        related_skill_names=["debug"],
+        reason="revise",
+        evidence={"skill_version": "v0001"},
+    )
+
+    cases = select_replay_cases(candidate, runs)
+
+    assert len(cases) == 10
+    assert all(case["baseline_skill_names"] == ["debug"] for case in cases)
+    assert cases[0]["run_id"] == "run-11"
+
+
+def test_select_new_skill_uses_all_available_source_runs_under_ten() -> None:
+    runs = [_run(f"run-{index:02d}", task_id=f"task-{index}") for index in range(3)]
+    candidate = SkillLearningCandidate(
+        candidate_id="candidate-1",
+        kind="new_skill",
+        source_run_ids=["run-00", "run-01", "run-02"],
+        source_session_ids=["session"],
+        related_skill_names=[],
+        reason="new",
+    )
+
+    cases = select_replay_cases(candidate, runs)
+
+    assert [case["run_id"] for case in cases] == ["run-02", "run-01", "run-00"]
+    assert all(case["baseline_skill_names"] == [] for case in cases)
--- a/app-instance/backend/tests/unit/test_skill_learning_eval.py
+++ b/app-instance/backend/tests/unit/test_skill_learning_eval.py
@ -44,6 +44,7 @@ def _pipeline(tmp_path: Path, *, task_score: float = 0.8) -> SkillLearningPipeli
            ended_at="end",
            success=True,
            finish_reason="stop",
+            feedback={"acceptance_type": "accept"},
            validation_result={"score": task_score, "passed": True},
        )
    )
@ -156,3 +157,59 @@ def test_eval_does_not_clear_safety_failed_status(tmp_path: Path) -> None:
    assert safety.passed is False
    assert report.passed is True
    assert pipeline.get_candidate("candidate-1").status == "safety_failed"
+
+
+class FakeReplayRunner:
+    async def run_arm(self, request):
+        return {
+            "case_id": request.case_id,
+            "arm": request.arm,
+            "session_id": "session-replay",
+            "run_id": f"{request.arm}-run",
+            "task_text": request.task_text,
+            "finish_reason": "stop",
+            "final_answer": "done",
+            "tool_calls": [
+                {
+                    "tool_name": "write_file",
+                    "mode": "executed",
+                    "arguments": {"path": "README.md"},
+                    "result": {"success": True, "content": "ok"},
+                }
+            ],
+            "artifacts": [],
+            "side_effects": [],
+        }
+
+
+def test_eval_report_includes_replay_case_and_coverage(tmp_path: Path) -> None:
+    pipeline = _pipeline(tmp_path)
+    draft = pipeline.draft_service.create_new_skill_draft(
+        skill_name="release-checklist",
+        proposed_content="# Release\n\nRun tests.",
+        proposed_frontmatter={"description": "release", "tools": []},
+        created_by="test",
+        reason="test",
+    )
+    pipeline.learning_store.update_learning_candidate(
+        "candidate-1",
+        draft_skill_name=draft.skill_name,
+        draft_id=draft.draft_id,
+    )
+
+    report = asyncio.run(
+        pipeline.evaluate_draft(
+            "candidate-1",
+            draft.skill_name,
+            draft.draft_id,
+            provider_bundle=_bundle(),
+            replay_runner=FakeReplayRunner(),
+        )
+    )
+
+    assert report.mode == "replay"
+    assert report.eval_version == "replay-v1"
+    assert report.case_reports
+    assert 0.0 <= report.execution_coverage <= 1.0
+    assert 0.0 <= report.surrogate_coverage <= 1.0
+    assert report.confidence in {"low", "medium", "high"}
--- a/app-instance/backend/tests/unit/test_skill_learning_eval_report_model.py
+++ b/app-instance/backend/tests/unit/test_skill_learning_eval_report_model.py
@ -0,0 +1,61 @@
+from __future__ import annotations
+
+from beaver.memory.skills import SkillDraftEvalReport
+
+
+def test_eval_report_defaults_preserve_legacy_payload_shape() -> None:
+    report = SkillDraftEvalReport(
+        report_id="eval-1",
+        skill_name="debug",
+        draft_id="draft-1",
+        candidate_id="candidate-1",
+        passed=True,
+        baseline_score_avg=0.5,
+        candidate_score_avg=0.8,
+        score_delta=0.3,
+        regression_count=0,
+        improved_count=2,
+        unchanged_count=0,
+        cases=[{"run_id": "run-1"}],
+        status="completed",
+        created_at="now",
+    )
+
+    payload = report.to_dict()
+
+    assert payload["eval_version"] == "heuristic-v1"
+    assert payload["mode"] == "heuristic"
+    assert payload["execution_coverage"] == 0.0
+    assert payload["surrogate_coverage"] == 0.0
+    assert payload["blocked_coverage"] == 0.0
+    assert payload["confidence"] == "low"
+    assert payload["case_reports"] == []
+    assert payload["tool_mode_summary"] == {}
+    assert payload["preservation_report"] is None
+    assert payload["cases"] == [{"run_id": "run-1"}]
+
+
+def test_eval_report_reads_legacy_payload_without_replay_fields() -> None:
+    report = SkillDraftEvalReport.from_dict(
+        {
+            "report_id": "eval-legacy",
+            "skill_name": "debug",
+            "draft_id": "draft-1",
+            "candidate_id": "candidate-1",
+            "passed": True,
+            "baseline_score_avg": 0.4,
+            "candidate_score_avg": 0.8,
+            "score_delta": 0.4,
+            "regression_count": 0,
+            "improved_count": 1,
+            "unchanged_count": 0,
+            "cases": [{"run_id": "run-1"}],
+            "status": "completed",
+            "created_at": "now",
+        }
+    )
+
+    assert report.eval_version == "heuristic-v1"
+    assert report.mode == "heuristic"
+    assert report.confidence == "low"
+    assert report.case_reports == []
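
Reviewer note: the two tests above require the replay-era fields to fall back to legacy values, both when a report is constructed without them and when an older stored payload is loaded. The sketch below shows one way to get that behaviour with dataclass defaults. It is a reduced illustration, not the real model: `EvalReportSketch` is a hypothetical name, the field set is trimmed, and the actual `SkillDraftEvalReport` implementation is not shown in this diff. Ignoring unknown keys in `from_dict` is also an assumption.

```python
from dataclasses import asdict, dataclass, field, fields
from typing import Any, Optional


@dataclass
class EvalReportSketch:
    """Hypothetical, trimmed stand-in for SkillDraftEvalReport."""

    report_id: str
    passed: bool
    cases: list[dict[str, Any]] = field(default_factory=list)
    # Replay-era fields. The defaults mirror the values asserted in the tests,
    # so payloads written before these fields existed still load cleanly.
    eval_version: str = "heuristic-v1"
    mode: str = "heuristic"
    execution_coverage: float = 0.0
    surrogate_coverage: float = 0.0
    blocked_coverage: float = 0.0
    confidence: str = "low"
    case_reports: list[dict[str, Any]] = field(default_factory=list)
    tool_mode_summary: dict[str, Any] = field(default_factory=dict)
    preservation_report: Optional[dict[str, Any]] = None

    def to_dict(self) -> dict[str, Any]:
        return asdict(self)

    @classmethod
    def from_dict(cls, payload: dict[str, Any]) -> "EvalReportSketch":
        # Assumption: keys the dataclass does not know about are dropped.
        known = {f.name for f in fields(cls)}
        return cls(**{key: value for key, value in payload.items() if key in known})
```

Mutable defaults go through `default_factory` so that two reports never share the same list or dict instance.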
--- a/app-instance/backend/tests/unit/test_skill_learning_pipeline.py
+++ b/app-instance/backend/tests/unit/test_skill_learning_pipeline.py
@ -5,7 +5,7 @@ from pathlib import Path
 import pytest

 from beaver.memory.runs import RunMemoryStore
-from beaver.memory.skills import SkillLearningCandidate, SkillLearningStore
+from beaver.memory.skills import SkillDraftEvalReport, SkillLearningCandidate, SkillLearningStore
 from beaver.skills.drafts import DraftService
 from beaver.skills.learning import EvidenceSelector, SkillDraftSynthesizer, SkillLearningPipelineService, SkillLearningService
 from beaver.skills.publisher import SkillPublisher
@ -132,3 +132,77 @@ def test_pipeline_reject_removes_draft_from_review_list(tmp_path: Path) -> None:

    assert review.status == SkillReviewState.REJECTED.value
    assert pipeline.list_drafts() == []
+
+
+def test_publish_blocks_low_confidence_replay_report(tmp_path: Path) -> None:
+    pipeline = _pipeline(tmp_path)
+    draft = pipeline.draft_service.create_new_skill_draft(
+        skill_name="low-confidence",
+        proposed_content="# Low\n\nDo it.",
+        proposed_frontmatter={"description": "low", "tools": []},
+        created_by="test",
+        reason="test",
+    )
+    pipeline.learning_store.write_eval_report(
+        SkillDraftEvalReport(
+            report_id="eval-low",
+            skill_name=draft.skill_name,
+            draft_id=draft.draft_id,
+            candidate_id="candidate-1",
+            passed=True,
+            baseline_score_avg=0.7,
+            candidate_score_avg=0.9,
+            score_delta=0.2,
+            regression_count=0,
+            improved_count=1,
+            unchanged_count=0,
+            confidence="low",
+            mode="replay",
+            eval_version="replay-v1",
+            execution_coverage=0.0,
+            surrogate_coverage=1.0,
+            blocked_coverage=0.0,
+        )
+    )
+    pipeline.submit_review(draft.skill_name, draft.draft_id, requested_by="tester")
+    pipeline.approve(draft.skill_name, draft.draft_id, reviewer="tester")
+    pipeline.check_safety(draft.skill_name, draft.draft_id)
+
+    with pytest.raises(ValueError, match="low confidence"):
+        pipeline.publish(draft.skill_name, draft.draft_id, publisher="tester")
+
+
+def test_publish_blocks_failed_preservation_report(tmp_path: Path) -> None:
+    pipeline = _pipeline(tmp_path)
+    draft = pipeline.draft_service.create_new_skill_draft(
+        skill_name="dropped-section",
+        proposed_content="# Skill\n\n## Workflow\n\nDo it.",
+        proposed_frontmatter={"description": "dropped", "tools": []},
+        created_by="test",
+        reason="test",
+    )
+    pipeline.learning_store.write_eval_report(
+        SkillDraftEvalReport(
+            report_id="eval-preservation",
+            skill_name=draft.skill_name,
+            draft_id=draft.draft_id,
+            candidate_id="candidate-1",
+            passed=True,
+            baseline_score_avg=0.7,
+            candidate_score_avg=0.9,
+            score_delta=0.2,
+            regression_count=0,
+            improved_count=1,
+            unchanged_count=0,
+            confidence="medium",
+            mode="replay",
+            eval_version="replay-v1",
+            preservation_report={"passed": False, "risk_level": "high", "dropped_sections": ["Safety"]},
+        )
+    )
+    pipeline.submit_review(draft.skill_name, draft.draft_id, requested_by="tester")
+    pipeline.approve(draft.skill_name, draft.draft_id, reviewer="tester")
+    pipeline.check_safety(draft.skill_name, draft.draft_id)
+
+    with pytest.raises(ValueError, match="preservation"):
+        pipeline.publish(draft.skill_name, draft.draft_id, publisher="tester")
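
Reviewer note: the two publish tests above only fix the observable behaviour, which is a `ValueError` whose message matches `low confidence` or `preservation`. The sketch below shows a gate that would satisfy both. It rests on assumptions: `assert_publishable` is a hypothetical helper, the report is passed as a plain dict, and limiting the confidence check to `mode == "replay"` is a guess based on the test name and on legacy heuristic reports defaulting to `"low"`. The real check inside `pipeline.publish` may differ.

```python
from typing import Any


def assert_publishable(report: dict[str, Any]) -> None:
    """Raise ValueError if the latest eval report should block publishing."""
    # Assumption: only replay reports are gated on confidence, because
    # legacy heuristic reports always carry the default "low" value.
    if report.get("mode") == "replay" and report.get("confidence") == "low":
        raise ValueError("Cannot publish: replay eval report has low confidence.")

    preservation = report.get("preservation_report")
    if preservation is not None and not preservation.get("passed", True):
        dropped = ", ".join(preservation.get("dropped_sections") or []) or "unknown"
        raise ValueError(f"Cannot publish: preservation check failed (dropped: {dropped}).")
```

A report with no `preservation_report` passes the second check, which matches new-skill drafts that have no base skill to compare against.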
--- a/app-instance/backend/tests/unit/test_skill_learning_preservation.py
+++ b/app-instance/backend/tests/unit/test_skill_learning_preservation.py
@ -0,0 +1,27 @@
+from __future__ import annotations
+
+from beaver.skills.learning.preservation import check_preservation
+
+
+def test_preservation_passes_when_base_sections_remain() -> None:
+    base = "# Skill\n\n## Workflow\n\n- Read first.\n\n## Safety\n\n- Do not delete files.\n"
+    draft = "# Skill\n\n## Workflow\n\n- Read first.\n- Then write.\n\n## Safety\n\n- Do not delete files.\n"
+
+    report = check_preservation(base_content=base, draft_content=draft)
+
+    assert report["passed"] is True
+    assert report["risk_level"] == "low"
+    assert "Workflow" in report["preserved_sections"]
+    assert "Safety" in report["preserved_sections"]
+    assert report["dropped_sections"] == []
+
+
+def test_preservation_flags_dropped_section() -> None:
+    base = "# Skill\n\n## Workflow\n\n- Read first.\n\n## Safety\n\n- Do not delete files.\n"
+    draft = "# Skill\n\n## Workflow\n\n- Read first.\n"
+
+    report = check_preservation(base_content=base, draft_content=draft)
+
+    assert report["passed"] is False
+    assert report["risk_level"] == "high"
+    assert "Safety" in report["dropped_sections"]
--- a/app-instance/backend/tests/unit/test_skill_learning_replay.py
+++ b/app-instance/backend/tests/unit/test_skill_learning_replay.py
@ -0,0 +1,67 @@
+from __future__ import annotations
+
+import asyncio
+
+from beaver.skills.learning.replay import ReplayToolExecutor, ReplayToolPolicy, classify_tool_mode
+from beaver.tools.base import BaseTool, ToolContext, ToolResult, ToolSpec
+from beaver.tools.registry.tool_registry import ToolRegistry
+from beaver.tools.runtime.executor import ToolExecutor
+
+
+class FakeTool(BaseTool):
+    def __init__(self, name: str, *, toolset: str = "filesystem", metadata: dict | None = None) -> None:
+        self._spec = ToolSpec(
+            name=name,
+            description=f"{name} tool",
+            input_schema={"type": "object", "properties": {"path": {"type": "string"}}},
+            toolset=toolset,
+            metadata=metadata or {},
+        )
+
+    @property
+    def spec(self) -> ToolSpec:
+        return self._spec
+
+    async def invoke(self, arguments: dict, context: ToolContext) -> ToolResult:
+        return ToolResult(success=True, content=f"executed:{arguments}", tool_name=self.spec.name)
+
+
+def _executor(*tools: FakeTool) -> ReplayToolExecutor:
+    registry = ToolRegistry()
+    for tool in tools:
+        registry.register(tool)
+    return ReplayToolExecutor(ToolExecutor(registry), registry=registry, policy=ReplayToolPolicy())
+
+
+def test_classify_tool_modes_from_spec() -> None:
+    assert classify_tool_mode(FakeTool("read_file").spec) == "executed"
+    assert classify_tool_mode(FakeTool("write_file").spec) == "executed"
+    assert classify_tool_mode(FakeTool("mcp_outlook_send_email", toolset="mcp", metadata={"transport": "mcp"}).spec) == "surrogate"
+    assert classify_tool_mode(FakeTool("delete_account", toolset="mcp", metadata={"transport": "mcp"}).spec) == "blocked"
+
+
+def test_replay_executor_executes_safe_tool_and_records_trace() -> None:
+    executor = _executor(FakeTool("write_file"))
+
+    result = asyncio.run(executor.execute("write_file", {"path": "a.txt"}, context=ToolContext(workspace="/tmp/replay")))
+
+    assert result.success is True
+    assert result.content.startswith("executed:")
+    assert executor.traces[0]["mode"] == "executed"
+    assert executor.traces[0]["tool_name"] == "write_file"
+
+
+def test_replay_executor_surrogates_external_write_and_blocks_destructive() -> None:
+    executor = _executor(
+        FakeTool("mcp_outlook_send_email", toolset="mcp", metadata={"transport": "mcp"}),
+        FakeTool("delete_account", toolset="mcp", metadata={"transport": "mcp"}),
+    )
+
+    send = asyncio.run(executor.execute("mcp_outlook_send_email", {"to": "ada@example.com"}, context=ToolContext()))
+    delete = asyncio.run(executor.execute("delete_account", {"id": "1"}, context=ToolContext()))
+
+    assert send.success is True
+    assert send.error == "replay_surrogate"
+    assert delete.success is False
+    assert delete.error == "replay_blocked"
+    assert [trace["mode"] for trace in executor.traces] == ["surrogate", "blocked"]
--- a/app-instance/backend/tests/unit/test_skill_learning_replay_runner.py
+++ b/app-instance/backend/tests/unit/test_skill_learning_replay_runner.py
@ -0,0 +1,36 @@
+from __future__ import annotations
+
+import asyncio
+from types import SimpleNamespace
+
+from beaver.skills.learning.replay import ReplayArmRequest, ReplayRunner
+
+
+class FakeAgentLoop:
+    def boot(self):
+        return SimpleNamespace(tool_executor=SimpleNamespace(), tool_registry=SimpleNamespace(get=lambda name: None))
+
+    async def process_direct(self, task: str, **kwargs):
+        executor = kwargs["tool_executor_override"]
+        await executor.execute("mcp_outlook_send_email", {"to": "ada@example.com"})
+        return SimpleNamespace(session_id="session-replay", run_id="run-replay", output_text="done", finish_reason="stop")
+
+
+def test_replay_runner_returns_arm_report_with_tool_trace() -> None:
+    runner = ReplayRunner(agent_loop=FakeAgentLoop())
+    request = ReplayArmRequest(
+        case_id="case-1",
+        arm="candidate",
+        task_text="Send a status email to Ada.",
+        pinned_skill_names=[],
+        pinned_skill_contexts=[],
+        provider_bundle=object(),
+        model_settings={"max_tool_iterations": 2},
+    )
+
+    report = asyncio.run(runner.run_arm(request))
+
+    assert report["case_id"] == "case-1"
+    assert report["arm"] == "candidate"
+    assert report["finish_reason"] == "stop"
+    assert report["tool_calls"][0]["tool_name"] == "mcp_outlook_send_email"
--- a/app-instance/backend/tests/unit/test_skill_learning_surrogate.py
+++ b/app-instance/backend/tests/unit/test_skill_learning_surrogate.py
@ -0,0 +1,31 @@
+from __future__ import annotations
+
+import asyncio
+
+from beaver.skills.learning.surrogate import SurrogateToolEvaluator
+
+
+def test_surrogate_scores_complete_candidate_higher_than_missing_baseline() -> None:
+    evaluator = SurrogateToolEvaluator()
+    baseline = {
+        "arm": "baseline",
+        "tool_calls": [
+            {"tool_name": "mcp_outlook_send_email", "mode": "surrogate", "arguments": {"to": "", "subject": ""}},
+        ],
+    }
+    candidate = {
+        "arm": "candidate",
+        "tool_calls": [
+            {
+                "tool_name": "mcp_outlook_send_email",
+                "mode": "surrogate",
+                "arguments": {"to": "ada@example.com", "subject": "Status", "body": "Done"},
+            },
+        ],
+    }
+
+    result = asyncio.run(evaluator.evaluate(task_text="Send a status email to Ada.", baseline=baseline, candidate=candidate))
+
+    assert result["candidate_score"] > result["baseline_score"]
+    assert result["surrogate_tool_count"] == 2
+    assert result["confidence"] in {"low", "medium"}
--- a/app-instance/backend/tests/unit/test_skill_learning_synthesizer_preservation.py
+++ b/app-instance/backend/tests/unit/test_skill_learning_synthesizer_preservation.py
@ -0,0 +1,41 @@
+from __future__ import annotations
+
+from beaver.memory.skills import SkillLearningCandidate
+from beaver.skills.learning.evidence import EvidencePacket
+from beaver.skills.learning.synthesizer import SkillDraftSynthesizer
+
+
+def test_revision_prompt_includes_base_skill_snapshot() -> None:
+    candidate = SkillLearningCandidate(
+        candidate_id="candidate-1",
+        kind="revise_skill",
+        source_run_ids=["run-1"],
+        source_session_ids=["session-1"],
+        related_skill_names=["debug-skill"],
+        reason="Improve debugging flow.",
+    )
+    packet = EvidencePacket(
+        run_ids=["run-1"],
+        session_ids=["session-1"],
+        task_summaries=["debug a failing test"],
+        session_excerpts=["assistant: fixed it"],
+    )
+    prompt = SkillDraftSynthesizer._build_prompt(
+        candidate,
+        packet,
+        "revise",
+        base_skill={
+            "skill_name": "debug-skill",
+            "version": "v0001",
+            "frontmatter": {"description": "Debug tests", "tools": ["read_file"]},
+            "content": "# Debug Skill\n\n## Safety\n\nDo not delete files.",
+            "summary": "Debug tests safely.",
+            "tool_hints": ["read_file"],
+        },
+    )
+
+    assert "Base skill snapshot" in prompt
+    assert "# Debug Skill" in prompt
+    assert "Do not delete files." in prompt
+    assert "preserved_sections" in prompt
+    assert "dropped_sections" in prompt
--- a/app-instance/backend/tests/unit/test_task_mode_feedback.py
+++ b/app-instance/backend/tests/unit/test_task_mode_feedback.py
@ -15,6 +15,7 @@ class StubProvider(LLMProvider):
    def __init__(self, responses: list[LLMResponse]) -> None:
        super().__init__()
        self._responses = list(responses)
+        self.seen_messages: list[list[dict]] = []

    async def chat(
        self,
@ -26,6 +27,7 @@ class StubProvider(LLMProvider):
    ) -> LLMResponse:
        if not self._responses:
            raise AssertionError("No stubbed provider responses left")
+        self.seen_messages.append(messages)
        return self._responses.pop(0)

    def get_default_model(self) -> str:
@ -99,6 +101,52 @@ def test_task_run_records_evidence_and_waits_for_acceptance(tmp_path: Path) -> N
    assert "validated" not in event_types


+def test_task_mode_injects_prompt_locale_output_language(tmp_path: Path) -> None:
+    service = AgentService(
+        loader=EngineLoader(
+            workspace=tmp_path,
+            task_execution_planner=StubTaskExecutionPlanner(),
+        )
+    )
+    main_provider = StubProvider(
+        [
+            LLMResponse(
+                content="Done",
+                finish_reason="stop",
+                provider_name="stub",
+                model="stub-model",
+            )
+        ]
+    )
+    bundle = ProviderBundle(
+        main_runtime=SimpleNamespace(model="stub-model", provider_name="stub"),
+        main_provider=main_provider,
+        auxiliary_runtime=SimpleNamespace(model="stub-model", provider_name="stub"),
+        auxiliary_provider=StubProvider([_route_response("new_task", "Product summary")]),
+    )
+
+    result = asyncio.run(
+        service.process_direct(
+            "Summarize the uploaded report in English",
+            session_id="web:locale-task",
+            prompt_locale="en",
+            provider_bundle=bundle,
+        )
+    )
+
+    assert result.task_id
+    assert main_provider.seen_messages
+    system_prompt = main_provider.seen_messages[-1][0]["content"]
+    assert "Use English for user-facing replies" in system_prompt
+    assert "Output language: English." in system_prompt
+
+    task_service = service.create_loop().boot().task_service
+    assert task_service is not None
+    task = task_service.get_task(result.task_id)
+    assert task is not None
+    assert task.metadata["prompt_locale"] == "en"
+
+
 def test_unrelated_simple_chat_auto_accepts_active_task(tmp_path: Path) -> None:
    service = AgentService(
        loader=EngineLoader(
--- a/app-instance/backend/tests/unit/test_websocket_chat.py
+++ b/app-instance/backend/tests/unit/test_websocket_chat.py
@ -73,6 +73,7 @@ def test_websocket_message_returns_chat_metadata_and_session_updated() -> None:
                {
                    "type": "message",
                    "content": "hello",
+                    "prompt_locale": "zh-Hant",
                    "metadata": {"source": "test"},
                    "attachments": [{"file_id": "file-1", "name": "a.txt"}],
                }
@ -89,6 +90,7 @@ def test_websocket_message_returns_chat_metadata_and_session_updated() -> None:
            "user_id": None,
            "title": None,
            "execution_context": None,
+            "prompt_locale": "zh-Hant",
            "model": None,
            "provider_name": None,
            "embedding_model": None,
@ -134,6 +136,7 @@ def test_websocket_message_uses_direct_processing_when_loop_is_not_running() ->
            "user_id": None,
            "title": None,
            "execution_context": None,
+            "prompt_locale": None,
            "model": None,
            "provider_name": None,
            "embedding_model": None,
@ -149,7 +152,10 @@ def test_rest_chat_uses_direct_processing_when_loop_is_not_running() -> None:
    app = create_app(service=service, manage_service_lifecycle=False)

    with TestClient(app) as client:
-        response = client.post("/api/chat", json={"session_id": "web:alpha", "message": "hello"})
+        response = client.post(
+            "/api/chat",
+            json={"session_id": "web:alpha", "message": "hello", "prompt_locale": "en"},
+        )

    assert response.status_code == 200
    assert service.calls == [
@ -160,6 +166,7 @@ def test_rest_chat_uses_direct_processing_when_loop_is_not_running() -> None:
            "user_id": None,
            "title": None,
            "execution_context": None,
+            "prompt_locale": "en",
            "model": None,
            "provider_name": None,
            "embedding_model": None,
--- a/app-instance/create-instance.sh
+++ b/app-instance/create-instance.sh
@ -18,6 +18,7 @@ AUTHZ_BASE_URL=""
 AUTHZ_INTERNAL_TOKEN=""
 AUTHZ_OUTLOOK_MCP_URL=""
 OUTLOOK_MCP_SERVER_ID="${OUTLOOK_MCP_SERVER_ID:-outlook_mcp}"
+OUTLOOK_MCP_CALL_TIMEOUT_SECONDS="${OUTLOOK_MCP_CALL_TIMEOUT_SECONDS:-60}"
 USER_FILES_MAX_UPLOAD_BYTES="${USER_FILES_MAX_UPLOAD_BYTES:-}"
 EXTERNAL_CONNECTOR_BASE_URL="${EXTERNAL_CONNECTOR_BASE_URL:-http://external-connector:8787}"
 EXTERNAL_CONNECTOR_TOKEN="${EXTERNAL_CONNECTOR_TOKEN:-}"
@ -76,6 +77,8 @@ Optional:
                              Managed Outlook MCP URL for AuthZ mode.
  --outlook-mcp-server-id <id>
                              Default Outlook MCP server id. Default: outlook_mcp
+  --outlook-mcp-call-timeout-seconds <seconds>
+                              Backend wait timeout for Outlook MCP calls. Default: 60
  --user-files-max-upload-bytes <bytes>
                              Optional max upload size for the user file system.
  --external-connector-base-url <url>
@ -557,6 +560,10 @@ while [[ $# -gt 0 ]]; do
      OUTLOOK_MCP_SERVER_ID="${2:-}"
      shift 2
      ;;
+    --outlook-mcp-call-timeout-seconds)
+      OUTLOOK_MCP_CALL_TIMEOUT_SECONDS="${2:-}"
+      shift 2
+      ;;
    --user-files-max-upload-bytes)
      USER_FILES_MAX_UPLOAD_BYTES="${2:-}"
      shift 2
@ -774,6 +781,7 @@ RUN_ARGS=(
  -e "APP_BACKEND_PORT=18080"
  -e "BEAVER_ENABLE_SELF_RESTART=1"
  -e "BEAVER_OUTLOOK_MCP_SERVER_ID=${OUTLOOK_MCP_SERVER_ID}"
+  -e "BEAVER_OUTLOOK_MCP_CALL_TIMEOUT_SECONDS=${OUTLOOK_MCP_CALL_TIMEOUT_SECONDS}"
  -e "EXTERNAL_CONNECTOR_BASE_URL=${EXTERNAL_CONNECTOR_BASE_URL}"
  --label "beaver.instance.id=${INSTANCE_ID}"
  --label "beaver.instance.slug=${INSTANCE_SLUG}"
--- a/app-instance/frontend/app/(app)/page.tsx
+++ b/app-instance/frontend/app/(app)/page.tsx
@ -23,6 +23,7 @@ import {
  getSession,
  getSessionProcess,
  listSessions,
+  promptLocaleForAppLocale,
  sendMessage,
  submitChatFeedback,
  uploadFile,
@ -44,7 +45,7 @@ function isSessionUpdatedEvent(data: WsEvent | Record<string, unknown>): data is
  return data.type === 'session_updated' && typeof data.session_id === 'string';
 }

-function activeTaskStatusLabel(status: string, locale: 'zh-CN' | 'en-US') {
+function activeTaskStatusLabel(status: string, locale: string) {
  if (status === 'needs_revision') return pickAppText(locale, '待修改', 'Needs revision');
  if (status === 'awaiting_acceptance') return pickAppText(locale, '待验收', 'Awaiting acceptance');
  if (status === 'running') return pickAppText(locale, '进行中', 'Running');
@ -140,8 +141,9 @@ export default function ChatPage() {
        liveRuns: processRuns,
        liveEvents: processEvents,
        liveArtifacts: processArtifacts,
+        locale,
      }),
-    [activeTaskDetail, processArtifacts, processEvents, processRuns]
+    [activeTaskDetail, locale, processArtifacts, processEvents, processRuns]
  );

  const loadSessions = useCallback(async () => {
@ -400,6 +402,7 @@ export default function ChatPage() {
        type: 'message',
        content: msgContent,
        thinking_enabled: thinkingModeEnabled,
+        prompt_locale: promptLocaleForAppLocale(locale),
      };
      if (attachments.length > 0) {
        wsPayload.attachments = attachments;
--- a/app-instance/frontend/app/(app)/skills/page.tsx
+++ b/app-instance/frontend/app/(app)/skills/page.tsx
@ -1088,6 +1088,12 @@ function EvalReportPanel({ report }: { report?: SkillDraftEvalReport | null }) {
        />
      </div>

+      <div className="mt-3 grid gap-2 sm:grid-cols-3">
+        <MetricTile label={t('执行覆盖', 'Execution')} value={formatPercent(report.execution_coverage)} />
+        <MetricTile label={t('替代评估', 'Surrogate')} value={formatPercent(report.surrogate_coverage)} />
+        <MetricTile label={t('置信度', 'Confidence')} value={report.confidence || 'low'} />
+      </div>
+
      <div className="mt-3 grid gap-2 sm:grid-cols-3">
        <ReadableFact icon={<CheckCircle2 className="h-4 w-4" />} label={t('改进', 'Improved')} value={String(report.improved_count)} />
        <ReadableFact icon={<XCircle className="h-4 w-4" />} label={t('回退', 'Regressed')} value={String(report.regression_count)} />
@ -1135,6 +1141,12 @@ function EvalReportPanel({ report }: { report?: SkillDraftEvalReport | null }) {
          </div>
        </div>
      )}
+      {Array.isArray(report.case_reports) && report.case_reports.length > 0 ? (
+        <RawDetails title={t('回放用例报告', 'Replay case reports')} payload={report.case_reports} />
+      ) : null}
+      {report.preservation_report ? (
+        <RawDetails title={t('保留报告', 'Preservation report')} payload={report.preservation_report} />
+      ) : null}
      <div className="mt-3 text-xs text-muted-foreground">{formatDateTime(report.created_at)}</div>
      <RawDetails title={t('原始评估报告', 'Raw eval report')} payload={report} />
    </div>
@ -1387,6 +1399,11 @@ function formatScore(value: number): string {
  return value.toFixed(2);
 }

+function formatPercent(value?: number | null): string {
+  if (typeof value !== 'number' || Number.isNaN(value)) return '0%';
+  return `${Math.round(value * 100)}%`;
+}
+
 function formatSignedScore(value: number): string {
  if (!Number.isFinite(value)) return '-';
  return `${value >= 0 ? '+' : ''}${value.toFixed(2)}`;
--- a/app-instance/frontend/app/(app)/tasks/[taskId]/page.tsx
+++ b/app-instance/frontend/app/(app)/tasks/[taskId]/page.tsx
@ -97,8 +97,9 @@ export default function TaskDetailPage() {
        liveRuns: processRuns,
        liveEvents: processEvents,
        liveArtifacts: processArtifacts,
+        locale,
      }),
-    [backendTask, processArtifacts, processEvents, processRuns]
+    [backendTask, locale, processArtifacts, processEvents, processRuns]
  );
  const timelineCards = timelineView?.cards ?? [];

--- a/app-instance/frontend/app/(app)/tasks/page.tsx
+++ b/app-instance/frontend/app/(app)/tasks/page.tsx
@ -222,7 +222,7 @@ function OrdinaryTaskCard({
  onDelete,
 }: {
  task: BackendTask;
-  locale: 'zh-CN' | 'en-US';
+  locale: string;
  onDelete: () => void;
 }) {
  const title = task.short_title || String(task.metadata?.short_title || '') || task.description || task.goal || task.task_id;
@ -284,7 +284,7 @@ function OrdinaryTaskCard({
  );
 }

-function taskStatusLabel(status: string, locale: 'zh-CN' | 'en-US') {
+function taskStatusLabel(status: string, locale: string) {
  const labels: Record<string, [string, string]> = {
    open: ['已创建', 'Open'],
    running: ['执行中', 'Running'],
@ -297,7 +297,7 @@ function taskStatusLabel(status: string, locale: 'zh-CN' | 'en-US') {
  return label ? pickAppText(locale, label[0], label[1]) : status;
 }

-function taskSourceLabel(task: BackendTask, locale: 'zh-CN' | 'en-US') {
+function taskSourceLabel(task: BackendTask, locale: string) {
  if (task.metadata?.source === 'scheduled_run') {
    return pickAppText(locale, '定时通知修改', 'Scheduled notification revision');
  }
@ -520,7 +520,7 @@ function ScheduledJobCard({
  onRemove,
 }: {
  job: CronJob;
-  locale: 'zh-CN' | 'en-US';
+  locale: string;
  formatTime: (ms: number | null) => string;
  onToggle: (checked: boolean) => void;
  onRun: () => void;
--- a/app-instance/frontend/components/Header.tsx
+++ b/app-instance/frontend/components/Header.tsx
@ -155,7 +155,7 @@ const Header = () => {
            <div className="flex min-w-0 items-center gap-2">
              <button
                type="button"
-                className="inline-flex h-11 w-11 items-center justify-center rounded-full border border-[#E6E1DE] bg-white text-[#1D1715] transition-colors hover:bg-[#F7F5F4] 2xl:hidden"
+                className="inline-flex h-11 w-11 items-center justify-center rounded-full border border-[#E6E1DE] bg-white text-[#1D1715] transition-colors hover:bg-[#F7F5F4] min-[1800px]:hidden"
                aria-label={mobileMenuOpen ? pickAppText(locale, '关闭导航', 'Close navigation') : pickAppText(locale, '打开导航', 'Open navigation')}
                aria-expanded={mobileMenuOpen}
                aria-controls="app-primary-mobile-nav"
@ -170,7 +170,7 @@ const Header = () => {
              </Link>
            </div>

-            <nav className="hidden items-center gap-1 rounded-full border border-[#E6E1DE] bg-white px-1.5 py-1 shadow-[0_1px_2px_rgba(0,0,0,0.04)] 2xl:flex">
+            <nav className="hidden items-center gap-1 rounded-full border border-[#E6E1DE] bg-white px-1.5 py-1 shadow-[0_1px_2px_rgba(0,0,0,0.04)] min-[1800px]:flex">
              {renderNavLinks(false)}
            </nav>

@ -185,7 +185,7 @@ const Header = () => {
                    <PopoverTrigger asChild>
                      <button
                        type="button"
-                        className="flex h-11 w-11 items-center justify-center gap-2 rounded-full border border-[#E6E1DE] bg-white px-1.5 text-sm font-medium text-[#1D1715] transition-colors hover:bg-[#F7F5F4] sm:w-auto sm:justify-start sm:px-2"
+                        className="flex h-11 w-11 min-w-0 items-center justify-center gap-2 rounded-full border border-[#E6E1DE] bg-white px-1.5 text-sm font-medium text-[#1D1715] transition-colors hover:bg-[#F7F5F4] sm:w-auto sm:max-w-[11rem] sm:justify-start sm:px-2"
                        aria-label={pickAppText(locale, '打开账号菜单', 'Open account menu')}
                      >
                        <Avatar className="h-8 w-8 border border-[#E6E1DE]">
@ -193,7 +193,7 @@ const Header = () => {
                            {userInitial}
                          </AvatarFallback>
                        </Avatar>
-                        <span className="hidden max-w-28 truncate sm:block">{user.username}</span>
+                        <span className="hidden min-w-0 max-w-24 truncate sm:block">{user.username}</span>
                        <ChevronDown className="hidden h-4 w-4 text-muted-foreground sm:block" />
                      </button>
                    </PopoverTrigger>
@ -245,14 +245,14 @@ const Header = () => {
        <>
          <button
            type="button"
-            className="fixed inset-x-0 bottom-0 top-16 z-40 bg-black/40 2xl:hidden"
+            className="fixed inset-x-0 bottom-0 top-16 z-40 bg-black/40 min-[1800px]:hidden"
            aria-label={pickAppText(locale, '关闭导航', 'Close navigation')}
            onClick={() => setMobileMenuOpen(false)}
          />
          <nav
            id="app-primary-mobile-nav"
            aria-label={pickAppText(locale, '主导航', 'Primary navigation')}
-            className="fixed bottom-0 left-0 top-16 z-[45] isolate w-[min(86vw,320px)] overflow-y-auto border-r border-[#E6E1DE] bg-background text-foreground shadow-[12px_0_32px_rgba(29,23,21,0.24)] animate-in slide-in-from-left-full duration-200 2xl:hidden"
+            className="fixed bottom-0 left-0 top-16 z-[45] isolate w-[min(86vw,320px)] overflow-y-auto border-r border-[#E6E1DE] bg-background text-foreground shadow-[12px_0_32px_rgba(29,23,21,0.24)] animate-in slide-in-from-left-full duration-200 min-[1800px]:hidden"
          >
            <div className="min-h-full bg-background px-4 py-5">
              <div className="grid gap-2 bg-background">
--- a/app-instance/frontend/components/LanguageSwitcher.tsx
+++ b/app-instance/frontend/components/LanguageSwitcher.tsx
@ -2,40 +2,49 @@

 import { Languages } from 'lucide-react';

+import {
+  Select,
+  SelectContent,
+  SelectItem,
+  SelectTrigger,
+  SelectValue,
+} from '@/components/ui/select';
+import type { AppLocale } from '@/lib/i18n/core';
+import { pickAppText } from '@/lib/i18n/core';
 import { useAppI18n } from '@/lib/i18n/provider';
 import { cn } from '@/lib/utils';

 const OPTIONS = [
-  { value: 'zh-CN', label: 'ZH' },
-  { value: 'en-US', label: 'EN' },
+  { value: 'zh-CN', label: '中文', shortLabel: '中' },
+  { value: 'en-US', label: 'English', shortLabel: 'EN' },
+  { value: 'zh-Hant', label: '繁體中文', shortLabel: '繁' },
 ] as const;

 export function LanguageSwitcher({ className }: { className?: string }) {
  const { locale, setLocale } = useAppI18n();
+  const selectedOption = OPTIONS.find((option) => option.value === locale) ?? OPTIONS[0];

  return (
-    <div
-      className={cn(
-        'inline-flex items-center gap-1 rounded-md border border-border bg-muted/30 p-1',
-        className
-      )}
-    >
-      <Languages className="h-3.5 w-3.5 text-muted-foreground" />
-      {OPTIONS.map((option) => (
-        <button
-          key={option.value}
-          type="button"
-          onClick={() => setLocale(option.value)}
-          className={cn(
-            'h-11 w-11 rounded text-xs font-medium transition-colors',
-            locale === option.value
-              ? 'bg-background text-foreground shadow-sm'
-              : 'text-muted-foreground hover:text-foreground'
-          )}
-        >
-          {option.label}
-        </button>
-      ))}
-    </div>
+    <Select value={locale} onValueChange={(value) => setLocale(value as AppLocale)}>
+      <SelectTrigger
+        className={cn('h-11 w-[92px] gap-1.5 bg-muted/30 px-2 sm:w-[138px] sm:gap-2 sm:px-3', className)}
+        aria-label={pickAppText(locale, '选择语言', 'Select language')}
+      >
+        <Languages className="h-3.5 w-3.5 shrink-0 text-muted-foreground" />
+        <SelectValue aria-label={selectedOption.label}>
+          <span className="min-w-0 flex-1 truncate text-left">
+            <span className="sm:hidden">{selectedOption.shortLabel}</span>
+            <span className="hidden sm:inline">{selectedOption.label}</span>
+          </span>
+        </SelectValue>
+      </SelectTrigger>
+      <SelectContent align="end">
+        {OPTIONS.map((option) => (
+          <SelectItem key={option.value} value={option.value}>
+            {option.label}
+          </SelectItem>
+        ))}
+      </SelectContent>
+    </Select>
  );
 }
--- a/app-instance/frontend/components/chat-workbench/AgentTeamBlock.tsx
+++ b/app-instance/frontend/components/chat-workbench/AgentTeamBlock.tsx
@ -6,6 +6,7 @@ import { CheckCircle2, Loader2, Sparkles } from 'lucide-react';
 import type { ProcessArtifact, ProcessEvent, ProcessRun } from '@/types';
 import { Badge } from '@/components/ui/badge';
 import { appArtifactPreview, appFeedRoleLabel, appStatusLabel } from '@/lib/i18n/common';
+import type { AppLocale } from '@/lib/i18n/core';
 import { pickAppText } from '@/lib/i18n/core';
 import { useAppI18n } from '@/lib/i18n/provider';
 import { cn } from '@/lib/utils';
@ -84,7 +85,7 @@ function buildFeed(
  run: ProcessRun,
  events: ProcessEvent[],
  artifacts: ProcessArtifact[],
-  locale: 'zh-CN' | 'en-US',
+  locale: AppLocale,
 ): AgentFeedItem[] {
  const items: AgentFeedItem[] = [];
  let hasLeadBubble = false;
@ -152,7 +153,7 @@ function buildFeed(
    .slice(-8);
 }

-function runSummary(run: ProcessRun, feed: AgentFeedItem[], locale: 'zh-CN' | 'en-US'): string {
+function runSummary(run: ProcessRun, feed: AgentFeedItem[], locale: AppLocale): string {
  if (run.summary?.trim()) {
    return run.summary.trim();
  }
@ -262,7 +263,7 @@ function AgentBubble({
  locale,
 }: {
  item: AgentFeedItem;
-  locale: 'zh-CN' | 'en-US';
+  locale: AppLocale;
 }) {
  return (
    <div
@ -297,7 +298,7 @@ function LiveAgentCard({
  phase: RunCardPhase;
  accentIndex: number;
  onSelect: () => void;
-  locale: 'zh-CN' | 'en-US';
+  locale: AppLocale;
 }) {
  const showSpinner = !TERMINAL_STATUSES.has(run.status);
  const accent = accentFor(accentIndex);
@ -370,7 +371,7 @@ function ResultCard({
  selected: boolean;
  accentIndex: number;
  onSelect: () => void;
-  locale: 'zh-CN' | 'en-US';
+  locale: AppLocale;
 }) {
  const accent = accentFor(accentIndex);

--- a/app-instance/frontend/components/chat-workbench/ArtifactSidebar.tsx
+++ b/app-instance/frontend/components/chat-workbench/ArtifactSidebar.tsx
@ -18,7 +18,7 @@ function artifactIcon(type: ProcessArtifact['artifact_type']) {
  return <FileOutput className="w-4 h-4" />;
 }

-function renderArtifactBody(artifact: ProcessArtifact, locale: 'zh-CN' | 'en-US') {
+function renderArtifactBody(artifact: ProcessArtifact, locale: string) {
  if (artifact.artifact_type === 'json' && artifact.data !== undefined) {
    return (
      <pre className="text-[11px] leading-5 whitespace-pre-wrap break-words rounded-md bg-background/70 p-3 overflow-x-auto">
--- a/app-instance/frontend/components/chat-workbench/CurrentSessionProgressSidebar.tsx
+++ b/app-instance/frontend/components/chat-workbench/CurrentSessionProgressSidebar.tsx
@ -21,17 +21,19 @@ function ProgressPanel({
  const { locale } = useAppI18n();

  return (
-    <div className="flex h-full flex-col bg-[#FBFAF9]">
-      <div className="flex h-16 shrink-0 items-center justify-between border-b border-[#E6E1DE] px-5">
-        <div>
-          <h2 className="text-base font-semibold text-foreground">
+    <div className="flex h-full min-w-0 flex-col overflow-hidden bg-[#FBFAF9]">
+      <div className="flex h-16 min-w-0 shrink-0 items-center justify-between gap-3 border-b border-[#E6E1DE] px-5">
+        <div className="min-w-0">
+          <h2 className="truncate text-base font-semibold text-foreground">
            {pickAppText(locale, '当前会话的运行进度', 'Current Session Progress')}
          </h2>
-          <p className="flex items-center gap-1.5 text-xs text-muted-foreground">
+          <p className="flex min-w-0 items-center gap-1.5 text-xs text-muted-foreground">
            {isLive ? <Activity className="h-3.5 w-3.5" /> : null}
-            {isLive
-              ? pickAppText(locale, '任务时间线实时更新', 'Task timeline updates live')
-              : pickAppText(locale, '与任务详情时间线一致', 'Matches the Task detail timeline')}
+            <span className="truncate">
+              {isLive
+                ? pickAppText(locale, '任务时间线实时更新', 'Task timeline updates live')
+                : pickAppText(locale, '与任务详情时间线一致', 'Matches the Task detail timeline')}
+            </span>
          </p>
        </div>
        {onClose ? (
@ -46,8 +48,8 @@ function ProgressPanel({
        ) : null}
      </div>

-      <ScrollArea className="min-h-0 flex-1 px-4 py-4">
-        <div className="pb-6">
+      <ScrollArea className="min-h-0 min-w-0 flex-1 overflow-hidden px-4 py-4">
+        <div className="min-w-0 max-w-full pb-6">
          <TaskTimeline cards={cards} isLive={isLive} showHeader={false} />
        </div>
      </ScrollArea>
@ -67,7 +69,7 @@ export function CurrentSessionProgressSidebar({

  return (
    <>
-      <aside className="hidden h-full w-[380px] shrink-0 border-l border-[#E6E1DE] xl:flex">
+      <aside className="hidden h-full w-[380px] min-w-0 shrink-0 overflow-hidden border-l border-[#E6E1DE] xl:flex">
        <ProgressPanel cards={cards} isLive={isLive} />
      </aside>

@ -88,7 +90,7 @@ export function CurrentSessionProgressSidebar({
            onClick={() => setMobileOpen(false)}
            aria-label={pickAppText(locale, '关闭进度面板', 'Close progress panel')}
          />
-          <div className="absolute inset-y-0 right-0 w-[min(92vw,390px)] border-l border-[#E6E1DE] shadow-2xl">
+          <div className="absolute inset-y-0 right-0 w-[min(92vw,390px)] min-w-0 overflow-hidden border-l border-[#E6E1DE] shadow-2xl">
            <ProgressPanel cards={cards} isLive={isLive} onClose={() => setMobileOpen(false)} />
          </div>
        </div>
--- a/app-instance/frontend/components/task-detail/TaskAcceptanceCard.tsx
+++ b/app-instance/frontend/components/task-detail/TaskAcceptanceCard.tsx
@ -55,14 +55,14 @@ function feedbackKind(item: TaskFeedbackItem): string {
  return String(item.acceptance_type || item.feedback_type || '');
 }

-function humanFeedback(type: string, locale: 'zh-CN' | 'en-US') {
+function humanFeedback(type: string, locale: string) {
  if (type === 'accept' || type === 'satisfied') return pickAppText(locale, '接受', 'Accepted');
  if (type === 'revise') return pickAppText(locale, '请求修改', 'Revision requested');
  if (type === 'abandon') return pickAppText(locale, '放弃任务', 'Abandoned');
  return type || pickAppText(locale, '验收', 'Acceptance');
 }

-function humanTaskStatus(status: string, locale: 'zh-CN' | 'en-US') {
+function humanTaskStatus(status: string, locale: string) {
  const labels: Record<string, [string, string]> = {
    open: ['已创建', 'Open'],
    running: ['执行中', 'Running'],
--- a/app-instance/frontend/components/task-detail/TaskLiveHeader.tsx
+++ b/app-instance/frontend/components/task-detail/TaskLiveHeader.tsx
@ -24,7 +24,7 @@ function isRuntimeStatus(status: string): status is TaskRuntimeStatus {
  return RUNTIME_STATUSES.has(status);
 }

-function humanTaskStatus(status: string, locale: 'zh-CN' | 'en-US') {
+function humanTaskStatus(status: string, locale: string) {
  const map: Record<string, [string, string]> = {
    open: ['已创建', 'Open'],
    running: ['执行中', 'Running'],
--- a/app-instance/frontend/components/task-detail/TaskSideRail.tsx
+++ b/app-instance/frontend/components/task-detail/TaskSideRail.tsx
@ -26,7 +26,7 @@ function isRuntimeStatus(status: string): status is TaskRuntimeStatus {
  return RUNTIME_STATUSES.has(status);
 }

-function humanTaskStatus(status: string, locale: 'zh-CN' | 'en-US') {
+function humanTaskStatus(status: string, locale: string) {
  const map: Record<string, [string, string]> = {
    open: ['已创建', 'Open'],
    running: ['执行中', 'Running'],
@ -47,7 +47,7 @@ function latestFeedback(task: BackendTask): Record<string, unknown> | null {
  return [...(task.feedback ?? [])].reverse()[0] ?? null;
 }

-function acceptanceState(task: BackendTask, locale: 'zh-CN' | 'en-US'): string {
+function acceptanceState(task: BackendTask, locale: string): string {
  const feedback = latestFeedback(task);
  const kind = String(feedback?.acceptance_type || feedback?.feedback_type || '');
  if (kind) return humanTaskStatus(kind, locale);
--- a/app-instance/frontend/components/task-detail/TaskTimelineCard.tsx
+++ b/app-instance/frontend/components/task-detail/TaskTimelineCard.tsx
@ -93,7 +93,7 @@ function detailsJson(details: Record<string, unknown>): string {
  }
 }

-function cardTypeLabel(type: TaskTimelineCardType, locale: 'zh-CN' | 'en-US') {
+function cardTypeLabel(type: TaskTimelineCardType, locale: string) {
  const labels: Record<TaskTimelineCardType, [string, string]> = {
    task_created: ['任务', 'Task'],
    plan: ['计划', 'Plan'],
@ -114,7 +114,7 @@ function cardTypeLabel(type: TaskTimelineCardType, locale: 'zh-CN' | 'en-US') {
  return pickAppText(locale, label[0], label[1]);
 }

-function humanStatus(status: string, locale: 'zh-CN' | 'en-US') {
+function humanStatus(status: string, locale: string) {
  const labels: Record<string, [string, string]> = {
    open: ['已创建', 'Open'],
    running: ['执行中', 'Running'],
@ -137,7 +137,7 @@ function historyVersions(details: Record<string, unknown> | undefined): Array<Re
  return Array.isArray(versions) ? versions.filter((item): item is Record<string, unknown> => Boolean(item) && typeof item === 'object') : [];
 }

-function renderHistoryStatus(version: Record<string, unknown>, locale: 'zh-CN' | 'en-US') {
+function renderHistoryStatus(version: Record<string, unknown>, locale: string) {
  const status = String(version.acceptanceType || version.status || '');
  return status ? humanStatus(status, locale) : pickAppText(locale, '历史版本', 'Previous version');
 }
@ -184,30 +184,30 @@ export function TaskTimelineCard({ card, resultAcceptance, reviewTargetId }: Pro
  return (
    <Card id={shouldRenderResultAcceptance ? reviewTargetId : undefined} className="min-w-0 max-w-full scroll-mt-44 overflow-hidden rounded-md">
      <CardContent className="p-4">
-        <div className="flex gap-3">
+        <div className="flex min-w-0 gap-3">
          <div className="flex h-9 w-9 shrink-0 items-center justify-center rounded-md bg-muted">
            <Icon className="h-4 w-4 text-muted-foreground" />
          </div>
          <div className="min-w-0 flex-1">
-            <div className="flex items-start justify-between gap-3">
-              <div className="min-w-0 flex-1">
-                <div className="flex min-w-0 items-center gap-2">
-                  <h3 className="min-w-0 flex-1 truncate text-sm font-semibold">{card.title}</h3>
-                  <Badge variant="secondary" className="shrink-0 text-[11px]">
+            <div className="flex min-w-0 flex-wrap items-start justify-between gap-2">
+              <div className="min-w-0 flex-1 basis-44">
+                <div className="flex min-w-0 flex-wrap items-center gap-2">
+                  <h3 className={`min-w-0 flex-1 basis-32 text-sm font-semibold ${containedLongTextClass}`}>{card.title}</h3>
+                  <Badge variant="secondary" className="max-w-full text-[11px]">
                    {cardTypeLabel(card.type, locale)}
                  </Badge>
                </div>
-                <div className="mt-1 flex flex-wrap gap-x-3 gap-y-1 text-xs text-muted-foreground">
-                  {card.actorName ? <span className={containedLongTextClass}>{card.actorName}</span> : null}
-                  <span>{formatTaskRuntimeTime(card.createdAt, locale)}</span>
-                  {card.runId ? <span className="font-mono">{card.runId.slice(0, 8)}</span> : null}
+                <div className="mt-1 flex min-w-0 flex-wrap gap-x-3 gap-y-1 text-xs text-muted-foreground">
+                  {card.actorName ? <span className={`max-w-full ${containedLongTextClass}`}>{card.actorName}</span> : null}
+                  <span className="max-w-full">{formatTaskRuntimeTime(card.createdAt, locale)}</span>
+                  {card.runId ? <span className={`max-w-full font-mono ${containedLongTextClass}`}>{card.runId.slice(0, 8)}</span> : null}
                </div>
              </div>
              {card.status ? (
                isRuntimeStatus(card.status) ? (
-                  <TaskRuntimeStatusBadge status={card.status} />
+                  <TaskRuntimeStatusBadge status={card.status} className={`max-w-full ${containedLongTextClass}`} />
                ) : (
-                  <Badge variant="outline" className="shrink-0 text-[11px]">
+                  <Badge variant="outline" className={`max-w-full text-[11px] ${containedLongTextClass}`}>
                    {humanStatus(card.status, locale)}
                  </Badge>
                )
@ -224,7 +224,7 @@ export function TaskTimelineCard({ card, resultAcceptance, reviewTargetId }: Pro

            {card.type === 'result_history' ? <TaskResultHistory card={card} /> : card.details ? (
              <details className="mt-3 min-w-0 max-w-full overflow-hidden rounded-md border border-border bg-muted/20 px-3 py-2 text-xs">
-                <summary className="flex min-h-[44px] cursor-pointer select-none items-center font-medium text-muted-foreground">
+                <summary className="flex min-h-[44px] min-w-0 cursor-pointer select-none items-center font-medium text-muted-foreground">
                  {pickAppText(locale, '详情 JSON', 'Details JSON')}
                </summary>
                <pre className={`mt-2 max-h-72 overflow-auto text-[11px] leading-5 text-muted-foreground ${containedJsonTextClass}`}>
--- a/app-instance/frontend/components/task-runtime/TaskRuntimeShared.tsx
+++ b/app-instance/frontend/components/task-runtime/TaskRuntimeShared.tsx
@ -35,7 +35,7 @@ export function TaskRuntimeStatusBadge({
  );
 }

-export function formatTaskRuntimeTime(value?: string | null, locale: 'zh-CN' | 'en-US' = 'zh-CN'): string {
+export function formatTaskRuntimeTime(value?: string | null, locale: string = 'zh-CN'): string {
  if (!value) return '-';
  const date = new Date(value);
  if (Number.isNaN(date.getTime())) return value;
@ -47,7 +47,7 @@ export function formatTaskRuntimeTime(value?: string | null, locale: 'zh-CN' | '
  }).format(date);
 }

-export function formatTaskRuntimeDuration(durationMs: number | null, locale: 'zh-CN' | 'en-US' = 'zh-CN'): string {
+export function formatTaskRuntimeDuration(durationMs: number | null, locale: string = 'zh-CN'): string {
  if (durationMs === null || durationMs < 0) return '-';
  if (durationMs < 1000) return locale === 'en-US' ? '<1s' : '<1秒';

--- a/app-instance/frontend/components/ui/select.tsx
+++ b/app-instance/frontend/components/ui/select.tsx
@ -88,7 +88,7 @@ const SelectContent = React.forwardRef<
        className={cn(
          'p-1',
          position === 'popper' &&
-            'h-[var(--radix-select-trigger-height)] w-full min-w-[var(--radix-select-trigger-width)]'
+            'w-full min-w-[var(--radix-select-trigger-width)]'
        )}
      >
        {children}
--- a/app-instance/frontend/lib/api.ts
+++ b/app-instance/frontend/lib/api.ts
@ -51,7 +51,7 @@ import type {
  UiMcpServerDescriptor,
  WsEvent,
 } from '@/types';
-import { getCurrentAppLocale, pickAppText } from '@/lib/i18n/core';
+import { getCurrentAppLocale, pickAppText, type AppLocale } from '@/lib/i18n/core';

 const API_URL = process.env.NEXT_PUBLIC_API_URL?.trim();
 const WS_URL = process.env.NEXT_PUBLIC_WS_URL?.trim();
@ -62,6 +62,15 @@ const REQUEST_TIMEOUT_MS = 8000;
 const OUTLOOK_REQUEST_TIMEOUT_MS = 45000;
 const SKILL_LEARNING_REQUEST_TIMEOUT_MS = 120000;

+export type PromptLocale = 'zh-Hans' | 'zh-Hant' | 'en';
+
+export function promptLocaleForAppLocale(locale: AppLocale): PromptLocale {
+  if (locale === 'zh-Hant') {
+    return 'zh-Hant';
+  }
+  return locale === 'en-US' ? 'en' : 'zh-Hans';
+}
+
 function isBrowser(): boolean {
  return typeof window !== 'undefined';
 }
@ -271,6 +280,7 @@ export async function sendMessage(
    replyToScheduledRunId?: string;
    scheduledReplyIntent?: 'revise_once' | 'update_future' | 'continue_task';
    thinkingEnabled?: boolean;
+    promptLocale?: PromptLocale;
  }
 ): Promise<{
  response?: string;
@ -281,7 +291,11 @@ export async function sendMessage(
  task_status?: string | null;
  evidence_status?: string | null;
 }> {
-  const body: Record<string, unknown> = { message, session_id: sessionId };
+  const body: Record<string, unknown> = {
+    message,
+    session_id: sessionId,
+    prompt_locale: options?.promptLocale || promptLocaleForAppLocale(getCurrentAppLocale()),
+  };
  if (attachments && attachments.length > 0) {
    body.attachments = attachments;
  }
@ -356,7 +370,11 @@ export function streamMessage(
      const res = await fetch(buildApiUrl('/api/chat/stream'), {
        method: 'POST',
        headers: authHeaders(),
-        body: JSON.stringify({ message, session_id: sessionId }),
+        body: JSON.stringify({
+          message,
+          session_id: sessionId,
+          prompt_locale: promptLocaleForAppLocale(getCurrentAppLocale()),
+        }),
        signal: controller.signal,
      });
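The `prompt_locale` field added to both request bodies follows one rule: an explicit prompt locale wins, otherwise it is derived from the current app locale. A minimal standalone sketch of that rule follows. `toPromptLocale` repeats the mapping from `promptLocaleForAppLocale` so the block runs on its own; `buildChatBody` is a hypothetical helper, not a function in this change.

```typescript
type AppLocale = 'zh-CN' | 'en-US' | 'zh-Hant';
type PromptLocale = 'zh-Hans' | 'zh-Hant' | 'en';

// Same mapping as promptLocaleForAppLocale in the diff, repeated so the sketch is self-contained.
function toPromptLocale(locale: AppLocale): PromptLocale {
  if (locale === 'zh-Hant') {
    return 'zh-Hant';
  }
  return locale === 'en-US' ? 'en' : 'zh-Hans';
}

// Hypothetical helper: builds the JSON body with an optional explicit override.
function buildChatBody(
  message: string,
  sessionId: string,
  currentLocale: AppLocale,
  explicit?: PromptLocale,
): Record<string, unknown> {
  return {
    message,
    session_id: sessionId,
    // The explicit value wins; otherwise fall back to the UI locale.
    prompt_locale: explicit || toPromptLocale(currentLocale),
  };
}
```

In the diff itself only `sendMessage` accepts an explicit `promptLocale`; `streamMessage` always derives the value from `getCurrentAppLocale()`.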

--- a/app-instance/frontend/lib/i18n/core.test.ts
+++ b/app-instance/frontend/lib/i18n/core.test.ts
@ -0,0 +1,32 @@
+import { describe, expect, it } from 'vitest';
+
+import { isAppLocale, normalizeAppLocale, pickAppText } from '@/lib/i18n/core';
+
+describe('app locale normalization', () => {
+  it('accepts simplified Chinese, English, and traditional Chinese locales', () => {
+    expect(isAppLocale('zh-CN')).toBe(true);
+    expect(isAppLocale('en-US')).toBe(true);
+    expect(isAppLocale('zh-Hant')).toBe(true);
+  });
+
+  it('normalizes common traditional Chinese locale tags', () => {
+    expect(normalizeAppLocale('zh-TW')).toBe('zh-Hant');
+    expect(normalizeAppLocale('zh-HK')).toBe('zh-Hant');
+    expect(normalizeAppLocale('zh-Hant')).toBe('zh-Hant');
+  });
+});
+
+describe('app text picker', () => {
+  it('returns simplified Chinese text for zh-CN', () => {
+    expect(pickAppText('zh-CN', '任务状态', 'Task status')).toBe('任务状态');
+  });
+
+  it('returns English text for en-US', () => {
+    expect(pickAppText('en-US', '任务状态', 'Task status')).toBe('Task status');
+  });
+
+  it('returns traditional Chinese text for zh-Hant', () => {
+    expect(pickAppText('zh-Hant', '任务状态', 'Task status')).toBe('任務狀態');
+    expect(pickAppText('zh-Hant', '智能体结果', 'Agent results')).toBe('智慧體結果');
+  });
+});
--- a/app-instance/frontend/lib/i18n/core.ts
+++ b/app-instance/frontend/lib/i18n/core.ts
@ -1,12 +1,12 @@
 export const APP_LOCALE_COOKIE = 'beaver_locale';
 export const APP_LOCALE_STORAGE_KEY = 'beaver_locale';

-export const APP_LOCALES = ['zh-CN', 'en-US'] as const;
+export const APP_LOCALES = ['zh-CN', 'en-US', 'zh-Hant'] as const;

 export type AppLocale = (typeof APP_LOCALES)[number];

 export function isAppLocale(value: string | null | undefined): value is AppLocale {
-  return value === 'zh-CN' || value === 'en-US';
+  return value === 'zh-CN' || value === 'en-US' || value === 'zh-Hant';
 }

 export function normalizeAppLocale(value?: string | null): AppLocale {
@ -14,6 +14,14 @@ export function normalizeAppLocale(value?: string | null): AppLocale {
  if (probe.startsWith('en')) {
    return 'en-US';
  }
+  if (
+    probe.startsWith('zh-hant') ||
+    probe.startsWith('zh-tw') ||
+    probe.startsWith('zh-hk') ||
+    probe.startsWith('zh-mo')
+  ) {
+    return 'zh-Hant';
+  }
  return 'zh-CN';
 }

@ -71,6 +79,507 @@ export function getCurrentAppLocale(): AppLocale {
  return readBrowserAppLocale();
 }

-export function pickAppText<T>(locale: AppLocale, zhValue: T, enValue: T): T {
-  return locale === 'en-US' ? enValue : zhValue;
+export function pickAppText<T>(locale: string | null | undefined, zhValue: T, enValue: T): T {
+  const appLocale = normalizeAppLocale(locale);
+  if (appLocale === 'en-US') {
+    return enValue;
+  }
+  if (appLocale === 'zh-Hant') {
+    return toTraditionalValue(zhValue);
+  }
+  return zhValue;
+}
+
+function toTraditionalValue<T>(value: T): T {
+  return typeof value === 'string' ? (toTraditionalChinese(value) as T) : value;
+}
+
+const SIMPLIFIED_TO_TRADITIONAL_PHRASES: Array<[string, string]> = [
+  ['智能体', '智慧體'],
+  ['Agent Team', 'Agent Team'],
+];
+
+const SIMPLIFIED_TO_TRADITIONAL_CHARS: Record<string, string> = {
+  个: '個',
+  为: '為',
+  么: '麼',
+  义: '義',
+  习: '習',
+  书: '書',
+  了: '了',
+  于: '於',
+  云: '雲',
+  产: '產',
+  仅: '僅',
+  从: '從',
+  仓: '倉',
+  仪: '儀',
+  们: '們',
+  优: '優',
+  会: '會',
+  传: '傳',
+  体: '體',
+  余: '餘',
+  侧: '側',
+  侦: '偵',
+  促: '促',
+  俩: '倆',
+  值: '值',
+  假: '假',
+  做: '做',
+  停: '停',
+  储: '儲',
+  像: '像',
+  儿: '兒',
+  先: '先',
+  光: '光',
+  关: '關',
+  兴: '興',
+  具: '具',
+  内: '內',
+  册: '冊',
+  写: '寫',
+  军: '軍',
+  农: '農',
+  况: '況',
+  冻: '凍',
+  净: '淨',
+  准: '準',
+  几: '幾',
+  击: '擊',
+  划: '劃',
+  则: '則',
+  创: '創',
+  初: '初',
+  删: '刪',
+  别: '別',
+  到: '到',
+  制: '製',
+  剂: '劑',
+  剩: '剩',
+  办: '辦',
+  功: '功',
+  加: '加',
+  务: '務',
+  动: '動',
+  助: '助',
+  势: '勢',
+  包: '包',
+  区: '區',
+  协: '協',
+  单: '單',
+  卖: '賣',
+  占: '佔',
+  卡: '卡',
+  历: '歷',
+  压: '壓',
+  厕: '廁',
+  厢: '廂',
+  县: '縣',
+  参: '參',
+  双: '雙',
+  发: '發',
+  变: '變',
+  叠: '疊',
+  号: '號',
+  后: '後',
+  向: '向',
+  吗: '嗎',
+  启: '啟',
+  员: '員',
+  命: '命',
+  咨: '諮',
+  哑: '啞',
+  响: '響',
+  唤: '喚',
+  问: '問',
+  單: '單',
+  喂: '餵',
+  器: '器',
+  团: '團',
+  园: '園',
+  困: '困',
+  图: '圖',
+  场: '場',
+  块: '塊',
+  坏: '壞',
+  址: '址',
+  坚: '堅',
+  坛: '壇',
+  型: '型',
+  垃: '垃',
+  域: '域',
+  堆: '堆',
+  填: '填',
+  增: '增',
+  墙: '牆',
+  声: '聲',
+  处: '處',
+  备: '備',
+  复: '復',
+  够: '夠',
+  头: '頭',
+  奖: '獎',
+  好: '好',
+  如: '如',
+  始: '始',
+  委: '委',
+  存: '存',
+  学: '學',
+  宁: '寧',
+  它: '它',
+  安: '安',
+  完: '完',
+  实: '實',
+  审: '審',
+  客: '客',
+  宪: '憲',
+  宽: '寬',
+  对: '對',
+  导: '導',
+  将: '將',
+  尔: '爾',
+  尝: '嘗',
+  层: '層',
+  属: '屬',
+  岁: '歲',
+  岛: '島',
+  州: '州',
+  工: '工',
+  币: '幣',
+  师: '師',
+  帐: '帳',
+  带: '帶',
+  帮: '幫',
+  干: '乾',
+  并: '並',
+  广: '廣',
+  庆: '慶',
+  库: '庫',
+  应: '應',
+  废: '廢',
+  开: '開',
+  异: '異',
+  弃: '棄',
+  张: '張',
+  强: '強',
+  归: '歸',
+  当: '當',
+  录: '錄',
+  彻: '徹',
+  径: '徑',
+  待: '待',
+  循: '循',
+  忆: '憶',
+  志: '誌',
+  忧: '憂',
+  念: '念',
+  态: '態',
+  总: '總',
+  恢: '恢',
+  息: '息',
+  您: '您',
+  情: '情',
+  想: '想',
+  意: '意',
+  愿: '願',
+  戏: '戲',
+  战: '戰',
+  户: '戶',
+  执: '執',
+  扩: '擴',
+  扫: '掃',
+  扬: '揚',
+  批: '批',
+  找: '找',
+  技: '技',
+  报: '報',
+  护: '護',
+  抽: '抽',
+  担: '擔',
+  拥: '擁',
+  择: '擇',
+  按: '按',
+  挥: '揮',
+  换: '換',
+  损: '損',
+  据: '據',
+  授: '授',
+  掉: '掉',
+  接: '接',
+  控: '控',
+  推: '推',
+  提: '提',
+  插: '插',
+  揭: '揭',
+  搜: '搜',
+  携: '攜',
+  摄: '攝',
+  摘: '摘',
+  播: '播',
+  操: '操',
+  支: '支',
+  收: '收',
+  改: '改',
+  放: '放',
+  效: '效',
+  数: '數',
+  文: '文',
+  断: '斷',
+  新: '新',
+  无: '無',
+  时: '時',
+  明: '明',
+  显: '顯',
+  智: '智',
+  暂: '暫',
+  更: '更',
+  替: '替',
+  术: '術',
+  机: '機',
+  权: '權',
+  条: '條',
+  来: '來',
+  极: '極',
+  构: '構',
+  标: '標',
+  栏: '欄',
+  树: '樹',
+  样: '樣',
+  核: '核',
+  案: '案',
+  档: '檔',
+  检: '檢',
+  楼: '樓',
+  次: '次',
+  款: '款',
+  步: '步',
+  残: '殘',
+  段: '段',
+  毕: '畢',
+  气: '氣',
+  汇: '匯',
+  汉: '漢',
+  没: '沒',
+  法: '法',
+  注: '註',
+  泄: '洩',
+  测: '測',
+  浏: '瀏',
+  消: '消',
+  涉: '涉',
+  涨: '漲',
+  润: '潤',
+  添: '添',
+  清: '清',
+  渠: '渠',
+  渲: '渲',
+  温: '溫',
+  滚: '滾',
+  满: '滿',
+  漏: '漏',
+  演: '演',
+  点: '點',
+  烦: '煩',
+  热: '熱',
+  然: '然',
+  照: '照',
+  爱: '愛',
+  父: '父',
+  片: '片',
+  版: '版',
+  状: '狀',
+  独: '獨',
+  环: '環',
+  现: '現',
+  理: '理',
+  画: '畫',
+  畅: '暢',
+  疗: '療',
+  登: '登',
+  监: '監',
+  盘: '盤',
+  码: '碼',
+  础: '礎',
+  确: '確',
+  碍: '礙',
+  礼: '禮',
+  离: '離',
+  种: '種',
+  称: '稱',
+  稳: '穩',
+  窗: '窗',
+  笔: '筆',
+  签: '簽',
+  简: '簡',
+  算: '算',
+  管: '管',
+  类: '類',
+  粘: '黏',
+  精: '精',
+  系: '系',
+  级: '級',
+  线: '線',
+  组: '組',
+  细: '細',
+  终: '終',
+  经: '經',
+  结: '結',
+  绝: '絕',
+  统: '統',
+  维: '維',
+  缓: '緩',
+  编: '編',
+  缩: '縮',
+  缺: '缺',
+  网: '網',
+  置: '置',
+  联: '聯',
+  聊: '聊',
+  肃: '肅',
+  背: '背',
+  能: '能',
+  脚: '腳',
+  脱: '脫',
+  脑: '腦',
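+  自动: '自動', // multi-character key: never matched by the per-character lookup in toTraditionalChinese (动 is mapped on its own, 自 is unchanged)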
+  舰: '艦',
+  艺: '藝',
+  节: '節',
+  范: '範',
+  荐: '薦',
+  获: '獲',
+  营: '營',
+  落: '落',
+  著: '著',
+  藏: '藏',
+  虑: '慮',
+  虚: '虛',
+  虽: '雖',
+  行: '行',
+  补: '補',
+  表: '表',
+  装: '裝',
+  规: '規',
+  视: '視',
+  觉: '覺',
+  览: '覽',
+  计: '計',
+  订: '訂',
+  认: '認',
+  议: '議',
+  讯: '訊',
+  记: '記',
+  讲: '講',
+  许: '許',
+  论: '論',
+  设: '設',
+  访: '訪',
+  证: '證',
+  评: '評',
+  识: '識',
+  诉: '訴',
+  试: '試',
+  话: '話',
+  详: '詳',
+  语: '語',
+  误: '誤',
+  请: '請',
+  读: '讀',
+  调: '調',
+  谈: '談',
+  谢: '謝',
+  谷: '谷',
+  账: '帳',
+  负: '負',
+  责: '責',
+  败: '敗',
+  货: '貨',
+  质: '質',
+  资: '資',
+  赃: '贓',
+  起: '起',
+  超: '超',
+  跃: '躍',
+  路: '路',
+  踪: '蹤',
+  车: '車',
+  轮: '輪',
+  软: '軟',
+  载: '載',
+  辑: '輯',
+  输: '輸',
+  边: '邊',
+  达: '達',
+  过: '過',
+  还: '還',
+  这: '這',
+  进: '進',
+  远: '遠',
+  连: '連',
+  迟: '遲',
+  适: '適',
+  选: '選',
+  递: '遞',
+  通: '通',
+  逻: '邏',
+  遗: '遺',
+  遥: '遙',
+  邀: '邀',
+  邮: '郵',
+  部: '部',
+  配: '配',
+  释: '釋',
+  重: '重',
+  针: '針',
+  钥: '鑰',
+  钟: '鐘',
+  钮: '鈕',
+  钱: '錢',
+  链: '鏈',
+  错: '錯',
+  键: '鍵',
+  镜: '鏡',
+  长: '長',
+  门: '門',
+  闭: '閉',
+  间: '間',
+  队: '隊',
+  阶: '階',
+  阳: '陽',
+  阴: '陰',
+  陈: '陳',
+  际: '際',
+  隐: '隱',
+  难: '難',
+  雏: '雛',
+  需: '需',
+  面: '面',
+  页: '頁',
+  项: '項',
+  顺: '順',
+  须: '須',
+  预: '預',
+  题: '題',
+  颜: '顏',
+  风: '風',
+  飞: '飛',
+  馆: '館',
+  验: '驗',
+  高: '高',
+  鱼: '魚',
+  鲜: '鮮',
+  鸟: '鳥',
+  麦: '麥',
+  黄: '黃',
+};
+
+export function toTraditionalChinese(value: string): string {
+  let converted = value;
+  for (const [source, target] of SIMPLIFIED_TO_TRADITIONAL_PHRASES) {
+    converted = converted.split(source).join(target);
+  }
+  return Array.from(converted)
+    .map((char) => SIMPLIFIED_TO_TRADITIONAL_CHARS[char] ?? char)
+    .join('');
 }
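`toTraditionalChinese` converts in two passes: whole-phrase replacements first, then a per-character lookup. The sketch below shows that order in isolation. It uses tiny hypothetical tables (`PHRASES`, `CHARS`) and a hypothetical function name (`convertSketch`) instead of the real maps above.

```typescript
// Hypothetical, trimmed-down tables for illustration only.
const PHRASES: Array<[string, string]> = [['智能体', '智慧體']];
const CHARS: Record<string, string> = {
  '体': '體',
  '结': '結',
  '状': '狀',
  '态': '態',
};

function convertSketch(value: string): string {
  let converted = value;
  // Pass 1: replace whole phrases whose wording differs, not just the glyphs.
  for (const [source, target] of PHRASES) {
    converted = converted.split(source).join(target);
  }
  // Pass 2: map the remaining characters one by one; unknown characters pass through.
  return Array.from(converted)
    .map((char) => CHARS[char] ?? char)
    .join('');
}
```

Running phrases first matters here: with only the character pass, `智能体` would come out as `智能體` rather than `智慧體`. Because the second pass looks up one character at a time, a multi-character key in the character table can never match, so such entries belong in the phrase list. The same limit applies to simplified characters with more than one traditional form, such as 复 (復 or 複), which a character table alone cannot disambiguate.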
--- a/app-instance/frontend/lib/task-timeline-view.test.ts
+++ b/app-instance/frontend/lib/task-timeline-view.test.ts
@ -40,9 +40,11 @@ describe('buildTaskTimelineView', () => {
    const view = buildTaskTimelineView({
      task: task(),
      liveEvents,
+      locale: 'en-US',
    });

    expect(view?.cards.map((card) => card.type)).toEqual(['task_created', 'plan']);
+    expect(view?.cards.map((card) => card.title)).toEqual(['Task created', 'Execution plan']);
    expect(view?.process.events.map((event) => event.event_id)).toEqual(['plan']);
  });

--- a/app-instance/frontend/lib/task-timeline-view.ts
+++ b/app-instance/frontend/lib/task-timeline-view.ts
@ -1,9 +1,11 @@
 import { selectTaskProcess, type SelectTaskProcessInput, type TaskProcessSelection } from '@/lib/task-process';
 import { buildTaskTimelineCards } from '@/lib/task-timeline';
+import type { AppLocale } from '@/lib/i18n/core';
 import type { BackendTask, TaskTimelineCard } from '@/types';

 export type BuildTaskTimelineViewInput = Omit<SelectTaskProcessInput, 'task'> & {
  task: BackendTask | null;
+  locale?: AppLocale | string;
 };

 export type TaskTimelineView = {
@ -16,6 +18,7 @@ export function buildTaskTimelineView({
  liveRuns,
  liveEvents,
  liveArtifacts,
+  locale,
 }: BuildTaskTimelineViewInput): TaskTimelineView | null {
  if (!task) return null;

@ -32,6 +35,7 @@ export function buildTaskTimelineView({
      processRuns: process.runs,
      processEvents: process.events,
      processArtifacts: process.artifacts,
+      locale,
    }),
  };
 }
--- a/app-instance/frontend/lib/task-timeline.test.ts
+++ b/app-instance/frontend/lib/task-timeline.test.ts
@ -143,6 +143,48 @@ describe('buildTaskTimelineCards', () => {
    expect(cards[6].relatedArtifactIds).toEqual(['artifact-summary']);
  });

+  it('localizes generated milestone titles for English and Traditional Chinese', () => {
+    const task = makeTask();
+    const processEvents: ProcessEvent[] = [
+      {
+        event_id: 'evt-plan',
+        run_id: 'run-main',
+        parent_run_id: null,
+        kind: 'task_planned',
+        actor_type: 'agent',
+        actor_id: 'planner',
+        actor_name: 'Task Planner',
+        text: 'Plan created.',
+        created_at: '2026-05-26T10:01:00.000Z',
+      },
+      {
+        event_id: 'evt-tool-start',
+        run_id: 'run-main',
+        parent_run_id: null,
+        kind: 'tool_call_started',
+        actor_type: 'mcp',
+        actor_id: 'user_files_list',
+        actor_name: 'user_files_list',
+        text: 'Calling tool: user_files_list.',
+        created_at: '2026-05-26T10:02:00.000Z',
+      },
+    ];
+
+    const englishCards = buildTaskTimelineCards({ task, processEvents, locale: 'en-US' });
+    const traditionalCards = buildTaskTimelineCards({ task, processEvents, locale: 'zh-Hant' });
+
+    expect(englishCards.map((card) => card.title)).toEqual([
+      'Task created',
+      'Execution plan',
+      'Calling tool: user_files_list',
+    ]);
+    expect(traditionalCards.map((card) => card.title)).toEqual([
+      '任務已創建',
+      '執行計劃',
+      '調用工具：user_files_list',
+    ]);
+  });
+
  it('appends result and acceptance cards for closed tasks with feedback', () => {
    const task = makeTask({
      is_open: false,
--- a/app-instance/frontend/lib/task-timeline.ts
+++ b/app-instance/frontend/lib/task-timeline.ts
@ -6,12 +6,14 @@ import type {
  TaskTimelineCard,
  TaskTimelineCardType,
 } from '@/types';
+import { getCurrentAppLocale, pickAppText, type AppLocale } from '@/lib/i18n/core';

 export type BuildTaskTimelineCardsInput = {
  task: BackendTask;
  processRuns?: ProcessRun[];
  processEvents?: ProcessEvent[];
  processArtifacts?: ProcessArtifact[];
+  locale?: AppLocale | string;
 };

 const TIMELINE_CARD_TYPES = new Set<TaskTimelineCardType>([
@ -110,36 +112,40 @@ function cardTypeForEvent(event: ProcessEvent): TaskTimelineCardType | null {
  }
 }

-function titleForCard(type: TaskTimelineCardType, actorName?: string): string {
+function titleForCard(type: TaskTimelineCardType, actorName?: string, locale: AppLocale | string = getCurrentAppLocale()): string {
  switch (type) {
    case 'task_created':
-      return '任务已创建';
+      return pickAppText(locale, '任务已创建', 'Task created');
    case 'plan':
-      return '执行计划';
+      return pickAppText(locale, '执行计划', 'Execution plan');
    case 'skill':
-      return '选择 Skill';
+      return pickAppText(locale, '选择 Skill', 'Skill selected');
    case 'tool_call':
-      return actorName ? `调用工具：${actorName}` : '调用工具';
+      return actorName
+        ? pickAppText(locale, `调用工具：${actorName}`, `Calling tool: ${actorName}`)
+        : pickAppText(locale, '调用工具', 'Tool call');
    case 'tool_result':
-      return actorName ? `工具结果：${actorName}` : '工具结果';
+      return actorName
+        ? pickAppText(locale, `工具结果：${actorName}`, `Tool result: ${actorName}`)
+        : pickAppText(locale, '工具结果', 'Tool result');
    case 'next_step':
-      return '下一步';
+      return pickAppText(locale, '下一步', 'Next step');
    case 'agent_team':
-      return '启动 Agent Team';
+      return pickAppText(locale, '启动 Agent Team', 'Agent team started');
    case 'agent_progress':
-      return actorName || 'Agent 进展';
+      return actorName || pickAppText(locale, 'Agent 进展', 'Agent progress');
    case 'agent_handoff':
-      return 'Agent 交接';
+      return pickAppText(locale, 'Agent 交接', 'Agent handoff');
    case 'artifact':
-      return '生成产物';
+      return pickAppText(locale, '生成产物', 'Artifact generated');
    case 'error':
-      return '执行遇到问题';
+      return pickAppText(locale, '执行遇到问题', 'Execution issue');
    case 'result':
-      return '本轮结果';
+      return pickAppText(locale, '本轮结果', 'Run result');
    case 'result_history':
-      return '历史结果版本';
+      return pickAppText(locale, '历史结果版本', 'Previous result versions');
    case 'acceptance':
-      return '任务验收';
+      return pickAppText(locale, '任务验收', 'Task acceptance');
  }
 }

@ -286,7 +292,12 @@ function buildToolResultStatusByCall(processEvents: ProcessEvent[]): Map<string,
  return statuses;
 }

-function buildResultHistoryCard(task: BackendTask, resultCards: TaskTimelineCard[], acceptanceCards: TaskTimelineCard[]): TaskTimelineCard {
+function buildResultHistoryCard(
+  task: BackendTask,
+  resultCards: TaskTimelineCard[],
+  acceptanceCards: TaskTimelineCard[],
+  locale: AppLocale | string,
+): TaskTimelineCard {
  const versions = resultCards.map((resultCard) => {
    const acceptanceCard = acceptanceCards
      .filter((card) => card.runId === resultCard.runId)
@ -307,14 +318,18 @@ function buildResultHistoryCard(task: BackendTask, resultCards: TaskTimelineCard
    id: `${task.task_id}:result-history`,
    taskId: task.task_id,
    type: 'result_history',
-    title: titleForCard('result_history'),
-    summary: `${resultCards.length} 历史结果版本`,
+    title: titleForCard('result_history', undefined, locale),
+    summary: pickAppText(
+      locale,
+      `${resultCards.length} 历史结果版本`,
+      `${resultCards.length} previous result ${resultCards.length === 1 ? 'version' : 'versions'}`,
+    ),
    createdAt: resultCards[0]?.createdAt ?? task.created_at,
    details: { versions },
  };
 }

-function collapseHistoricalResults(task: BackendTask, cards: TaskTimelineCard[]): TaskTimelineCard[] {
+function collapseHistoricalResults(task: BackendTask, cards: TaskTimelineCard[], locale: AppLocale | string): TaskTimelineCard[] {
  const resultCards = cards.filter((card) => card.type === 'result');
  if (resultCards.length <= 1) return cards;

@ -334,7 +349,7 @@ function collapseHistoricalResults(task: BackendTask, cards: TaskTimelineCard[])
    .filter((card) => card.type === 'acceptance' && oldRunIds.has(card.runId))
    .sort((a, b) => cardTime(a) - cardTime(b));
  const foldedIds = new Set([...oldResults, ...oldAcceptances].map((card) => card.id));
-  const historyCard = buildResultHistoryCard(task, oldResults, oldAcceptances);
+  const historyCard = buildResultHistoryCard(task, oldResults, oldAcceptances, locale);
  const firstOldResultIndex = cards.findIndex((card) => card.id === oldResults[0].id);
  const output: TaskTimelineCard[] = [];

@ -352,6 +367,7 @@ function collapseHistoricalResults(task: BackendTask, cards: TaskTimelineCard[])

 export function buildTaskTimelineCards(input: BuildTaskTimelineCardsInput): TaskTimelineCard[] {
  const { task } = input;
+  const locale = input.locale ?? getCurrentAppLocale();
  const processRuns = input.processRuns ?? task.process_runs ?? [];
  const processEvents = input.processEvents ?? task.process_events ?? [];
  const processArtifacts = input.processArtifacts ?? task.process_artifacts ?? [];
@ -365,7 +381,7 @@ export function buildTaskTimelineCards(input: BuildTaskTimelineCardsInput): Task
      id: `${task.task_id}:created`,
      taskId: task.task_id,
      type: 'task_created',
-      title: titleForCard('task_created'),
+      title: titleForCard('task_created', undefined, locale),
      summary: firstString(task.short_title, task.description, task.goal),
      actorName: task.creator,
      status: task.status,
@ -396,7 +412,7 @@ export function buildTaskTimelineCards(input: BuildTaskTimelineCardsInput): Task
      runId: event.run_id,
      parentRunId: event.parent_run_id,
      type,
-      title: titleForCard(type, event.actor_name),
+      title: titleForCard(type, event.actor_name, locale),
      summary: type === 'result' ? resultSummaryForEvent(task, event) : summaryForEvent(event),
      actorName: event.actor_name,
      status:
@ -418,7 +434,7 @@ export function buildTaskTimelineCards(input: BuildTaskTimelineCardsInput): Task
      runId: run.run_id,
      parentRunId: run.parent_run_id,
      type: 'agent_progress',
-      title: titleForCard('agent_progress', run.actor_name),
+      title: titleForCard('agent_progress', run.actor_name, locale),
      summary: firstString(run.summary, run.title),
      actorName: run.actor_name,
      status: run.status,
@ -435,7 +451,7 @@ export function buildTaskTimelineCards(input: BuildTaskTimelineCardsInput): Task
      runId: artifact.run_id,
      parentRunId: run?.parent_run_id,
      type: 'artifact',
-      title: titleForCard('artifact'),
+      title: titleForCard('artifact', undefined, locale),
      summary: firstString(artifact.title),
      actorName: artifact.actor_name,
      createdAt: artifact.created_at,
@ -454,7 +470,7 @@ export function buildTaskTimelineCards(input: BuildTaskTimelineCardsInput): Task
      taskId: task.task_id,
      runId: lastItem(task.run_ids),
      type: 'result',
-      title: titleForCard('result'),
+      title: titleForCard('result', undefined, locale),
      summary: fallbackResultSummary(task),
      status: task.status,
      createdAt: task.closed_at ?? task.updated_at ?? task.created_at,
@ -473,7 +489,7 @@ export function buildTaskTimelineCards(input: BuildTaskTimelineCardsInput): Task
      taskId: task.task_id,
      runId,
      type: 'acceptance',
-      title: titleForCard('acceptance'),
+      title: titleForCard('acceptance', undefined, locale),
      summary: feedbackSummary(feedback),
      status: firstString(feedback.acceptance_type),
      createdAt,
@ -486,5 +502,5 @@ export function buildTaskTimelineCards(input: BuildTaskTimelineCardsInput): Task
    .sort(compareCardsByCreatedAt)
    .map(({ card }) => card);

-  return collapseHistoricalResults(task, sortedCards);
+  return collapseHistoricalResults(task, sortedCards, locale);
 }
--- a/app-instance/frontend/types/index.ts
+++ b/app-instance/frontend/types/index.ts
@ -985,6 +985,15 @@ export interface SkillDraftEvalReport {
  cases: Array<Record<string, unknown>>;
  status: string;
  created_at: string;
+  eval_version?: string;
+  mode?: 'heuristic' | 'replay' | string;
+  execution_coverage?: number;
+  surrogate_coverage?: number;
+  blocked_coverage?: number;
+  confidence?: 'low' | 'medium' | 'high' | string;
+  case_reports?: Array<Record<string, unknown>>;
+  tool_mode_summary?: Record<string, unknown>;
+  preservation_report?: Record<string, unknown> | null;
 }

 export interface SkillDraft {
--- a/deploy-control/.env.example
+++ b/deploy-control/.env.example
@ -16,6 +16,7 @@ APP_INSTANCE_API_BASE=
 DEFAULT_AUTHZ_BASE_URL=http://beaver-authz-service:19090
 DEFAULT_AUTHZ_OUTLOOK_MCP_URL=
 DEFAULT_OUTLOOK_MCP_SERVER_ID=outlook_mcp
+DEFAULT_OUTLOOK_MCP_CALL_TIMEOUT_SECONDS=60
 DEFAULT_USER_FILES_MAX_UPLOAD_BYTES=5368709120
 DEFAULT_EXTERNAL_CONNECTOR_BASE_URL=http://external-connector:8787
 DEFAULT_EXTERNAL_CONNECTOR_TOKEN=
--- a/deploy-control/README.md
+++ b/deploy-control/README.md
@ -20,6 +20,7 @@
 - `DEFAULT_AUTHZ_BASE_URL`
 - `DEFAULT_AUTHZ_OUTLOOK_MCP_URL`
 - `DEFAULT_OUTLOOK_MCP_SERVER_ID`
+- `DEFAULT_OUTLOOK_MCP_CALL_TIMEOUT_SECONDS`
 - `DEPLOY_PUBLIC_BASE_DOMAIN`
 - `DEPLOY_PUBLIC_PORT`
 - `DEPLOY_PUBLIC_SCHEME`
@ -42,6 +43,7 @@ http://<instance-slug>.localhost:8088
 ```bash
 DEFAULT_AUTHZ_OUTLOOK_MCP_URL=http://10.6.80.29:8000/mcp
 DEFAULT_OUTLOOK_MCP_SERVER_ID=outlook_mcp
+DEFAULT_OUTLOOK_MCP_CALL_TIMEOUT_SECONDS=60
 ```

 With this, new instances created by `deploy-control` automatically get a default MCP server configuration entry, which by default uses `oauth_backend_token` with an audience of `mcp:<server_id>`.
--- a/deploy-control/server.py
+++ b/deploy-control/server.py
@ -41,6 +41,9 @@ DEFAULT_AUTHZ_BASE_URL = os.environ.get("DEFAULT_AUTHZ_BASE_URL", "").strip()
 DEFAULT_AUTHZ_INTERNAL_TOKEN = os.environ.get("DEFAULT_AUTHZ_INTERNAL_TOKEN", "").strip()
 DEFAULT_AUTHZ_OUTLOOK_MCP_URL = os.environ.get("DEFAULT_AUTHZ_OUTLOOK_MCP_URL", "").strip()
 DEFAULT_OUTLOOK_MCP_SERVER_ID = os.environ.get("DEFAULT_OUTLOOK_MCP_SERVER_ID", "outlook_mcp").strip() or "outlook_mcp"
+DEFAULT_OUTLOOK_MCP_CALL_TIMEOUT_SECONDS = (
+    os.environ.get("DEFAULT_OUTLOOK_MCP_CALL_TIMEOUT_SECONDS", "60").strip() or "60"
+)
 DEFAULT_USER_FILES_MAX_UPLOAD_BYTES = os.environ.get("DEFAULT_USER_FILES_MAX_UPLOAD_BYTES", "").strip()
 DEFAULT_EXTERNAL_CONNECTOR_BASE_URL = os.environ.get(
    "DEFAULT_EXTERNAL_CONNECTOR_BASE_URL",
@ -279,6 +282,7 @@ def create_or_get_instance(payload: dict[str, Any]) -> dict[str, Any]:
        if authz_outlook_mcp_url:
            command.extend(["--authz-outlook-mcp-url", authz_outlook_mcp_url])
            command.extend(["--outlook-mcp-server-id", DEFAULT_OUTLOOK_MCP_SERVER_ID])
+            command.extend(["--outlook-mcp-call-timeout-seconds", DEFAULT_OUTLOOK_MCP_CALL_TIMEOUT_SECONDS])
        if DEFAULT_USER_FILES_MAX_UPLOAD_BYTES:
            command.extend(["--user-files-max-upload-bytes", DEFAULT_USER_FILES_MAX_UPLOAD_BYTES])
        if DEFAULT_EXTERNAL_CONNECTOR_BASE_URL:
--- a/deploy-control/tests/test_connector_instance_config.py
+++ b/deploy-control/tests/test_connector_instance_config.py
@ -35,6 +35,8 @@ def test_new_instance_receives_external_connector_configuration(monkeypatch) ->
    monkeypatch.setattr(server, "DEFAULT_EXTERNAL_CONNECTOR_TOKEN", "connector-token")
    monkeypatch.setattr(server, "DEFAULT_BEAVER_BRIDGE_TOKEN", "bridge-token")
    monkeypatch.setattr(server, "DEFAULT_INITIAL_SKILLS_DIR", "/srv/beaver/skills")
+    monkeypatch.setattr(server, "DEFAULT_AUTHZ_OUTLOOK_MCP_URL", "http://bw-outlook-mcp:8000/mcp")
+    monkeypatch.setattr(server, "DEFAULT_OUTLOOK_MCP_CALL_TIMEOUT_SECONDS", "60")

    def capture_command(args: list[str], **_kwargs: Any) -> str:
        commands.append(args)
@ -55,4 +57,5 @@ def test_new_instance_receives_external_connector_configuration(monkeypatch) ->
    assert command[command.index("--external-connector-token") + 1] == "connector-token"
    assert command[command.index("--bridge-token") + 1] == "bridge-token"
    assert command[command.index("--initial-skills-dir") + 1] == "/srv/beaver/skills"
+    assert command[command.index("--outlook-mcp-call-timeout-seconds") + 1] == "60"
    assert result["created"] is True
--- a/docs/presentations/skill-replay-eval/assets/animations/animations.css
+++ b/docs/presentations/skill-replay-eval/assets/animations/animations.css
@ -0,0 +1,138 @@
+/* html-ppt :: animations.css
+ * Apply by adding class="anim-<name>" or data-anim="<name>".
+ * Durations are deliberately snappy; tweak --anim-dur per element.
+ */
+:root{--anim-dur:.7s;--anim-ease:cubic-bezier(.4,0,.2,1)}
+
+/* ---------- FADE DIRECTIONALS ---------- */
+@keyframes kf-fade-up{from{opacity:0;transform:translateY(32px)}to{opacity:1;transform:none}}
+@keyframes kf-fade-down{from{opacity:0;transform:translateY(-32px)}to{opacity:1;transform:none}}
+@keyframes kf-fade-left{from{opacity:0;transform:translateX(-40px)}to{opacity:1;transform:none}}
+@keyframes kf-fade-right{from{opacity:0;transform:translateX(40px)}to{opacity:1;transform:none}}
+.anim-fade-up{animation:kf-fade-up var(--anim-dur) var(--anim-ease) both}
+.anim-fade-down{animation:kf-fade-down var(--anim-dur) var(--anim-ease) both}
+.anim-fade-left{animation:kf-fade-left var(--anim-dur) var(--anim-ease) both}
+.anim-fade-right{animation:kf-fade-right var(--anim-dur) var(--anim-ease) both}
+
+/* ---------- RISE / DROP / ZOOM / BLUR / GLITCH ---------- */
+@keyframes kf-rise{from{opacity:0;transform:translateY(60px) scale(.97);filter:blur(6px)}to{opacity:1;transform:none;filter:none}}
+@keyframes kf-drop{from{opacity:0;transform:translateY(-60px) scale(.97)}to{opacity:1;transform:none}}
+@keyframes kf-zoom{0%{opacity:0;transform:scale(.6)}60%{transform:scale(1.04)}100%{opacity:1;transform:scale(1)}}
+@keyframes kf-blur{from{opacity:0;filter:blur(18px)}to{opacity:1;filter:none}}
+@keyframes kf-glitch{0%{opacity:0;transform:translateX(0);clip-path:inset(0 0 0 0)}
+  20%{opacity:1;transform:translateX(-6px);clip-path:inset(20% 0 30% 0)}
+  40%{transform:translateX(4px);clip-path:inset(50% 0 10% 0)}
+  60%{transform:translateX(-3px);clip-path:inset(10% 0 60% 0)}
+  80%{transform:translateX(2px);clip-path:inset(0 0 0 0)}
+  100%{opacity:1;transform:none}}
+.anim-rise-in{animation:kf-rise .9s var(--anim-ease) both}
+.anim-drop-in{animation:kf-drop .8s var(--anim-ease) both}
+.anim-zoom-pop{animation:kf-zoom .7s cubic-bezier(.22,1.3,.36,1) both}
+.anim-blur-in{animation:kf-blur .8s var(--anim-ease) both}
+.anim-glitch-in{animation:kf-glitch .8s steps(5,end) both}
+
+/* ---------- TYPEWRITER ---------- */
+.anim-typewriter{display:inline-block;overflow:hidden;white-space:nowrap;border-right:2px solid currentColor;
+  width:0;animation:kf-type 2.4s steps(40,end) forwards, kf-caret 1s step-end infinite}
+@keyframes kf-type{to{width:100%}}
+@keyframes kf-caret{50%{border-color:transparent}}
+
+/* ---------- GLOW / SHIMMER / GRADIENT-FLOW ---------- */
+@keyframes kf-neon{0%,100%{text-shadow:0 0 8px var(--accent),0 0 20px var(--accent)}
+  50%{text-shadow:0 0 16px var(--accent),0 0 40px var(--accent),0 0 80px var(--accent)}}
+.anim-neon-glow{animation:kf-neon 2s ease-in-out infinite}
+
+.anim-shimmer-sweep{position:relative;overflow:hidden}
+.anim-shimmer-sweep::after{content:"";position:absolute;inset:0;
+  background:linear-gradient(110deg,transparent 40%,rgba(255,255,255,.55) 50%,transparent 60%);
+  transform:translateX(-100%);animation:kf-shimmer 2.4s var(--anim-ease) infinite}
+@keyframes kf-shimmer{to{transform:translateX(100%)}}
+
+.anim-gradient-flow{background:linear-gradient(90deg,var(--accent),var(--accent-2,var(--accent)),var(--accent-3,var(--accent)),var(--accent));
+  background-size:300% 100%;-webkit-background-clip:text;background-clip:text;color:transparent;-webkit-text-fill-color:transparent;
+  animation:kf-gradflow 4s linear infinite}
+@keyframes kf-gradflow{to{background-position:300% 0}}
+
+/* ---------- STAGGER LIST ---------- */
+.anim-stagger-list > *{opacity:0;animation:kf-rise .65s var(--anim-ease) both}
+.anim-stagger-list > *:nth-child(1){animation-delay:.05s}
+.anim-stagger-list > *:nth-child(2){animation-delay:.15s}
+.anim-stagger-list > *:nth-child(3){animation-delay:.25s}
+.anim-stagger-list > *:nth-child(4){animation-delay:.35s}
+.anim-stagger-list > *:nth-child(5){animation-delay:.45s}
+.anim-stagger-list > *:nth-child(6){animation-delay:.55s}
+.anim-stagger-list > *:nth-child(7){animation-delay:.65s}
+.anim-stagger-list > *:nth-child(8){animation-delay:.75s}
+.anim-stagger-list > *:nth-child(n+9){animation-delay:.85s}
+
+/* ---------- COUNTER-UP (JS-driven, marker class only) ---------- */
+.counter{font-variant-numeric:tabular-nums}
+
+/* ---------- SVG PATH DRAW ---------- */
+.anim-path-draw path,.anim-path-draw line,.anim-path-draw polyline,.anim-path-draw circle,.anim-path-draw rect{
+  stroke-dasharray:1000;stroke-dashoffset:1000;animation:kf-draw 2s var(--anim-ease) forwards}
+@keyframes kf-draw{to{stroke-dashoffset:0}}
+
+/* ---------- PARALLAX TILT (hover) ---------- */
+.anim-parallax-tilt{transform-style:preserve-3d;transition:transform .4s var(--anim-ease)}
+.anim-parallax-tilt:hover{transform:perspective(900px) rotateX(6deg) rotateY(-8deg) translateZ(10px)}
+
+/* ---------- CARD FLIP 3D ---------- */
+@keyframes kf-flip{from{transform:perspective(1200px) rotateY(-90deg);opacity:0}
+  to{transform:perspective(1200px) rotateY(0);opacity:1}}
+.anim-card-flip-3d{animation:kf-flip .9s var(--anim-ease) both;transform-style:preserve-3d;backface-visibility:hidden}
+
+/* ---------- CUBE ROTATE 3D ---------- */
+@keyframes kf-cube{from{transform:perspective(1200px) rotateX(20deg) rotateY(-90deg) translateZ(-200px);opacity:0}
+  to{transform:perspective(1200px) rotateX(0) rotateY(0) translateZ(0);opacity:1}}
+.anim-cube-rotate-3d{animation:kf-cube 1s var(--anim-ease) both}
+
+/* ---------- PAGE TURN 3D ---------- */
+@keyframes kf-pageturn{from{transform:perspective(1600px) rotateY(-85deg);transform-origin:left center;opacity:0}
+  to{transform:perspective(1600px) rotateY(0);opacity:1}}
+.anim-page-turn-3d{animation:kf-pageturn 1s var(--anim-ease) both;transform-origin:left center}
+
+/* ---------- PERSPECTIVE ZOOM ---------- */
+@keyframes kf-pzoom{from{opacity:0;transform:perspective(1400px) translateZ(-400px) rotateX(12deg)}
+  to{opacity:1;transform:none}}
+.anim-perspective-zoom{animation:kf-pzoom 1s var(--anim-ease) both}
+
+/* ---------- MARQUEE SCROLL ---------- */
+.anim-marquee-scroll{display:flex;gap:48px;white-space:nowrap;animation:kf-marquee 20s linear infinite}
+@keyframes kf-marquee{from{transform:translateX(0)}to{transform:translateX(-50%)}}
+
+/* ---------- KEN BURNS ---------- */
+@keyframes kf-kenburns{0%{transform:scale(1) translate(0,0)}100%{transform:scale(1.15) translate(-2%,-1%)}}
+.anim-kenburns{animation:kf-kenburns 14s ease-in-out infinite alternate}
+
+/* ---------- CONFETTI BURST (pseudo — pure CSS sparkles) ---------- */
+.anim-confetti-burst{position:relative}
+.anim-confetti-burst::before,.anim-confetti-burst::after{
+  content:"";position:absolute;top:50%;left:50%;width:8px;height:8px;border-radius:50%;
+  background:var(--accent);box-shadow:
+    20px -30px 0 var(--accent-2,var(--accent)),-25px -20px 0 var(--accent-3,var(--accent)),
+    30px 20px 0 var(--good,#1aaf6c),-30px 25px 0 var(--warn,#f5a524),
+    40px -10px 0 var(--bad,#e0445a),-45px 0 0 var(--accent),
+    10px 40px 0 var(--accent-2,var(--accent)),-15px -40px 0 var(--accent-3,var(--accent));
+  opacity:0;animation:kf-confetti 1.2s var(--anim-ease) forwards}
+.anim-confetti-burst::after{animation-delay:.15s;transform:rotate(45deg)}
+@keyframes kf-confetti{0%{opacity:0;transform:scale(.2)}30%{opacity:1}100%{opacity:0;transform:scale(2.2)}}
+
+/* ---------- SPOTLIGHT ---------- */
+@keyframes kf-spot{0%{clip-path:circle(0% at 50% 50%)}100%{clip-path:circle(140% at 50% 50%)}}
+.anim-spotlight{animation:kf-spot 1.1s var(--anim-ease) both}
+
+/* ---------- MORPH SHAPE (SVG) ---------- */
+.anim-morph-shape path{animation:kf-morph 6s ease-in-out infinite alternate}
+@keyframes kf-morph{0%{d:path("M60,120 Q120,20 180,120 T300,120")}
+  100%{d:path("M60,120 Q120,220 180,120 T300,120")}}
+
+/* ---------- RIPPLE REVEAL ---------- */
+@keyframes kf-ripple{0%{clip-path:circle(0% at 20% 80%);opacity:.4}
+  100%{clip-path:circle(160% at 20% 80%);opacity:1}}
+.anim-ripple-reveal{animation:kf-ripple 1.2s var(--anim-ease) both}
+
+/* reduced motion */
+@media (prefers-reduced-motion: reduce){
+  [class*="anim-"]{animation:none!important;transition:none!important}
+}
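
Review note: each effect above is exposed as an `anim-<name>` class, and runtime.js in this commit restarts an effect on elements carrying `data-anim` by removing the class, forcing a reflow, and adding it back. The sketch below illustrates that restart together with a wrap-around cycle over a list of names. The helper names `animClass`, `restartAnim`, and `nextAnim` are hypothetical, and the cycling rule is an assumption, since the A-key handler is not visible in this part of the diff.

```javascript
// Hypothetical helpers illustrating the anim-<name> convention (not part of the commit).
function animClass(name) {
  return 'anim-' + name;
}

// Restart a CSS animation on an element: remove the class, force a reflow
// by reading offsetWidth, then add the class again.
function restartAnim(el, name) {
  el.classList.remove(animClass(name));
  void el.offsetWidth;
  el.classList.add(animClass(name));
}

// Assumed cycling rule: advance to the next name and wrap around at the end.
// An unknown current name falls back to the first entry.
function nextAnim(names, current) {
  const i = names.indexOf(current);
  return names[(i + 1) % names.length];
}
```

Without the reflow read between `remove` and `add`, the browser may coalesce the two class changes and the animation would not replay.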
--- a/docs/presentations/skill-replay-eval/assets/base.css
+++ b/docs/presentations/skill-replay-eval/assets/base.css
@ -0,0 +1,150 @@
+/* html-ppt :: base.css — reset + shared tokens + layout primitives */
+/* Default tokens. Themes in assets/themes/*.css override the :root block. */
+:root {
+  --bg: #ffffff;
+  --bg-soft: #f7f7f8;
+  --surface: #ffffff;
+  --surface-2: #f2f2f4;
+  --border: rgba(0,0,0,.08);
+  --border-strong: rgba(0,0,0,.16);
+  --text-1: #111216;
+  --text-2: #55596a;
+  --text-3: #8a8f9e;
+  --accent: #3b6cff;
+  --accent-2: #7a5cff;
+  --accent-3: #ff5c8a;
+  --good: #1aaf6c;
+  --warn: #f5a524;
+  --bad:  #e0445a;
+  --grad: linear-gradient(135deg,#3b6cff,#7a5cff 55%,#ff5c8a);
+  --grad-soft: linear-gradient(135deg,#eef2ff,#f5ecff 55%,#ffeef5);
+  --radius: 18px;
+  --radius-sm: 12px;
+  --radius-lg: 26px;
+  --shadow: 0 10px 30px rgba(18,24,40,.08), 0 2px 6px rgba(18,24,40,.04);
+  --shadow-lg: 0 24px 60px rgba(18,24,40,.14), 0 6px 16px rgba(18,24,40,.06);
+  --font-sans: 'Inter','Noto Sans SC',-apple-system,BlinkMacSystemFont,Helvetica,Arial,sans-serif;
+  --font-serif: 'Playfair Display','Noto Serif SC',Georgia,serif;
+  --font-mono: 'JetBrains Mono','IBM Plex Mono',SFMono-Regular,Menlo,monospace;
+  --font-display: var(--font-sans);
+  --letter-tight: -.03em;
+  --letter-normal: -.01em;
+  --ease: cubic-bezier(.4,0,.2,1);
+}
+
+*,*::before,*::after{box-sizing:border-box}
+html,body{margin:0;padding:0;background:var(--bg);color:var(--text-1);
+  font-family:var(--font-sans);font-weight:400;line-height:1.6;
+  -webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale;
+  letter-spacing:var(--letter-normal)}
+img,svg,video{max-width:100%;display:block}
+a{color:var(--accent);text-decoration:none}
+a:hover{text-decoration:underline}
+code,kbd,pre,samp{font-family:var(--font-mono)}
+
+/* ================= SLIDE SYSTEM ================= */
+.deck{position:relative;width:100vw;height:100vh;overflow:hidden;background:var(--bg)}
+.slide{
+  position:absolute;inset:0;
+  display:flex;flex-direction:column;justify-content:center;
+  padding:72px 96px;
+  box-sizing:border-box;
+  opacity:0;pointer-events:none;
+  transition:opacity .5s var(--ease), transform .5s var(--ease);
+  transform:translateX(30px);
+  overflow:hidden;
+}
+.slide.is-active{opacity:1;pointer-events:auto;transform:translateX(0);z-index:2}
+.slide.is-prev{transform:translateX(-30px)}
+
+/* single-page standalone (used when a layout file is opened directly) */
+body.single .slide{position:relative;width:100vw;height:100vh;opacity:1;transform:none;pointer-events:auto}
+
+/* ================= TYPOGRAPHY ================= */
+.eyebrow{font-size:13px;font-weight:500;letter-spacing:.16em;text-transform:uppercase;color:var(--text-3)}
+.kicker{font-size:14px;font-weight:600;color:var(--accent);letter-spacing:.08em;text-transform:uppercase}
+h1.title,.h1{font-family:var(--font-display);font-size:72px;line-height:1.05;font-weight:800;letter-spacing:var(--letter-tight);margin:0 0 18px;color:var(--text-1)}
+h2.title,.h2{font-family:var(--font-display);font-size:54px;line-height:1.1;font-weight:700;letter-spacing:var(--letter-tight);margin:0 0 14px}
+h3,.h3{font-size:32px;line-height:1.2;font-weight:600;letter-spacing:var(--letter-normal);margin:0 0 10px}
+h4,.h4{font-size:22px;line-height:1.3;font-weight:600;margin:0 0 8px}
+.lede{font-size:22px;line-height:1.55;color:var(--text-2);font-weight:300;max-width:62ch}
+.dim{color:var(--text-2)}
+.dim2{color:var(--text-3)}
+.mono{font-family:var(--font-mono)}
+.serif{font-family:var(--font-serif)}
+.gradient-text{background:var(--grad);-webkit-background-clip:text;background-clip:text;-webkit-text-fill-color:transparent;color:transparent}
+
+/* ================= LAYOUT PRIMITIVES ================= */
+.stack>*+*{margin-top:14px}
+.row{display:flex;gap:24px;align-items:center}
+.row.wrap{flex-wrap:wrap}
+.grid{display:grid;gap:24px}
+.g2{grid-template-columns:repeat(2,1fr)}
+.g3{grid-template-columns:repeat(3,1fr)}
+.g4{grid-template-columns:repeat(4,1fr)}
+.center{display:flex;align-items:center;justify-content:center;text-align:center}
+.fill{flex:1}
+.sp-t{padding-top:24px}.sp-b{padding-bottom:24px}
+.mt-s{margin-top:8px}.mt-m{margin-top:18px}.mt-l{margin-top:32px}
+.mb-s{margin-bottom:8px}.mb-m{margin-bottom:18px}.mb-l{margin-bottom:32px}
+
+/* ================= CARDS ================= */
+.card{background:var(--surface);border:1px solid var(--border);border-radius:var(--radius);
+  padding:26px 28px;box-shadow:var(--shadow);position:relative;overflow:hidden}
+.card-soft{background:var(--surface-2);border:1px solid var(--border)}
+.card-outline{background:transparent;border:1.5px solid var(--border-strong);box-shadow:none}
+.card-accent{background:var(--surface);border-top:3px solid var(--accent)}
+.card-hover{transition:transform .3s var(--ease),box-shadow .3s var(--ease)}
+.card-hover:hover{transform:translateY(-4px);box-shadow:var(--shadow-lg)}
+
+.pill{display:inline-block;padding:4px 12px;border-radius:999px;font-size:12px;font-weight:500;
+  background:var(--surface-2);color:var(--text-2);border:1px solid var(--border)}
+.pill-accent{background:color-mix(in srgb,var(--accent) 12%,transparent);color:var(--accent);border-color:color-mix(in srgb,var(--accent) 28%,transparent)}
+
+/* ================= BARS / DIVIDERS ================= */
+.divider{height:1px;background:var(--border);width:100%}
+.divider-accent{height:3px;width:72px;background:var(--accent);border-radius:2px}
+
+/* ================= CHROME (header/footer/progress) ================= */
+.deck-header{position:absolute;top:24px;left:40px;right:40px;display:flex;align-items:center;justify-content:space-between;
+  font-size:12px;color:var(--text-3);letter-spacing:.12em;text-transform:uppercase;z-index:10;pointer-events:none}
+.deck-footer{position:absolute;bottom:24px;left:40px;right:40px;display:flex;align-items:center;justify-content:space-between;
+  font-size:12px;color:var(--text-3);z-index:10;pointer-events:none}
+.slide-number::before{content:attr(data-current)}
+.slide-number::after{content:" / " attr(data-total)}
+.progress-bar{position:fixed;left:0;right:0;bottom:0;height:3px;background:transparent;z-index:20}
+.progress-bar > span{display:block;height:100%;width:0;background:var(--accent);transition:width .3s var(--ease)}
+
+/* ================= PRESENTER / OVERVIEW ================= */
+.notes{display:none!important}
+.notes-overlay{position:fixed;inset:auto 0 0 0;max-height:42vh;background:rgba(20,22,30,.95);color:#e8ebf4;
+  padding:20px 32px;font-size:16px;line-height:1.6;border-top:1px solid rgba(255,255,255,.1);transform:translateY(100%);
+  transition:transform .3s var(--ease);z-index:40;overflow:auto;font-family:var(--font-sans)}
+.notes-overlay.open{transform:translateY(0)}
+.overview{position:fixed;inset:0;background:rgba(10,12,18,.92);backdrop-filter:blur(12px);z-index:50;
+  display:none;padding:40px;overflow:auto}
+.overview.open{display:grid;grid-template-columns:repeat(4,1fr);gap:20px;align-content:start}
+.overview .thumb{background:var(--surface);border:1px solid var(--border);border-radius:12px;
+  aspect-ratio:16/9;overflow:hidden;cursor:pointer;position:relative;color:var(--text-1);padding:16px;
+  font-size:11px;transition:transform .2s var(--ease)}
+.overview .thumb:hover{transform:scale(1.04)}
+.overview .thumb .n{position:absolute;top:8px;left:10px;font-weight:700;font-size:14px;color:var(--text-3)}
+.overview .thumb .t{position:absolute;bottom:10px;left:14px;right:14px;font-weight:600;color:var(--text-1)}
+
+/* ================= PRESENTER VIEW ================= */
+/* Presenter view opens in a separate popup window (S key).
+ * All presenter styles are self-contained in the popup HTML generated by runtime.js.
+ * The audience window (styled by this file) is not affected; it stays in the normal deck view.
+ * Only the .notes rule above is needed to hide speaker notes from the audience. */
+
+/* ================= UTILITY ================= */
+.hidden{display:none!important}
+.nowrap{white-space:nowrap}
+.tr{text-align:right}.tc{text-align:center}.tl{text-align:left}
+.uppercase{text-transform:uppercase;letter-spacing:.12em}
+
+/* ================= PRINT ================= */
+@media print{
+  .slide{position:relative;opacity:1!important;transform:none!important;page-break-after:always;height:100vh}
+  .deck-header,.deck-footer,.progress-bar,.notes-overlay,.overview{display:none!important}
+}
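
Review note: the `.slide-number` pseudo-elements and the `.progress-bar > span` width above are driven by values that runtime.js writes on every navigation. A minimal sketch of that calculation follows. The helper name `chromeState` is hypothetical: the commit computes these values inline in `go()` rather than through a helper.

```javascript
// Hypothetical helper (not part of the commit): derives the values that
// runtime.js writes to data-current / data-total and to the progress bar width.
function chromeState(n, total) {
  // Clamp the requested index into [0, total - 1], as go() does.
  const i = Math.max(0, Math.min(total - 1, n));
  return {
    current: i + 1,                        // 1-based, read by attr(data-current)
    total: total,                          // read by attr(data-total)
    width: ((i + 1) / total * 100) + '%'   // applied to .progress-bar > span
  };
}
```

For a four-slide deck, `chromeState(0, 4)` yields a counter of 1 / 4 and a bar width of `25%`. Out-of-range indices are clamped, so the bar never exceeds `100%`.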
--- a/docs/presentations/skill-replay-eval/assets/fonts.css
+++ b/docs/presentations/skill-replay-eval/assets/fonts.css
@ -0,0 +1,9 @@
+/* html-ppt :: shared webfonts */
+@import url('https://fonts.googleapis.com/css2?family=Inter:wght@200;300;400;500;600;700;800;900&display=swap');
+@import url('https://fonts.googleapis.com/css2?family=Noto+Sans+SC:wght@200;300;400;500;600;700;900&display=swap');
+@import url('https://fonts.googleapis.com/css2?family=Noto+Serif+SC:wght@300;400;600;700&display=swap');
+@import url('https://fonts.googleapis.com/css2?family=JetBrains+Mono:wght@400;500;700&display=swap');
+@import url('https://fonts.googleapis.com/css2?family=Playfair+Display:ital,wght@0,400;0,600;0,800;1,400&display=swap');
+@import url('https://fonts.googleapis.com/css2?family=Space+Grotesk:wght@300;400;500;600;700&display=swap');
+@import url('https://fonts.googleapis.com/css2?family=IBM+Plex+Mono:wght@300;400;500;700&display=swap');
+@import url('https://fonts.googleapis.com/css2?family=Archivo+Black&display=swap');
--- a/docs/presentations/skill-replay-eval/assets/runtime.js
+++ b/docs/presentations/skill-replay-eval/assets/runtime.js
@ -0,0 +1,960 @@
+/* html-ppt :: runtime.js
+ * Keyboard-driven deck runtime. Zero dependencies.
+ *
+ * Features:
+ *   ← → / space / PgUp PgDn / Home End  navigation
+ *   F  fullscreen
+ *   S  presenter mode (opens a NEW WINDOW with current/next slide preview + notes + timer)
+ *       The original window stays as audience view, synced via BroadcastChannel.
+ *       Slide previews use CSS transform:scale() at design resolution for pixel-perfect layout.
+ *   N  quick notes overlay (bottom drawer)
+ *   O  slide overview grid
+ *   T  cycle themes (reads data-themes on <html> or <body>)
+ *   A  cycle demo animation on current slide
+ *   URL hash #/N  deep-link to slide N (1-based)
+ *   Progress bar auto-managed
+ */
+(function () {
+  'use strict';
+
+  const ANIMS = ['fade-up','fade-down','fade-left','fade-right','rise-in','drop-in',
+    'zoom-pop','blur-in','glitch-in','typewriter','neon-glow','shimmer-sweep',
+    'gradient-flow','stagger-list','counter-up','path-draw','parallax-tilt',
+    'card-flip-3d','cube-rotate-3d','page-turn-3d','perspective-zoom',
+    'marquee-scroll','kenburns','confetti-burst','spotlight','morph-shape','ripple-reveal'];
+
+  function ready(fn){ if(document.readyState!=='loading')fn(); else document.addEventListener('DOMContentLoaded',fn);}
+
+  /* ========== Parse URL for preview-only mode ==========
+   * When loaded as iframe.src = "index.html?preview=3", runtime enters a
+   * locked single-slide mode: only slide N is visible, no chrome, no keys,
+   * no hash updates. This is how the presenter window shows pixel-perfect
+   * previews — by loading the actual deck file in an iframe and telling it
+   * to display only a specific slide.
+   */
+  function getPreviewIdx() {
+    const m = /[?&]preview=(\d+)/.exec(location.search || '');
+    return m ? parseInt(m[1], 10) - 1 : -1;
+  }
+
+  ready(function () {
+    const deck = document.querySelector('.deck');
+    if (!deck) return;
+    const slides = Array.from(deck.querySelectorAll('.slide'));
+    if (!slides.length) return;
+
+    const previewOnlyIdx = getPreviewIdx();
+    const isPreviewMode = previewOnlyIdx >= 0 && previewOnlyIdx < slides.length;
+
+    /* ===== Preview-only mode: show one slide, hide everything else ===== */
+    if (isPreviewMode) {
+      function showSlide(i) {
+        slides.forEach((s, j) => {
+          const active = (j === i);
+          s.classList.toggle('is-active', active);
+          s.style.display = active ? '' : 'none';
+          if (active) {
+            s.style.opacity = '1';
+            s.style.transform = 'none';
+            s.style.pointerEvents = 'auto';
+          }
+        });
+      }
+      showSlide(previewOnlyIdx);
+      /* Hide chrome that the presenter shouldn't see in preview */
+      const hideSel = '.progress-bar, .notes-overlay, .overview, .notes, aside.notes, .speaker-notes';
+      document.querySelectorAll(hideSel).forEach(el => { el.style.display = 'none'; });
+      document.documentElement.setAttribute('data-preview', '1');
+      document.body.setAttribute('data-preview', '1');
+      /* Auto-detect theme base path for theme switching in preview mode */
+      function getPreviewThemeBase() {
+        const base = document.documentElement.getAttribute('data-theme-base');
+        if (base) return base;
+        const tl = document.getElementById('theme-link');
+        if (tl) {
+          const raw = tl.getAttribute('href') || '';
+          const ls = raw.lastIndexOf('/');
+          if (ls >= 0) return raw.substring(0, ls + 1);
+        }
+        return 'assets/themes/';
+      }
+      const previewThemeBase = getPreviewThemeBase();
+
+      /* Listen for postMessage from parent presenter window:
+       *  - preview-goto: switch visible slide WITHOUT reloading
+       *  - preview-theme: switch theme CSS link to match audience window */
+      window.addEventListener('message', function(e) {
+        if (!e.data) return;
+        if (e.data.type === 'preview-goto') {
+          const n = parseInt(e.data.idx, 10);
+          if (n >= 0 && n < slides.length) showSlide(n);
+        } else if (e.data.type === 'preview-theme' && e.data.name) {
+          let link = document.getElementById('theme-link');
+          if (!link) {
+            link = document.createElement('link');
+            link.rel = 'stylesheet';
+            link.id = 'theme-link';
+            document.head.appendChild(link);
+          }
+          link.href = previewThemeBase + e.data.name + '.css';
+          document.documentElement.setAttribute('data-theme', e.data.name);
+        }
+      });
+      /* Signal to parent that preview iframe is ready */
+      try { window.parent && window.parent.postMessage({ type: 'preview-ready' }, '*'); } catch(e) {}
+      return;
+    }
+
+    let idx = 0;
+    const total = slides.length;
+
+    /* ===== BroadcastChannel for presenter sync ===== */
+    const CHANNEL_NAME = 'html-ppt-presenter-' + location.pathname;
+    let bc;
+    try { bc = new BroadcastChannel(CHANNEL_NAME); } catch(e) { bc = null; }
+
+    // Legacy flag: always false now that the presenter popup is generated HTML and never runs this script
+    const isPresenterWindow = false;
+
+    /* ===== progress bar ===== */
+    let bar = document.querySelector('.progress-bar');
+    if (!bar) {
+      bar = document.createElement('div');
+      bar.className = 'progress-bar';
+      bar.innerHTML = '<span></span>';
+      document.body.appendChild(bar);
+    }
+    const barFill = bar.querySelector('span');
+
+    /* ===== notes overlay (N key) ===== */
+    let notes = document.querySelector('.notes-overlay');
+    if (!notes) {
+      notes = document.createElement('div');
+      notes.className = 'notes-overlay';
+      document.body.appendChild(notes);
+    }
+
+    /* ===== overview grid (O key) ===== */
+    let overview = document.querySelector('.overview');
+    if (!overview) {
+      overview = document.createElement('div');
+      overview.className = 'overview';
+      slides.forEach((s, i) => {
+        const t = document.createElement('div');
+        t.className = 'thumb';
+        // Force a 16:9 box with the padding-bottom technique (56.25% = 9/16)
+        t.style.padding = '0 0 56.25% 0';
+        t.style.height = '0';
+        t.style.position = 'relative';
+        t.style.overflow = 'hidden';
+
+        const title = s.getAttribute('data-title') ||
+          (s.querySelector('h1,h2,h3')||{}).textContent || ('Slide '+(i+1));
+        
+        // Create a container for the mini-slide
+        const mini = document.createElement('div');
+        mini.className = 'mini-slide';
+        mini.style.position = 'absolute';
+        mini.style.top = '0';
+        mini.style.left = '0';
+        mini.style.width = '1920px';
+        mini.style.height = '1080px';
+        mini.style.transformOrigin = 'top left';
+        mini.style.pointerEvents = 'none';
+        mini.style.background = 'var(--bg)';
+        
+        // Clone the slide content
+        const clone = s.cloneNode(true);
+        clone.className = 'slide is-active'; // force active styles
+        clone.style.position = 'absolute';
+        clone.style.inset = '0';
+        clone.style.transform = 'none';
+        clone.style.opacity = '1';
+        clone.style.padding = '72px 96px'; // ensure padding is kept
+        
+        mini.appendChild(clone);
+        t.appendChild(mini);
+
+        // Add the number and title overlay
+        const overlay = document.createElement('div');
+        overlay.style.position = 'absolute';
+        overlay.style.inset = '0';
+        overlay.style.background = 'linear-gradient(to bottom, rgba(0,0,0,0.2) 0%, transparent 40%, transparent 60%, rgba(0,0,0,0.8) 100%)';
+        overlay.style.color = '#fff';
+        overlay.style.zIndex = '10';
+        overlay.style.pointerEvents = 'none';
+        
+        const n = document.createElement('div');
+        n.className = 'n';
+        n.textContent = i + 1;
+        n.style.position = 'absolute';
+        n.style.top = '12px';
+        n.style.left = '16px';
+        n.style.fontWeight = '700';
+        n.style.fontSize = '16px';
+        n.style.color = '#fff';
+        n.style.textShadow = '0 1px 4px rgba(0,0,0,0.8)';
+        
+        const text = document.createElement('div');
+        text.className = 't';
+        text.textContent = title.trim().slice(0,80);
+        text.style.position = 'absolute';
+        text.style.bottom = '12px';
+        text.style.left = '16px';
+        text.style.right = '16px';
+        text.style.fontWeight = '600';
+        text.style.fontSize = '14px';
+        text.style.color = '#fff';
+        text.style.textShadow = '0 1px 4px rgba(0,0,0,0.8)';
+        
+        overlay.appendChild(n);
+        overlay.appendChild(text);
+        t.appendChild(overlay);
+
+        t.addEventListener('click', () => { go(i); toggleOverview(false); });
+        overview.appendChild(t);
+      });
+      document.body.appendChild(overview);
+    }
+
+    /* ===== navigation ===== */
+    function go(n, fromRemote){
+      n = Math.max(0, Math.min(total-1, n));
+      slides.forEach((s,i) => {
+        s.classList.toggle('is-active', i===n);
+        s.classList.toggle('is-prev', i<n);
+      });
+      idx = n;
+      barFill.style.width = ((n+1)/total*100)+'%';
+      const numEl = document.querySelector('.slide-number');
+      if (numEl) { numEl.setAttribute('data-current', n+1); numEl.setAttribute('data-total', total); }
+
+      // notes (bottom overlay)
+      const note = slides[n].querySelector('.notes, aside.notes, .speaker-notes');
+      notes.innerHTML = note ? note.innerHTML : '';
+
+      // hash
+      const hashTarget = '#/'+(n+1);
+      if (location.hash !== hashTarget && !isPresenterWindow) {
+        history.replaceState(null,'', hashTarget);
+      }
+
+      // re-trigger entry animations
+      slides[n].querySelectorAll('[data-anim]').forEach(el => {
+        const a = el.getAttribute('data-anim');
+        el.classList.remove('anim-'+a);
+        void el.offsetWidth;
+        el.classList.add('anim-'+a);
+      });
+
+      // counter-up
+      slides[n].querySelectorAll('.counter').forEach(el => {
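+        // Cache the target in data-to: textContent is overwritten while the animation runs
+        if (!el.hasAttribute('data-to')) el.setAttribute('data-to', el.textContent);
+        const target = parseFloat(el.getAttribute('data-to'));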
+        const dur = parseInt(el.getAttribute('data-dur')||'1200',10);
+        const start = performance.now();
+        const from = 0;
+        function tick(now){
+          const t = Math.min(1,(now-start)/dur);
+          const v = from + (target-from)*(1-Math.pow(1-t,3));
+          el.textContent = (target % 1 === 0) ? Math.round(v) : v.toFixed(1);
+          if (t<1) requestAnimationFrame(tick);
+        }
+        requestAnimationFrame(tick);
+      });
+
+      // Broadcast to other window (audience ↔ presenter)
+      if (!fromRemote && bc) {
+        bc.postMessage({ type: 'go', idx: n });
+      }
+    }
+
+    /* ===== listen for remote navigation / theme changes ===== */
+    if (bc) {
+      bc.onmessage = function(e) {
+        if (!e.data) return;
+        if (e.data.type === 'go' && typeof e.data.idx === 'number') {
+          go(e.data.idx, true);
+        } else if (e.data.type === 'theme' && e.data.name) {
+          /* Sync theme across windows */
+          const i = themes.indexOf(e.data.name);
+          if (i >= 0) themeIdx = i;
+          applyTheme(e.data.name);
+        }
+      };
+    }
+
+    function toggleNotes(force){ notes.classList.toggle('open', force!==undefined?force:!notes.classList.contains('open')); }
+    function toggleOverview(force){
+      const isOpen = force!==undefined ? force : !overview.classList.contains('open');
+      overview.classList.toggle('open', isOpen);
+      if (isOpen) {
+        requestAnimationFrame(() => {
+          const thumbs = overview.querySelectorAll('.thumb');
+          if (thumbs.length) {
+            const scale = thumbs[0].clientWidth / 1920;
+            overview.querySelectorAll('.mini-slide').forEach(m => {
+              m.style.transform = 'scale(' + scale + ')';
+            });
+          }
+        });
+      }
+    }
+
+    /* ========== PRESENTER MODE — Magnetic-card popup window ========== */
+    /* Opens a new window with 4 draggable, resizable cards:
+     *   CURRENT  — iframe(?preview=N)   pixel-perfect preview of current slide
+     *   NEXT     — iframe(?preview=N+1) pixel-perfect preview of next slide
+     *   SCRIPT   — large speaker notes (verbatim script)
+     *   TIMER    — elapsed timer + page counter + controls
+     * Cards remember position/size in localStorage.
+     * Two windows sync via BroadcastChannel.
+     */
+    let presenterWin = null;
+
+    function openPresenterWindow() {
+      if (presenterWin && !presenterWin.closed) {
+        presenterWin.focus();
+        return;
+      }
+
+      // Build absolute URL of THIS deck file (without hash/query)
+      const deckUrl = location.protocol + '//' + location.host + location.pathname;
+
+      // Collect slide titles + notes (HTML strings)
+      const slideMeta = slides.map((s, i) => {
+        const note = s.querySelector('.notes, aside.notes, .speaker-notes');
+        return {
+          title: s.getAttribute('data-title') ||
+            (s.querySelector('h1,h2,h3')||{}).textContent || ('Slide '+(i+1)),
+          notes: note ? note.innerHTML : ''
+        };
+      });
+
+      /* Capture current theme so presenter previews match the audience */
+      const currentTheme = root.getAttribute('data-theme') || (themes[themeIdx] || '');
+      const presenterHTML = buildPresenterHTML(deckUrl, slideMeta, total, idx, CHANNEL_NAME, currentTheme);
+
+      presenterWin = window.open('', 'html-ppt-presenter', 'width=1280,height=820,menubar=no,toolbar=no');
+      if (!presenterWin) {
+        alert('Please allow pop-ups to use the presenter view');
+        return;
+      }
+      presenterWin.document.open();
+      presenterWin.document.write(presenterHTML);
+      presenterWin.document.close();
+    }
+
+    function buildPresenterHTML(deckUrl, slideMeta, total, startIdx, channelName, currentTheme) {
+      const metaJSON = JSON.stringify(slideMeta);
+      const deckUrlJSON = JSON.stringify(deckUrl);
+      const channelJSON = JSON.stringify(channelName);
+      const themeJSON = JSON.stringify(currentTheme || '');
+      const storageKey = 'html-ppt-presenter:' + location.pathname;
+
+      // Build the document as a single template string for clarity
+      return `<!DOCTYPE html>
+<html lang="zh-CN">
+<head>
+<meta charset="utf-8">
+<title>Presenter View</title>
+<style>
+  * { margin: 0; padding: 0; box-sizing: border-box; }
+  html, body {
+    width: 100%; height: 100%; overflow: hidden;
+    background: #1a1d24;
+    background-image:
+      radial-gradient(circle at 20% 30%, rgba(88,166,255,.04), transparent 50%),
+      radial-gradient(circle at 80% 70%, rgba(188,140,255,.04), transparent 50%);
+    color: #e6edf3;
+    font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", "Noto Sans SC", sans-serif;
+  }
+  /* Stage: positioned area where cards live */
+  #stage { position: absolute; inset: 0; overflow: hidden; }
+
+  /* Magnetic card */
+  .pcard {
+    position: absolute;
+    background: #0d1117;
+    border: 1px solid rgba(255,255,255,.1);
+    border-radius: 12px;
+    box-shadow: 0 8px 32px rgba(0,0,0,.45), 0 0 0 1px rgba(255,255,255,.02);
+    display: flex; flex-direction: column;
+    overflow: hidden;
+    min-width: 180px; min-height: 100px;
+    transition: box-shadow .2s, border-color .2s;
+  }
+  .pcard.dragging { box-shadow: 0 16px 48px rgba(0,0,0,.6), 0 0 0 2px rgba(88,166,255,.5); border-color: #58a6ff; transition: none; z-index: 9999; }
+  .pcard.resizing { box-shadow: 0 16px 48px rgba(0,0,0,.6), 0 0 0 2px rgba(63,185,80,.5); border-color: #3fb950; transition: none; z-index: 9999; }
+  .pcard:hover { border-color: rgba(88,166,255,.3); }
+
+  /* Card header (drag handle) */
+  .pcard-head {
+    display: flex; align-items: center; gap: 10px;
+    padding: 8px 12px;
+    background: rgba(255,255,255,.04);
+    border-bottom: 1px solid rgba(255,255,255,.06);
+    cursor: move;
+    user-select: none;
+    flex-shrink: 0;
+  }
+  .pcard-dot { width: 8px; height: 8px; border-radius: 50%; background: var(--dot-color, #58a6ff); flex-shrink: 0; }
+  .pcard-title {
+    font-size: 11px; letter-spacing: .15em; text-transform: uppercase;
+    font-weight: 700; color: #8b949e; flex: 1;
+  }
+  .pcard-meta { font-size: 11px; color: #6e7681; }
+
+  /* Card body */
+  .pcard-body { flex: 1; position: relative; overflow: hidden; min-height: 0; }
+
+  /* Preview cards (CURRENT/NEXT) — iframe-based pixel-perfect render */
+  .pcard-preview .pcard-body { background: #000; }
+  .pcard-preview iframe {
+    position: absolute; top: 0; left: 0;
+    width: 1920px; height: 1080px;
+    border: none;
+    transform-origin: top left;
+    pointer-events: none;
+    background: transparent;
+  }
+  .pcard-preview .preview-end {
+    position: absolute; inset: 0;
+    display: flex; align-items: center; justify-content: center;
+    color: #484f58; font-size: 14px; letter-spacing: .12em;
+  }
+
+  /* Notes card */
+  .pcard-notes .pcard-body {
+    padding: 14px 18px;
+    overflow-y: auto;
+    font-size: 18px; line-height: 1.75;
+    color: #d0d7de;
+    font-family: "Noto Sans SC", -apple-system, sans-serif;
+  }
+  .pcard-notes .pcard-body p { margin: 0 0 .7em 0; }
+  .pcard-notes .pcard-body strong { color: #f0883e; }
+  .pcard-notes .pcard-body em { color: #58a6ff; font-style: normal; }
+  .pcard-notes .pcard-body code {
+    font-family: "SF Mono", monospace; font-size: .9em;
+    background: rgba(255,255,255,.08); padding: 1px 6px; border-radius: 4px;
+  }
+  .pcard-notes .empty { color: #484f58; font-style: italic; }
+
+  /* Timer card */
+  .pcard-timer .pcard-body {
+    display: flex; flex-direction: column; gap: 14px;
+    padding: 18px 20px; justify-content: center;
+  }
+  .timer-display {
+    font-family: "SF Mono", "JetBrains Mono", monospace;
+    font-size: 42px; font-weight: 700;
+    color: #3fb950;
+    letter-spacing: .04em;
+    line-height: 1;
+  }
+  .timer-row {
+    display: flex; align-items: center; gap: 12px;
+    font-size: 14px; color: #8b949e;
+  }
+  .timer-row .label { font-size: 10px; letter-spacing: .15em; text-transform: uppercase; color: #6e7681; }
+  .timer-row .val { color: #e6edf3; font-weight: 600; font-family: "SF Mono", monospace; }
+  .timer-controls { display: flex; gap: 8px; flex-wrap: wrap; }
+  .timer-btn {
+    background: rgba(255,255,255,.06);
+    border: 1px solid rgba(255,255,255,.1);
+    color: #e6edf3;
+    padding: 6px 12px;
+    border-radius: 6px;
+    font-size: 12px;
+    cursor: pointer;
+    font-family: inherit;
+  }
+  .timer-btn:hover { background: rgba(88,166,255,.15); border-color: #58a6ff; }
+  .timer-btn:active { transform: translateY(1px); }
+
+  /* Resize handle */
+  .pcard-resize {
+    position: absolute; right: 0; bottom: 0;
+    width: 18px; height: 18px;
+    cursor: nwse-resize;
+    background: linear-gradient(135deg, transparent 50%, rgba(255,255,255,.25) 50%, rgba(255,255,255,.25) 60%, transparent 60%, transparent 70%, rgba(255,255,255,.25) 70%, rgba(255,255,255,.25) 80%, transparent 80%);
+    z-index: 5;
+  }
+  .pcard-resize:hover { background: linear-gradient(135deg, transparent 50%, #58a6ff 50%, #58a6ff 60%, transparent 60%, transparent 70%, #58a6ff 70%, #58a6ff 80%, transparent 80%); }
+
+  /* Bottom hint bar */
+  .hint-bar {
+    position: fixed; bottom: 0; left: 0; right: 0;
+    background: rgba(0,0,0,.6);
+    backdrop-filter: blur(10px);
+    border-top: 1px solid rgba(255,255,255,.08);
+    padding: 6px 16px;
+    font-size: 11px; color: #8b949e;
+    display: flex; gap: 18px; align-items: center;
+    z-index: 1000;
+  }
+  .hint-bar kbd {
+    background: rgba(255,255,255,.08);
+    padding: 1px 6px; border-radius: 3px;
+    font-family: "SF Mono", monospace;
+    font-size: 10px;
+    border: 1px solid rgba(255,255,255,.1);
+    color: #e6edf3;
+  }
+  .hint-bar .reset-layout {
+    margin-left: auto;
+    background: transparent; border: 1px solid rgba(255,255,255,.15);
+    color: #8b949e; padding: 3px 10px; border-radius: 4px;
+    font-size: 11px; cursor: pointer; font-family: inherit;
+  }
+  .hint-bar .reset-layout:hover { background: rgba(248,81,73,.15); border-color: #f85149; color: #f85149; }
+
+  body.is-dragging-card * { user-select: none !important; }
+  body.is-dragging-card iframe { pointer-events: none !important; }
+</style>
+</head>
+<body>
+
+<div id="stage">
+  <div class="pcard pcard-preview" id="card-cur" style="--dot-color:#58a6ff">
+    <div class="pcard-head" data-drag>
+      <span class="pcard-dot"></span>
+      <span class="pcard-title">CURRENT</span>
+      <span class="pcard-meta" id="cur-meta">—</span>
+    </div>
+    <div class="pcard-body"><iframe id="iframe-cur"></iframe></div>
+    <div class="pcard-resize" data-resize></div>
+  </div>
+
+  <div class="pcard pcard-preview" id="card-nxt" style="--dot-color:#bc8cff">
+    <div class="pcard-head" data-drag>
+      <span class="pcard-dot"></span>
+      <span class="pcard-title">NEXT</span>
+      <span class="pcard-meta" id="nxt-meta">—</span>
+    </div>
+    <div class="pcard-body"><iframe id="iframe-nxt"></iframe></div>
+    <div class="pcard-resize" data-resize></div>
+  </div>
+
+  <div class="pcard pcard-notes" id="card-notes" style="--dot-color:#f0883e">
+    <div class="pcard-head" data-drag>
+      <span class="pcard-dot"></span>
+      <span class="pcard-title">SPEAKER SCRIPT</span>
+    </div>
+    <div class="pcard-body" id="notes-body"></div>
+    <div class="pcard-resize" data-resize></div>
+  </div>
+
+  <div class="pcard pcard-timer" id="card-timer" style="--dot-color:#3fb950">
+    <div class="pcard-head" data-drag>
+      <span class="pcard-dot"></span>
+      <span class="pcard-title">TIMER</span>
+    </div>
+    <div class="pcard-body">
+      <div class="timer-display" id="timer-display">00:00</div>
+      <div class="timer-row">
+        <span class="label">Slide</span>
+        <span class="val" id="timer-count">1 / ${total}</span>
+      </div>
+      <div class="timer-controls">
+        <button class="timer-btn" id="btn-prev">← Prev</button>
+        <button class="timer-btn" id="btn-next">Next →</button>
+        <button class="timer-btn" id="btn-reset">⏱ Reset</button>
+      </div>
+    </div>
+    <div class="pcard-resize" data-resize></div>
+  </div>
+</div>
+
+<div class="hint-bar">
+  <span><kbd>← →</kbd> Navigate</span>
+  <span><kbd>R</kbd> Reset timer</span>
+  <span><kbd>Esc</kbd> Close</span>
+  <span style="color:#6e7681">Drag a card header to move · drag the bottom-right corner to resize</span>
+  <button class="reset-layout" id="reset-layout">Reset layout</button>
+</div>
+
+<script>
+(function(){
+  var slideMeta = ${metaJSON};
+  var total = ${total};
+  var idx = ${startIdx};
+  var deckUrl = ${deckUrlJSON};
+  var STORAGE_KEY = ${JSON.stringify(storageKey)};
+  var bc;
+  try { bc = new BroadcastChannel(${channelJSON}); } catch(e) {}
+
+  var iframeCur = document.getElementById('iframe-cur');
+  var iframeNxt = document.getElementById('iframe-nxt');
+  var notesBody = document.getElementById('notes-body');
+  var curMeta = document.getElementById('cur-meta');
+  var nxtMeta = document.getElementById('nxt-meta');
+  var timerDisplay = document.getElementById('timer-display');
+  var timerCount = document.getElementById('timer-count');
+
+  /* ===== Default card layout ===== */
+  function defaultLayout() {
+    var w = window.innerWidth;
+    var h = window.innerHeight - 36; /* leave room for hint bar */
+    return {
+      'card-cur':   { x: 16,        y: 16,            w: Math.round(w*0.55) - 24, h: Math.round(h*0.62) - 16 },
+      'card-nxt':   { x: Math.round(w*0.55) + 8, y: 16, w: w - Math.round(w*0.55) - 24, h: Math.round(h*0.42) - 16 },
+      'card-notes': { x: Math.round(w*0.55) + 8, y: Math.round(h*0.42) + 8, w: w - Math.round(w*0.55) - 24, h: h - Math.round(h*0.42) - 16 },
+      'card-timer': { x: 16,        y: Math.round(h*0.62) + 8, w: Math.round(w*0.55) - 24, h: h - Math.round(h*0.62) - 16 }
+    };
+  }
+
+  /* ===== Apply / save / restore layout ===== */
+  function applyLayout(layout) {
+    Object.keys(layout).forEach(function(id){
+      var el = document.getElementById(id);
+      var l = layout[id];
+      if (el && l) {
+        el.style.left = l.x + 'px';
+        el.style.top = l.y + 'px';
+        el.style.width = l.w + 'px';
+        el.style.height = l.h + 'px';
+      }
+    });
+    rescaleAll();
+  }
+  function readLayout() {
+    try {
+      var saved = localStorage.getItem(STORAGE_KEY);
+      if (saved) return JSON.parse(saved);
+    } catch(e) {}
+    return defaultLayout();
+  }
+  function saveLayout() {
+    var layout = {};
+    ['card-cur','card-nxt','card-notes','card-timer'].forEach(function(id){
+      var el = document.getElementById(id);
+      if (el) {
+        layout[id] = {
+          x: parseInt(el.style.left,10) || 0,
+          y: parseInt(el.style.top,10) || 0,
+          w: parseInt(el.style.width,10) || 300,
+          h: parseInt(el.style.height,10) || 200
+        };
+      }
+    });
+    try { localStorage.setItem(STORAGE_KEY, JSON.stringify(layout)); } catch(e) {}
+  }
+
+  /* ===== iframe rescale to fit card body ===== */
+  function rescaleIframe(iframe) {
+    if (!iframe || iframe.style.display === 'none') return;
+    var body = iframe.parentElement;
+    var cw = body.clientWidth, ch = body.clientHeight;
+    if (!cw || !ch) return;
+    var s = Math.min(cw / 1920, ch / 1080);
+    iframe.style.transform = 'scale(' + s + ')';
+    /* Center the scaled iframe in the body */
+    var sw = 1920 * s, sh = 1080 * s;
+    iframe.style.left = Math.max(0, (cw - sw) / 2) + 'px';
+    iframe.style.top = Math.max(0, (ch - sh) / 2) + 'px';
+  }
+  function rescaleAll() {
+    rescaleIframe(iframeCur);
+    rescaleIframe(iframeNxt);
+  }
+  window.addEventListener('resize', rescaleAll);
+
+  /* ===== Drag (move card by header) ===== */
+  document.querySelectorAll('[data-drag]').forEach(function(handle){
+    handle.addEventListener('mousedown', function(e){
+      if (e.button !== 0) return;
+      var card = handle.closest('.pcard');
+      if (!card) return;
+      e.preventDefault();
+      card.classList.add('dragging');
+      document.body.classList.add('is-dragging-card');
+      var startX = e.clientX, startY = e.clientY;
+      var startL = parseInt(card.style.left,10) || 0;
+      var startT = parseInt(card.style.top,10)  || 0;
+      function onMove(ev){
+        var nx = Math.max(0, Math.min(window.innerWidth - 100, startL + ev.clientX - startX));
+        var ny = Math.max(0, Math.min(window.innerHeight - 50, startT + ev.clientY - startY));
+        card.style.left = nx + 'px';
+        card.style.top = ny + 'px';
+      }
+      function onUp(){
+        card.classList.remove('dragging');
+        document.body.classList.remove('is-dragging-card');
+        document.removeEventListener('mousemove', onMove);
+        document.removeEventListener('mouseup', onUp);
+        saveLayout();
+      }
+      document.addEventListener('mousemove', onMove);
+      document.addEventListener('mouseup', onUp);
+    });
+  });
+
+  /* ===== Resize (drag bottom-right corner) ===== */
+  document.querySelectorAll('[data-resize]').forEach(function(handle){
+    handle.addEventListener('mousedown', function(e){
+      if (e.button !== 0) return;
+      var card = handle.closest('.pcard');
+      if (!card) return;
+      e.preventDefault(); e.stopPropagation();
+      card.classList.add('resizing');
+      document.body.classList.add('is-dragging-card');
+      var startX = e.clientX, startY = e.clientY;
+      var startW = parseInt(card.style.width,10)  || card.offsetWidth;
+      var startH = parseInt(card.style.height,10) || card.offsetHeight;
+      function onMove(ev){
+        var nw = Math.max(180, startW + ev.clientX - startX);
+        var nh = Math.max(100, startH + ev.clientY - startY);
+        card.style.width = nw + 'px';
+        card.style.height = nh + 'px';
+        if (card.querySelector('iframe')) rescaleIframe(card.querySelector('iframe'));
+      }
+      function onUp(){
+        card.classList.remove('resizing');
+        document.body.classList.remove('is-dragging-card');
+        document.removeEventListener('mousemove', onMove);
+        document.removeEventListener('mouseup', onUp);
+        rescaleAll();
+        saveLayout();
+      }
+      document.addEventListener('mousemove', onMove);
+      document.addEventListener('mouseup', onUp);
+    });
+  });
+
+  /* ===== Preview iframe ready tracking =====
+   * Each iframe loads the deck ONCE with ?preview=N (1-based slide number) on init. Subsequent
+   * slide changes are sent via postMessage('preview-goto') so the iframe
+   * just toggles visibility of a different .slide — no reload, no flicker.
+   */
+  var iframeReady = { cur: false, nxt: false };
+  var currentTheme = ${themeJSON};
+  window.addEventListener('message', function(e) {
+    if (!e.data || e.data.type !== 'preview-ready') return;
+    var iframe = null;
+    if (e.source === iframeCur.contentWindow) {
+      iframeReady.cur = true;
+      iframe = iframeCur;
+      postPreviewGoto(iframeCur, idx);
+    } else if (e.source === iframeNxt.contentWindow) {
+      iframeReady.nxt = true;
+      iframe = iframeNxt;
+      postPreviewGoto(iframeNxt, idx + 1 < total ? idx + 1 : idx);
+    }
+    /* Sync current theme to the iframe */
+    if (iframe && currentTheme) {
+      try { iframe.contentWindow.postMessage({ type: 'preview-theme', name: currentTheme }, '*'); } catch(err) {}
+    }
+    if (iframe) rescaleIframe(iframe);
+  });
+
+  function postPreviewGoto(iframe, n) {
+    try {
+      iframe.contentWindow.postMessage({ type: 'preview-goto', idx: n }, '*');
+    } catch(e) {}
+  }
+
+  /* ===== Update content =====
+   * Smooth (no-reload) navigation: send postMessage to iframes instead of
+   * resetting src. Iframes stay loaded, just switch visible .slide.
+   */
+  function update(n) {
+    n = Math.max(0, Math.min(total - 1, n));
+    idx = n;
+
+    /* Current preview — postMessage (smooth) */
+    if (iframeReady.cur) postPreviewGoto(iframeCur, n);
+    curMeta.textContent = (n + 1) + '/' + total;
+
+    /* Next preview */
+    if (n + 1 < total) {
+      iframeNxt.style.display = '';
+      var endEl = document.querySelector('#card-nxt .preview-end');
+      if (endEl) endEl.remove();
+      if (iframeReady.nxt) postPreviewGoto(iframeNxt, n + 1);
+      nxtMeta.textContent = (n + 2) + '/' + total;
+    } else {
+      iframeNxt.style.display = 'none';
+      var body = document.querySelector('#card-nxt .pcard-body');
+      if (body && !body.querySelector('.preview-end')) {
+        var end = document.createElement('div');
+        end.className = 'preview-end';
+        end.textContent = '— END OF DECK —';
+        body.appendChild(end);
+      }
+      nxtMeta.textContent = 'END';
+    }
+
+    /* Notes */
+    var note = slideMeta[n].notes;
+    notesBody.innerHTML = note || '<span class="empty">(No speaker script for this slide yet)</span>';
+
+    /* Timer count */
+    timerCount.textContent = (n + 1) + ' / ' + total;
+  }
+
+  /* ===== Timer ===== */
+  var tStart = Date.now();
+  setInterval(function(){
+    var s = Math.floor((Date.now() - tStart) / 1000);
+    var mm = String(Math.floor(s/60)).padStart(2,'0');
+    var ss = String(s%60).padStart(2,'0');
+    timerDisplay.textContent = mm + ':' + ss;
+  }, 1000);
+  function resetTimer(){ tStart = Date.now(); timerDisplay.textContent = '00:00'; }
+
+  /* ===== BroadcastChannel sync ===== */
+  if (bc) {
+    bc.onmessage = function(e){
+      if (!e.data) return;
+      if (e.data.type === 'go') update(e.data.idx);
+      else if (e.data.type === 'theme' && e.data.name) {
+        currentTheme = e.data.name;
+        /* Forward theme change to preview iframes */
+        [iframeCur, iframeNxt].forEach(function(iframe){
+          try {
+            iframe.contentWindow.postMessage({ type: 'preview-theme', name: e.data.name }, '*');
+          } catch(err) {}
+        });
+      }
+    };
+  }
+  function go(n) {
+    update(n);
+    if (bc) bc.postMessage({ type: 'go', idx: idx });
+  }
+
+  /* ===== Buttons ===== */
+  document.getElementById('btn-prev').addEventListener('click', function(){ go(idx - 1); });
+  document.getElementById('btn-next').addEventListener('click', function(){ go(idx + 1); });
+  document.getElementById('btn-reset').addEventListener('click', resetTimer);
+  document.getElementById('reset-layout').addEventListener('click', function(){
+    if (confirm('Restore the default card layout?')) {
+      try { localStorage.removeItem(STORAGE_KEY); } catch(e){}
+      applyLayout(defaultLayout());
+    }
+  });
+
+  /* ===== Keyboard ===== */
+  document.addEventListener('keydown', function(e){
+    if (e.metaKey || e.ctrlKey || e.altKey) return;
+    switch(e.key) {
+      case 'ArrowRight': case ' ': case 'PageDown': go(idx + 1); e.preventDefault(); break;
+      case 'ArrowLeft':  case 'PageUp':   go(idx - 1); e.preventDefault(); break;
+      case 'Home': go(0); break;
+      case 'End':  go(total - 1); break;
+      case 'r': case 'R': resetTimer(); break;
+      case 'Escape': window.close(); break;
+    }
+  });
+
+  /* ===== Iframe load → rescale (catches initial size) ===== */
+  iframeCur.addEventListener('load', function(){ rescaleIframe(iframeCur); });
+  iframeNxt.addEventListener('load', function(){ rescaleIframe(iframeNxt); });
+
+  /* ===== Init =====
+   * Load each iframe ONCE with the deck file. After they post
+   * 'preview-ready', all subsequent navigation is via postMessage
+   * (smooth, no reload, no flicker).
+   */
+  applyLayout(readLayout());
+  iframeCur.src = deckUrl + '?preview=' + (idx + 1);
+  if (idx + 1 < total) iframeNxt.src = deckUrl + '?preview=' + (idx + 2);
+  /* Initialize notes/timer/count without touching iframes */
+  notesBody.innerHTML = slideMeta[idx].notes || '<span class="empty">(No speaker script for this slide yet)</span>';
+  curMeta.textContent = (idx + 1) + '/' + total;
+  nxtMeta.textContent = (idx + 1 < total) ? (idx + 2) + '/' + total : 'END';
+  timerCount.textContent = (idx + 1) + ' / ' + total;
+})();
+</` + `script>
+</body></html>`;
+    }
+
+    function fullscreen(){
+      const el = document.documentElement;
+      if (!document.fullscreenElement) el.requestFullscreen&&el.requestFullscreen();
+      else document.exitFullscreen&&document.exitFullscreen();
+    }
+
+    // theme cycling
+    const root = document.documentElement;
+    const themesAttr = root.getAttribute('data-themes') || document.body.getAttribute('data-themes');
+    const themes = themesAttr ? themesAttr.split(',').map(s=>s.trim()).filter(Boolean) : [];
+    let themeIdx = 0;
+
+    // Auto-detect theme base path from existing <link id="theme-link">
+    let themeBase = root.getAttribute('data-theme-base');
+    if (!themeBase) {
+      const existingLink = document.getElementById('theme-link');
+      if (existingLink) {
+        // getAttribute('href') returns the raw path as written in the HTML (unlike .href, which resolves to an absolute URL)
+        const rawHref = existingLink.getAttribute('href') || '';
+        const lastSlash = rawHref.lastIndexOf('/');
+        themeBase = lastSlash >= 0 ? rawHref.substring(0, lastSlash + 1) : 'assets/themes/';
+      } else {
+        themeBase = 'assets/themes/';
+      }
+    }
+
+    function applyTheme(name) {
+      let link = document.getElementById('theme-link');
+      if (!link) {
+        link = document.createElement('link');
+        link.rel = 'stylesheet';
+        link.id = 'theme-link';
+        document.head.appendChild(link);
+      }
+      link.href = themeBase + name + '.css';
+      root.setAttribute('data-theme', name);
+      const ind = document.querySelector('.theme-indicator');
+      if (ind) ind.textContent = name;
+    }
+    function cycleTheme(fromRemote){
+      if (!themes.length) return;
+      themeIdx = (themeIdx+1) % themes.length;
+      const name = themes[themeIdx];
+      applyTheme(name);
+      /* Broadcast to other window (audience ↔ presenter) */
+      if (!fromRemote && bc) bc.postMessage({ type: 'theme', name: name });
+    }
+
+    // animation cycling on current slide
+    let animIdx = 0;
+    function cycleAnim(){
+      animIdx = (animIdx+1) % ANIMS.length;
+      const a = ANIMS[animIdx];
+      const target = slides[idx].querySelector('[data-anim-target]') || slides[idx];
+      ANIMS.forEach(x => target.classList.remove('anim-'+x));
+      void target.offsetWidth;
+      target.classList.add('anim-'+a);
+      target.setAttribute('data-anim', a);
+      const ind = document.querySelector('.anim-indicator');
+      if (ind) ind.textContent = a;
+    }
+
+    document.addEventListener('keydown', function (e) {
+      if (e.metaKey||e.ctrlKey||e.altKey) return;
+      switch (e.key) {
+        case 'ArrowRight': case ' ': case 'PageDown': case 'Enter': go(idx+1); e.preventDefault(); break;
+        case 'ArrowLeft': case 'PageUp': case 'Backspace': go(idx-1); e.preventDefault(); break;
+        case 'Home': go(0); break;
+        case 'End': go(total-1); break;
+        case 'f': case 'F': fullscreen(); break;
+        case 's': case 'S': openPresenterWindow(); break;
+        case 'n': case 'N': toggleNotes(); break;
+        case 'o': case 'O': toggleOverview(); break;
+        case 't': case 'T': cycleTheme(); break;
+        case 'a': case 'A': cycleAnim(); break;
+        case 'Escape': toggleOverview(false); toggleNotes(false); break;
+      }
+    });
+
+    // hash deep-link
+    function fromHash(){
+      const m = /^#\/(\d+)/.exec(location.hash||'');
+      if (m) go(Math.max(0, parseInt(m[1],10)-1));
+    }
+    window.addEventListener('hashchange', fromHash);
+    fromHash();
+    go(idx);
+  });
+})();
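Note on `rescaleIframe()` in the presenter view above: the fit-and-center arithmetic can be checked in isolation. The sketch below is a hypothetical helper (`fitSlide` is not part of `runtime.js`) that reproduces the same math as a pure function. It assumes the 1920x1080 design size used in the diff, and that the iframe is laid out at that size with a top-left transform origin, which the CSS outside this excerpt would need to provide.

```javascript
// Hypothetical helper, not part of runtime.js: the same fit-and-center math
// as rescaleIframe(), extracted as a pure function so it can be tested.
const SLIDE_W = 1920; // design width used by rescaleIframe()
const SLIDE_H = 1080; // design height used by rescaleIframe()

function fitSlide(containerW, containerH) {
  // Hidden or zero-sized card body: nothing to compute.
  if (!containerW || !containerH) return null;
  // Take the smaller ratio so the slide fits on both axes.
  const scale = Math.min(containerW / SLIDE_W, containerH / SLIDE_H);
  // Center the scaled slide; clamp at 0 so it never starts off-card.
  const left = Math.max(0, (containerW - SLIDE_W * scale) / 2);
  const top = Math.max(0, (containerH - SLIDE_H * scale) / 2);
  return { scale, left, top };
}

console.log(fitSlide(960, 540));  // { scale: 0.5, left: 0, top: 0 }
console.log(fitSlide(960, 1080)); // { scale: 0.5, left: 0, top: 270 }
```

Keeping the math separate from the DOM writes is one way to unit-test card resizing without a browser; the real function would still assign `transform`, `left`, and `top` on the iframe.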
--- a/docs/presentations/skill-replay-eval/index.html
+++ b/docs/presentations/skill-replay-eval/index.html
@ -0,0 +1,287 @@
+<!DOCTYPE html>
+<html lang="zh-CN" class="replay-root">
+<head>
+  <meta charset="utf-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1">
+  <title>Beaver Agent Sandbox · Customer Solution Overview</title>
+  <link rel="stylesheet" href="assets/fonts.css">
+  <link rel="stylesheet" href="assets/base.css">
+  <link rel="stylesheet" href="assets/animations/animations.css">
+  <link rel="stylesheet" href="style.css">
+</head>
+<body class="tpl-beaver-replay">
+<div class="deck">
+
+  <section class="slide" data-title="Cover">
+    <p class="kicker">Beaver Agent Sandbox</p>
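+    <h1 class="h1">Enterprise Agent Sandbox<br><span class="gradient-text">From AI Chat to Deliverable Tasks</span></h1>
+    <p class="lede">An AI Agent workbench for enterprises that can be privately deployed, traced, accepted, and reused, so agents become part of real business workflows instead of staying in a chat window.</p>
+    <div class="speaker">
+      <div class="av"></div>
+      <div><b>Beaver Customer Solution Overview</b><span>Product demo · Business value · Rollout path</span></div>
+    </div>
+    <div class="deck-footer"><span>Agent Sandbox for enterprise teams</span><span class="slide-number" data-current="1" data-total="13"></span></div>
+    <aside class="notes">
+      Don't open with technology. Start with a positioning the customer can understand: Beaver is an enterprise agent sandbox. Its value is not “yet another chatbot”; it moves AI from Q&amp;A to task execution, process tracing, result acceptance, and experience reuse. What customers care about most is whether it can land in real work, whether risk can be controlled, and whether successful experience can become an organizational asset.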
+    </aside>
+  </section>
+
+  <section class="slide" data-title="Customer Problem">
+    <p class="kicker">why now</p>
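+    <h2 class="h2">Most enterprise AI pilots get stuck at “can chat”; getting to “can deliver” still takes an operating-system layer.</h2>
+    <div class="grid g3 mt-l">
+      <div class="card card-accent"><h4>Results cannot be accepted</h4><p class="dim">Model answers look plausible, but task status, a revision loop, artifact management, and user confirmation are missing.</p><span class="tag bad mt-s">chat-only</span></div>
+      <div class="card card-accent"><h4>Process cannot be audited</h4><p class="dim">Which tools were used, which files were changed, and what the work was based on often lack a clear evidence chain.</p><span class="tag warn mt-s">black box</span></div>
+      <div class="card card-accent"><h4>Experience cannot be reused</h4><p class="dim">A successful delivery is not captured as a team method, so the next one still depends on manual prompting and on-the-spot judgment.</p><span class="tag mt-s">one-off</span></div>
+    </div>
+    <div class="panel mt-l">
+      <span class="tag good">Where Beaver comes in</span>
+      <p class="lede mt-s">Bring the tasks, tools, files, memory, skills, acceptance, and multi-instance deployment that agents need into one controllable sandbox.</p>
+    </div>
+    <div class="deck-footer"><span>customer pain: execution, control, reuse</span><span class="slide-number" data-current="2" data-total="13"></span></div>
+    <aside class="notes">
+      This slide states the pain points in the customer's language. Enterprises do not lack models, nor the ability to hook up a chat entry point. The real questions are: who confirms the answer afterward? Can the execution process be traced? Do file and tool calls have boundaries? Can successful experience become a method that is invoked automatically next time? Beaver's value is to fill in this “agent operating system” layer.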
+    </aside>
+  </section>
+
+  <section class="slide" data-title="Product Positioning">
+    <p class="kicker">positioning</p>
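+    <h2 class="h2">Beaver's positioning: an execution and governance platform for enterprise AI Agents.</h2>
+    <div class="flow mt-l">
+      <div class="flow-step"><span class="n">01</span><h4>Identify</h4><p>Determine whether the user is having an ordinary conversation or assigning a task that needs sustained follow-through.</p></div>
+      <div class="flow-step"><span class="n">02</span><h4>Execute</h4><p>Select models, skills, and tools per task; work with files, search, the terminal, or external connectors.</p></div>
+      <div class="flow-step"><span class="n">03</span><h4>Trace</h4><p>Record the process, tool calls, subtasks, artifacts, notifications, and execution evidence.</p></div>
+      <div class="flow-step"><span class="n">04</span><h4>Accept</h4><p>Support accept, revise, and abandon, so user feedback closes the quality loop.</p></div>
+      <div class="flow-step card-accent"><span class="n">05</span><h4>Capture</h4><p>Turn approved working methods into skills and long-term memory, building organizational assets.</p></div>
+    </div>
+    <div class="metric-grid mt-l">
+      <div class="metric"><span>deployment</span><b>Multi-instance</b><p class="dim">Each user or team can have its own app-instance.</p></div>
+      <div class="metric"><span>workspace</span><b>Sandbox</b><p class="dim">Files, configuration, and runtime data are managed within the instance boundary.</p></div>
+      <div class="metric"><span>control</span><b>Acceptance</b><p class="dim">User approval of AI output is the signal that closes the loop.</p></div>
+      <div class="metric"><span>growth</span><b>Skill library</b><p class="dim">Experience from successful tasks can be reused continuously.</p></div>
+    </div>
+    <div class="deck-footer"><span>not a chatbot, an agent execution layer</span><span class="slide-number" data-current="3" data-total="13"></span></div>
+    <aside class="notes">
+      Make the product positioning clear here: Beaver is not just a chat front end, nor a model proxy. It is an agent execution and governance layer. When a user's request comes in, the system can decide whether to enter task mode, then execute, trace, and accept, and capture successful experience as long-term capability. This is the core value the customer is buying.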
+    </aside>
+  </section>
+
+  <section class="slide" data-title="Product Modules">
+    <p class="kicker">product modules</p>
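+    <h2 class="h2">One complete workbench: from daily collaboration to tool governance, all in the same interface.</h2>
+    <div class="roadmap mt-l">
+      <div class="item"><span>01</span><b>Chat workbench</b><p>Sessions, attachments, the Agent run process, current task progress, and acceptance actions.</p></div>
+      <div class="item"><span>02</span><b>Task center</b><p>Regular tasks, scheduled tasks, task details, timelines, and result acceptance.</p></div>
+      <div class="item"><span>03</span><b>File space</b><p>Upload, directory management, Markdown/text/image preview, download, and delete.</p></div>
+      <div class="item"><span>04</span><b>Skills and marketplace</b><p>Enterprise skill library, draft review, release gates, and skill installation.</p></div>
+    </div>
+    <div class="roadmap mt-m">
+      <div class="item"><span>05</span><b>Tool management</b><p>MCP tool configuration, tool details, testing, editing, and deletion.</p></div>
+      <div class="item"><span>06</span><b>Notifications and scheduling</b><p>Recurring tasks, proactive reminders, run records, and follow-up changes.</p></div>
+      <div class="item"><span>07</span><b>Connectors</b><p>Integration with external systems such as Outlook, covering email, calendar, and business entry points.</p></div>
+      <div class="item"><span>08</span><b>Configuration center</b><p>Model providers, Agent profiles, channels, system status, and runtime parameters.</p></div>
+    </div>
+    <div class="deck-footer"><span>customer-facing workspace, admin-facing control</span><span class="slide-number" data-current="4" data-total="13"></span></div>
+    <aside class="notes">
+      This slide is good for showing the customer the product scope. Beaver is not a point tool but a full workbench. Regular users get chat, tasks, files, and notifications. Administrators and implementation teams get skills, tools, connectors, model configuration, and system status. This lets the customer see that it is not a demo but a product that can carry real usage workflows.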
+    </aside>
+  </section>
+
+  <section class="slide" data-title="Business Scenarios">
+    <p class="kicker">use cases</p>
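+    <h2 class="h2">Typical customer scenarios: high-frequency, cross-tool knowledge work that needs an audit trail.</h2>
+    <div class="grid g3 mt-l">
+      <div class="card card-accent"><h4>Project delivery assistant</h4><p class="dim">Clarify requirements, break down tasks, draft proposals, track revision feedback, and capture the delivery process as reusable templates.</p></div>
+      <div class="card card-accent"><h4>Operations and weekly report automation</h4><p class="dim">Trigger data consolidation, status summaries, risk alerts, and notification pushes on a schedule, reducing repetitive manual follow-up.</p></div>
+      <div class="card card-accent"><h4>Enterprise knowledge and file processing</h4><p class="dim">Organize, summarize, review, and generate artifacts around workspace files, past tasks, and business knowledge.</p></div>
+      <div class="card card-accent"><h4>Engineering and technical support</h4><p class="dim">Analyze code, run commands, read logs, and record evidence, giving engineering teams traceable collaboration.</p></div>
+      <div class="card card-accent"><h4>Sales and customer success</h4><p class="dim">Capture customer context, prepare communication materials, track to-dos, and work with connectors such as email and calendar.</p></div>
+      <div class="card card-accent"><h4>Internal AI capability platform</h4><p class="dim">Let different teams share security boundaries, tool management, a skill marketplace, and a multi-provider model strategy.</p></div>
+    </div>
+    <div class="deck-footer"><span>best fit: repeatable workflows with review requirements</span><span class="slide-number" data-current="5" data-total="13"></span></div>
+    <aside class="notes">
+      Don't focus on a single industry here; talk about the types of tasks Beaver fits: high-frequency, cross-tool, needing an audit trail, with results that need acceptance. Customers will naturally map this to their own scenarios, such as project management, operations reporting, technical support, knowledge base maintenance, customer success, and internal AI platforms. The key is to let the customer see that Beaver can enter real workflows.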
+    </aside>
+  </section>
+
+  <section class="slide" data-title="Comparison">
+    <p class="kicker">competitive edge</p>
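+    <h2 class="h2">Advantage comparison: Beaver fills the gap between chat, RPA, and general-purpose Agent frameworks.</h2>
+    <div class="matrix mt-l">
+      <div class="head">Capability</div><div class="head">Ordinary AI chat</div><div class="head">Traditional automation/RPA</div><div class="head">Beaver Agent Sandbox</div>
+      <div class="rowhead">Task lifecycle</div><div>Message-centric</div><div>Fixed-process-centric</div><div><span class="tag good">Closed loop: identify, execute, accept, reuse</span></div>
+      <div class="rowhead">Tools and files</div><div>Usually only generates suggestions</div><div>Can execute, but processes are rigid</div><div><span class="tag good">Skills guide tool calls; the process leaves a trail</span></div>
+      <div class="rowhead">User control</div><div>No explicit delivery confirmation</div><div>Changing processes is costly</div><div><span class="tag good">Accept, revise, and abandon feed into task status</span></div>
+      <div class="rowhead">Experience capture</div><div>Relies on chat history</div><div>Relies on manually maintained scripts</div><div><span class="tag good">Successful tasks become skills and long-term memory</span></div>
+      <div class="rowhead">Deployment boundary</div><div>Mostly SaaS</div><div>Complex in-house integration</div><div><span class="tag good">Docker multi-instance sandbox, suited to private deployment</span></div>
+    </div>
+    <div class="deck-footer"><span>differentiation: task closure + evidence + reusable skills</span><span class="slide-number" data-current="6" data-total="13"></span></div>
+    <aside class="notes">
+      This slide answers a question customers care about: “why not an existing solution?” Ordinary chat tools are good at generating content but lack task closure and governance. RPA can execute, but processes are usually fixed and maintenance is costly. General-purpose Agent frameworks suit developers building systems, but customers also need a complete workbench, acceptance, and management interfaces. Beaver's differentiation is packaging execution, evidence, acceptance, and experience capture into one product.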
+    </aside>
+  </section>
+
+  <section class="slide" data-title="Architecture For Customers">
+    <p class="kicker">deployment model</p>
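+    <h2 class="h2">A deployment model customers can understand: one entry point, isolated instances, clear data boundaries.</h2>
+    <div class="flow mt-l">
+      <div class="flow-step"><span class="n">01</span><h4>Auth portal</h4><p>Users register, log in, and enter the model configuration walkthrough.</p></div>
+      <div class="flow-step"><span class="n">02</span><h4>Authorization service</h4><p>Manages accounts, internal identities, and backend registration.</p></div>
+      <div class="flow-step"><span class="n">03</span><h4>Deploy control</h4><p>Creates a dedicated app-instance container for each user.</p></div>
+      <div class="flow-step"><span class="n">04</span><h4>Unified proxy</h4><p>Routes traffic to the matching container by instance domain.</p></div>
+      <div class="flow-step card-accent"><span class="n">05</span><h4>User instance</h4><p>Frontend, backend, workspace, files, skills, and configuration run inside the instance.</p></div>
+    </div>
+    <div class="grid g3 mt-l">
+      <div class="card card-accent"><h4>Private-deployment friendly</h4><p class="dim">The minimal deployment is based on Linux/WSL2 + Docker and can run in the enterprise's own environment or on cloud hosts.</p></div>
+      <div class="card card-accent"><h4>Instance-level isolation</h4><p class="dim">Each app-instance has its own workspace, configuration, and runtime data boundary.</p></div>
+      <div class="card card-accent"><h4>Provider flexibility</h4><p class="dim">The model provider is configurable, supporting later cost, speed, and quality strategies.</p></div>
+    </div>
+    <div class="deck-footer"><span>deployment: auth portal + deploy control + routed app instances</span><span class="slide-number" data-current="7" data-total="13"></span></div>
+    <aside class="notes">
+      A customer presentation doesn't need every code detail, but it should show that the architecture is trustworthy. Beaver's multi-instance model works like this: users enter through the auth portal, the authorization service and deploy control create a dedicated instance, and the routing proxy distributes traffic by domain. Each user instance has its own frontend, backend, workspace, skills, and configuration. Customers can see this is a bounded sandbox, not everyone mixed together in one shared session.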
+    </aside>
+  </section>
+
+  <section class="slide" data-title="Trust And Governance">
+    <p class="kicker">trust and control</p>
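+    <h2 class="h2">Enterprises need more than intelligence; they need control, explainability, and governance.</h2>
+    <div class="metric-grid mt-l">
+      <div class="metric"><span>trace</span><b>Evidence chain</b><p class="dim">Tasks, tool calls, artifacts, and results go into the timeline.</p></div>
+      <div class="metric"><span>review</span><b>Acceptance</b><p class="dim">Users can accept, request changes, or abandon a task.</p></div>
+      <div class="metric"><span>boundary</span><b>Sandbox</b><p class="dim">Files and configuration are managed within the instance boundary.</p></div>
+      <div class="metric"><span>admin</span><b>Tool governance</b><p class="dim">MCP tools can be tested, edited, enabled or disabled, and reviewed.</p></div>
+    </div>
+    <div class="split mt-l">
+      <div class="card card-accent">
+        <h3>For business owners</h3>
+        <ul class="clean mt-m">
+          <li>Every AI task has a status and artifacts.</li>
+          <li>Results are not assumed correct; they need user confirmation.</li>
+          <li>Successful experience can be captured as reusable team capability.</li>
+        </ul>
+      </div>
+      <div class="card card-accent">
+        <h3>For IT / security teams</h3>
+        <ul class="clean mt-m">
+          <li>The deployment control plane is not exposed directly to the public internet.</li>
+          <li>Instances have their own workspace and configuration boundaries.</li>
+          <li>Tools, models, and connectors can be onboarded step by step under enterprise policy.</li>
+        </ul>
+      </div>
+    </div>
+    <div class="deck-footer"><span>governance: evidence, acceptance, isolation, admin controls</span><span class="slide-number" data-current="8" data-total="13"></span></div>
+    <aside class="notes">
+      This slide stresses the risk questions enterprise buyers care about most. Business owners care whether it can deliver; IT and security teams care whether it can be controlled. Beaver's answer: tasks have an evidence chain, results go through acceptance, instances have boundaries, and tools and connectors can be governed. Customers will see it not as an uncontrolled AI black box but as a system that fits into enterprise management.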
+    </aside>
+  </section>
+
+  <section class="slide" data-title="Learning Loop">
+    <p class="kicker">learning moat</p>
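+    <h2 class="h2">Long-term advantage: it gets better with use, turning enterprise experience into agent assets.</h2>
+    <div class="pipeline mt-l">
+      <div class="phase card-accent"><span class="tag">memory</span><h3>Long-term memory</h3><p class="dim">Captures user preferences, organizational knowledge, past tasks, file artifacts, and tool experience.</p></div>
+      <div class="phase card-accent"><span class="tag">skills</span><h3>Skill library</h3><p class="dim">Moves approved task methods through skill candidate, draft, review, and release.</p></div>
+      <div class="phase card-accent"><span class="tag">marketplace</span><h3>Marketplace and distribution</h3><p class="dim">Lets teams install, reuse, and manage verified skills and tool capabilities.</p></div>
+    </div>
+    <div class="panel mt-l">
+      <span class="tag good">Customer value</span>
+      <p class="lede mt-s">The first delivery relies on human guidance; from the second on, skills and memory are reused, and over time the enterprise builds its own library of AI working methods.</p>
+    </div>
+    <div class="deck-footer"><span>compounding advantage: accepted work becomes reusable capability</span><span class="slide-number" data-current="9" data-total="13"></span></div>
+    <aside class="notes">
+      Customers will ask: what is this system's long-term moat? The answer is the learning loop. Ordinary tools start from scratch every time, while Beaver captures user-approved task experience as skills and stable information as memory. Later, similar tasks can automatically activate existing methods. The enterprise's AI capability then grows with use instead of staying forever at the generic model layer.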
+    </aside>
+  </section>
+
+  <section class="slide" data-title="Readiness">
+    <p class="kicker">current readiness</p>
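+    <h2 class="h2">The capabilities that can be demonstrated today are enough to support a customer pilot.</h2>
+    <div class="matrix mt-l">
+      <div class="head">Capability</div><div class="head">Current status</div><div class="head">What the customer sees</div><div class="head">Business value</div>
+      <div class="rowhead">Task execution loop</div><div><span class="tag good">Done</span></div><div>Task list, details, timeline, acceptance actions</div><div>From answers to deliverable results</div>
+      <div class="rowhead">Tools and evidence</div><div><span class="tag good">Available</span></div><div>Tool call records for files, terminal, web, skills, scheduled tasks, and more</div><div>Auditable and reviewable</div>
+      <div class="rowhead">Multi-agent collaboration</div><div><span class="tag good">Available</span></div><div>Complex task decomposition and subtask result aggregation</div><div>Handles multi-stage complex work</div>
+      <div class="rowhead">Skill capture</div><div><span class="tag good">Available</span></div><div>Candidate, draft, review, and release pipeline</div><div>Builds an enterprise skill library</div>
+      <div class="rowhead">Long-term memory</div><div><span class="tag warn">Foundation done; product integration pending</span></div><div>Memory console and retrieval traces to be shown later</div><div>Understands the business better with use</div>
+    </div>
+    <div class="deck-footer"><span>pilot-ready modules plus roadmap for memory/productization</span><span class="slide-number" data-current="10" data-total="13"></span></div>
+    <aside class="notes">
+      On this slide, avoid overpromising while telling the customer a pilot is possible. The task loop, tool calls, evidence retention, multi-agent collaboration, and skill capture are ready to demonstrate. The underlying long-term memory capability is complete but still needs to be wired into the main product flow and UI, so present it to the customer as the focus of the next phase. This shows strength while keeping claims credible.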
+    </aside>
+  </section>
+
+  <section class="slide" data-title="Business Value">
+    <p class="kicker">business value</p>
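+    <h2 class="h2">Customer benefits: faster delivery, lower risk, stronger reuse.</h2>
+    <div class="metric-grid mt-l">
+      <div class="metric"><span>speed</span><b>Faster delivery</b><p class="dim">Turns multi-step knowledge work from manual hand-offs into AI-assisted execution.</p></div>
+      <div class="metric"><span>quality</span><b>Transparent process</b><p class="dim">Task timelines and evidence chains reduce black-box risk.</p></div>
+      <div class="metric"><span>reuse</span><b>Experience reuse</b><p class="dim">Skills and memory spare teams repeated prompting and repeated trial and error.</p></div>
+      <div class="metric"><span>control</span><b>Controllable cost</b><p class="dim">Configurable model providers lay the groundwork for later quality/cost routing.</p></div>
+    </div>
+    <div class="split mt-l">
+      <div class="card card-accent">
+        <h3>Departments suited to pilot first</h3>
+        <ul class="clean mt-m">
+          <li>Project teams that frequently produce and revise deliverables.</li>
+          <li>Operations teams that repeatedly process files, reports, and knowledge materials.</li>
+          <li>Technical teams that need to audit tool calls and task evidence.</li>
+        </ul>
+      </div>
+      <div class="card card-accent">
+        <h3>Suggested success metrics</h3>
+        <ul class="clean mt-m">
+          <li>Task delivery time goes down.</li>
+          <li>The share of repetitive work that is templated goes up.</li>
+          <li>The number of manual revision rounds goes down.</li>
+          <li>Coverage of traceable task reports goes up.</li>
+        </ul>
+      </div>
+    </div>
+    <div class="deck-footer"><span>value: speed, governance, reuse, model flexibility</span><span class="slide-number" data-current="11" data-total="13"></span></div>
+    <aside class="notes">
+      Make the business value concrete here. Don't just say efficiency improves; break it into measurable metrics: task delivery time, share of repetitive work that is templated, revision rounds, and coverage of traceable reports. A customer running a pilot also needs these metrics to judge success. Beaver's core benefits are faster delivery, lower risk, and stronger reuse.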
+    </aside>
+  </section>
+
+  <section class="slide" data-title="Pilot Plan">
+    <p class="kicker">pilot plan</p>
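+    <h2 class="h2">Recommended rollout: pick a high-value scenario first, then complete the customer pilot in 4 steps.</h2>
+    <div class="roadmap mt-l">
+      <div class="item"><span>01</span><b>Scenario selection</b><p>Choose a high-frequency, cross-tool departmental process that needs acceptance, such as weekly reports, proposal delivery, or file processing.</p></div>
+      <div class="item"><span>02</span><b>Private deployment</b><p>Deploy Beaver in the customer environment and configure the model provider, user entry point, and instance domains.</p></div>
+      <div class="item"><span>03</span><b>Tool integration</b><p>Connect files, search, email and calendar, MCP tools, or internal enterprise systems.</p></div>
+      <div class="item"><span>04</span><b>Skill capture</b><p>Turn the successful pilot process into skills and build reusable enterprise Agent templates.</p></div>
+    </div>
+    <div class="panel mt-l">
+      <span class="tag warn">Recommended pace</span>
+      <p class="lede mt-s">Phase 1: run a 2-4 week pilot to validate one departmental process. Phase 2: expand connectors, permission policies, and the skill marketplace. Phase 3: integrate long-term memory management.</p>
+    </div>
+    <div class="deck-footer"><span>pilot path: scenario, deploy, integrate, reuse</span><span class="slide-number" data-current="12" data-total="13"></span></div>
+    <aside class="notes">
+      A customer proposal should offer a rollout path. Don't roll out company-wide at the start; pick one high-value process and pilot it for 2 to 4 weeks. First deploy the system and models, connect the necessary tools, then capture the successful process as skills. After the pilot succeeds, expand connectors, permission policies, the marketplace, and long-term memory management. This way the customer knows what to do next.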
+    </aside>
+  </section>
+
+  <section class="slide center tc" data-title="Closing">
+    <div>
+      <div class="center-mark">B</div>
+      <h2 class="h2 mt-m">Beaver Agent Sandbox</h2>
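+      <p class="lede" style="margin-left:auto;margin-right:auto;">An Agent workbench that upgrades enterprise AI from "can answer" to "can execute, be accepted, be traced, and be reused".</p>
+      <div class="row mt-l" style="justify-content:center">
+        <span class="tag good">Closed-loop tasks</span>
+        <span class="tag">Process evidence</span>
+        <span class="tag warn">Private sandbox</span>
+        <span class="tag">Skill capture</span>
+      </div>
+    </div>
+    <div class="deck-footer"><span>Commercial proposal deck</span><span class="slide-number" data-current="13" data-total="13"></span></div>
+    <aside class="notes">
+      The last slide wraps up. Repeat the one-line message: Beaver takes enterprise AI beyond answering and into executable tasks, acceptable results, traceable evidence, and reusable experience. Then move into the customer discussion: which scenario they most want to pilot first, which models and tools they already have, and what constraints their deployment environment imposes.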
+    </aside>
+  </section>
+
+</div>
+<script src="assets/runtime.js"></script>
+</body>
+</html>
--- a/docs/presentations/skill-replay-eval/style.css
+++ b/docs/presentations/skill-replay-eval/style.css
@ -0,0 +1,511 @@
+/* Beaver Skill Replay Eval deck, based on html-ppt tech-sharing template. */
+.replay-root {
+  background: #08111d;
+}
+
+.tpl-beaver-replay {
+  --bg: #08111d;
+  --bg-soft: #0d1726;
+  --surface: #101b2c;
+  --surface-2: #132235;
+  --border: rgba(147, 197, 253, .18);
+  --border-strong: rgba(147, 197, 253, .34);
+  --text-1: #eef6ff;
+  --text-2: #a9bfd7;
+  --text-3: #6f879f;
+  --accent: #64e3a1;
+  --accent-2: #7cc7ff;
+  --accent-3: #d9a6ff;
+  --good: #64e3a1;
+  --warn: #ffd166;
+  --bad: #ff7b7b;
+  --grad: linear-gradient(120deg, #64e3a1 0%, #7cc7ff 52%, #d9a6ff 100%);
+  --radius: 8px;
+  --radius-sm: 6px;
+  --radius-lg: 8px;
+  --shadow: 0 20px 60px rgba(0, 0, 0, .38);
+  --letter-tight: 0;
+  --letter-normal: 0;
+  font-family: "Inter", "Noto Sans SC", sans-serif;
+  background: var(--bg);
+  color: var(--text-1);
+}
+
+.tpl-beaver-replay .slide {
+  padding: 50px 72px;
+  background:
+    linear-gradient(rgba(124, 199, 255, .04) 1px, transparent 1px),
+    linear-gradient(90deg, rgba(124, 199, 255, .04) 1px, transparent 1px),
+    linear-gradient(135deg, #08111d 0%, #0b1524 54%, #101426 100%);
+  background-size: 40px 40px, 40px 40px, auto;
+  color: var(--text-1);
+}
+
+.tpl-beaver-replay .slide::before {
+  content: "";
+  position: absolute;
+  inset: 0;
+  background:
+    linear-gradient(90deg, rgba(100, 227, 161, .1), transparent 22%, transparent 78%, rgba(124, 199, 255, .08)),
+    linear-gradient(180deg, rgba(217, 166, 255, .07), transparent 30%, transparent 82%, rgba(100, 227, 161, .05));
+  opacity: .7;
+  pointer-events: none;
+  z-index: 0;
+}
+
+.tpl-beaver-replay .slide > * {
+  position: relative;
+  z-index: 1;
+}
+
+.tpl-beaver-replay .h1 {
+  font-size: 62px;
+  line-height: 1.06;
+  font-weight: 800;
+  letter-spacing: 0;
+  margin-bottom: 20px;
+  color: #fff;
+}
+
+.tpl-beaver-replay .h2 {
+  font-size: 40px;
+  line-height: 1.12;
+  font-weight: 760;
+  letter-spacing: 0;
+  color: #fff;
+  margin-bottom: 16px;
+}
+
+.tpl-beaver-replay h3,
+.tpl-beaver-replay h4 {
+  color: #fff;
+  letter-spacing: 0;
+}
+
+.tpl-beaver-replay .lede {
+  font-size: 19px;
+  line-height: 1.55;
+  color: var(--text-2);
+  max-width: 66ch;
+}
+
+.tpl-beaver-replay .kicker {
+  color: var(--accent);
+  font-family: "JetBrains Mono", "IBM Plex Mono", monospace;
+  font-size: 13px;
+  font-weight: 700;
+  letter-spacing: 0;
+  text-transform: none;
+}
+
+.tpl-beaver-replay .kicker::before {
+  content: "> ";
+}
+
+.tpl-beaver-replay .mono {
+  font-family: "JetBrains Mono", "IBM Plex Mono", monospace;
+}
+
+.tpl-beaver-replay .gradient-text {
+  background: var(--grad);
+  -webkit-background-clip: text;
+  background-clip: text;
+  -webkit-text-fill-color: transparent;
+  color: transparent;
+}
+
+.tpl-beaver-replay .deck-footer {
+  position: absolute;
+  left: 40px;
+  right: 40px;
+  bottom: 20px;
+  display: flex;
+  align-items: center;
+  justify-content: space-between;
+  color: var(--text-3);
+  font-family: "JetBrains Mono", "IBM Plex Mono", monospace;
+  letter-spacing: 0;
+}
+
+.tpl-beaver-replay .card {
+  background: rgba(16, 27, 44, .92);
+  border: 1px solid var(--border);
+  border-radius: var(--radius);
+  box-shadow: none;
+}
+
+.tpl-beaver-replay .card-accent {
+  border-top: 3px solid var(--accent);
+}
+
+.tpl-beaver-replay .panel {
+  padding: 20px 22px;
+  background: rgba(8, 17, 29, .68);
+  border: 1px solid var(--border);
+  border-radius: 8px;
+}
+
+.tpl-beaver-replay .tag,
+.tpl-beaver-replay .pill {
+  display: inline-flex;
+  align-items: center;
+  min-height: 24px;
+  padding: 4px 10px;
+  border-radius: 6px;
+  background: rgba(19, 34, 53, .9);
+  border: 1px solid var(--border);
+  color: var(--text-2);
+  font-family: "JetBrains Mono", "IBM Plex Mono", monospace;
+  font-size: 12px;
+  letter-spacing: 0;
+}
+
+.tpl-beaver-replay .tag.good {
+  color: var(--good);
+  border-color: rgba(100, 227, 161, .34);
+  background: rgba(100, 227, 161, .1);
+}
+
+.tpl-beaver-replay .tag.warn {
+  color: var(--warn);
+  border-color: rgba(255, 209, 102, .34);
+  background: rgba(255, 209, 102, .1);
+}
+
+.tpl-beaver-replay .tag.bad {
+  color: var(--bad);
+  border-color: rgba(255, 123, 123, .34);
+  background: rgba(255, 123, 123, .1);
+}
+
+.tpl-beaver-replay .terminal {
+  background: #050a12;
+  border: 1px solid var(--border);
+  border-radius: 8px;
+  overflow: hidden;
+  box-shadow: var(--shadow);
+  font-family: "JetBrains Mono", "IBM Plex Mono", monospace;
+  font-size: 14px;
+  line-height: 1.62;
+}
+
+.tpl-beaver-replay .terminal .bar {
+  display: flex;
+  align-items: center;
+  gap: 8px;
+  padding: 10px 14px;
+  background: #0d1726;
+  border-bottom: 1px solid var(--border);
+  font-size: 12px;
+  color: var(--text-3);
+}
+
+.tpl-beaver-replay .terminal .dot {
+  width: 10px;
+  height: 10px;
+  border-radius: 50%;
+  background: #ff7b7b;
+}
+
+.tpl-beaver-replay .terminal .dot:nth-child(2) {
+  background: #ffd166;
+}
+
+.tpl-beaver-replay .terminal .dot:nth-child(3) {
+  background: #64e3a1;
+}
+
+.tpl-beaver-replay .terminal pre {
+  margin: 0;
+  padding: 20px 22px;
+  color: #dbeafe;
+  overflow: auto;
+  max-height: 420px;
+}
+
+.tpl-beaver-replay .kw { color: #ff9f9f; }
+.tpl-beaver-replay .fn { color: #d9a6ff; }
+.tpl-beaver-replay .str { color: #9fe6ff; }
+.tpl-beaver-replay .num { color: #7cc7ff; }
+.tpl-beaver-replay .cmt { color: #6f879f; }
+
+.tpl-beaver-replay .speaker {
+  display: flex;
+  align-items: center;
+  gap: 14px;
+  margin-top: 22px;
+}
+
+.tpl-beaver-replay .speaker .av {
+  width: 54px;
+  height: 54px;
+  border-radius: 50%;
+  background: var(--grad);
+}
+
+.tpl-beaver-replay .speaker b {
+  display: block;
+  color: #fff;
+  font-size: 17px;
+}
+
+.tpl-beaver-replay .speaker span {
+  color: var(--text-3);
+  font-size: 13px;
+  font-family: "JetBrains Mono", "IBM Plex Mono", monospace;
+}
+
+.tpl-beaver-replay .agenda-row {
+  display: grid;
+  grid-template-columns: 58px 1fr 190px;
+  gap: 22px;
+  align-items: baseline;
+  padding: 15px 0;
+  border-bottom: 1px dashed var(--border);
+}
+
+.tpl-beaver-replay .agenda-row .num {
+  color: var(--accent);
+  font-family: "JetBrains Mono", "IBM Plex Mono", monospace;
+}
+
+.tpl-beaver-replay .agenda-row .t {
+  color: #fff;
+  font-size: 22px;
+  font-weight: 700;
+}
+
+.tpl-beaver-replay .agenda-row .d {
+  color: var(--text-3);
+  font-size: 13px;
+  font-family: "JetBrains Mono", "IBM Plex Mono", monospace;
+}
+
+.tpl-beaver-replay .flow {
+  display: grid;
+  grid-template-columns: repeat(5, 1fr);
+  gap: 12px;
+  align-items: stretch;
+}
+
+.tpl-beaver-replay .flow-step {
+  min-height: 120px;
+  padding: 16px;
+  border-radius: 8px;
+  background: rgba(16, 27, 44, .94);
+  border: 1px solid var(--border);
+}
+
+.tpl-beaver-replay .flow-step .n {
+  display: inline-block;
+  margin-bottom: 10px;
+  color: var(--accent);
+  font-family: "JetBrains Mono", "IBM Plex Mono", monospace;
+  font-size: 13px;
+}
+
+.tpl-beaver-replay .flow-step h4 {
+  margin: 0 0 8px;
+  font-size: 18px;
+}
+
+.tpl-beaver-replay .flow-step p {
+  margin: 0;
+  color: var(--text-2);
+  font-size: 14px;
+  line-height: 1.5;
+}
+
+.tpl-beaver-replay .split {
+  display: grid;
+  grid-template-columns: 1fr 1fr;
+  gap: 26px;
+  align-items: start;
+}
+
+.tpl-beaver-replay .compare {
+  display: grid;
+  grid-template-columns: 1fr 1fr;
+  gap: 18px;
+}
+
+.tpl-beaver-replay .compare .side {
+  min-height: 330px;
+  padding: 22px;
+  border: 1px solid var(--border);
+  border-radius: 8px;
+  background: rgba(8, 17, 29, .72);
+}
+
+.tpl-beaver-replay .compare .side.candidate {
+  border-color: rgba(100, 227, 161, .4);
+  background: rgba(100, 227, 161, .08);
+}
+
+.tpl-beaver-replay .metric-grid {
+  display: grid;
+  grid-template-columns: repeat(4, 1fr);
+  gap: 14px;
+}
+
+.tpl-beaver-replay .metric {
+  padding: 18px;
+  border-radius: 8px;
+  border: 1px solid var(--border);
+  background: rgba(16, 27, 44, .9);
+  min-height: 104px;
+}
+
+.tpl-beaver-replay .metric b {
+  display: block;
+  font-size: 23px;
+  line-height: 1.1;
+  color: #fff;
+}
+
+.tpl-beaver-replay .metric span {
+  color: var(--text-3);
+  font-family: "JetBrains Mono", "IBM Plex Mono", monospace;
+  font-size: 12px;
+}
+
+.tpl-beaver-replay .matrix {
+  display: grid;
+  grid-template-columns: 190px repeat(3, 1fr);
+  border: 1px solid var(--border);
+  border-radius: 8px;
+  overflow: hidden;
+}
+
+.tpl-beaver-replay .matrix > div {
+  min-height: 76px;
+  padding: 14px;
+  border-right: 1px solid var(--border);
+  border-bottom: 1px solid var(--border);
+  background: rgba(16, 27, 44, .78);
+  font-size: 14px;
+}
+
+.tpl-beaver-replay .matrix > div:nth-child(4n) {
+  border-right: 0;
+}
+
+.tpl-beaver-replay .matrix .head {
+  min-height: 48px;
+  color: var(--accent-2);
+  background: rgba(124, 199, 255, .08);
+  font-family: "JetBrains Mono", "IBM Plex Mono", monospace;
+  font-size: 12px;
+}
+
+.tpl-beaver-replay .matrix .rowhead {
+  color: #fff;
+  font-weight: 700;
+}
+
+.tpl-beaver-replay .pipeline {
+  display: grid;
+  grid-template-columns: 1fr 1fr 1fr;
+  gap: 18px;
+}
+
+.tpl-beaver-replay .pipeline .phase {
+  min-height: 250px;
+  padding: 22px;
+  border-radius: 8px;
+  border: 1px solid var(--border);
+  background: rgba(16, 27, 44, .88);
+}
+
+.tpl-beaver-replay .phase h3 {
+  font-size: 22px;
+  margin-bottom: 12px;
+}
+
+.tpl-beaver-replay ul.clean {
+  list-style: none;
+  padding: 0;
+  margin: 0;
+}
+
+.tpl-beaver-replay ul.clean li {
+  position: relative;
+  padding-left: 18px;
+  margin: 10px 0;
+  color: var(--text-2);
+  font-size: 15px;
+  line-height: 1.45;
+}
+
+.tpl-beaver-replay ul.clean li::before {
+  content: "";
+  position: absolute;
+  left: 0;
+  top: .65em;
+  width: 7px;
+  height: 7px;
+  border-radius: 50%;
+  background: var(--accent);
+}
+
+.tpl-beaver-replay .large-number {
+  font-family: "JetBrains Mono", "IBM Plex Mono", monospace;
+  font-size: 96px;
+  line-height: .9;
+  font-weight: 800;
+  color: var(--accent);
+}
+
+.tpl-beaver-replay .source-line {
+  color: var(--text-3);
+  font-size: 12px;
+  font-family: "JetBrains Mono", "IBM Plex Mono", monospace;
+}
+
+.tpl-beaver-replay .roadmap {
+  display: grid;
+  grid-template-columns: repeat(4, 1fr);
+  gap: 14px;
+}
+
+.tpl-beaver-replay .roadmap .item {
+  min-height: 150px;
+  padding: 18px;
+  background: rgba(16, 27, 44, .9);
+  border: 1px solid var(--border);
+  border-radius: 8px;
+}
+
+.tpl-beaver-replay .roadmap .item b {
+  display: block;
+  margin-bottom: 10px;
+  color: #fff;
+  font-size: 18px;
+}
+
+.tpl-beaver-replay .roadmap .item span {
+  color: var(--accent);
+  font-family: "JetBrains Mono", "IBM Plex Mono", monospace;
+  font-size: 12px;
+}
+
+.tpl-beaver-replay .roadmap .item p {
+  color: var(--text-2);
+  font-size: 14px;
+  line-height: 1.5;
+}
+
+.tpl-beaver-replay .center-mark {
+  display: inline-flex;
+  align-items: center;
+  justify-content: center;
+  width: 112px;
+  height: 112px;
+  border-radius: 50%;
+  border: 1px solid rgba(100, 227, 161, .38);
+  background: rgba(100, 227, 161, .08);
+  color: var(--accent);
+  font-family: "JetBrains Mono", "IBM Plex Mono", monospace;
+  font-size: 46px;
+  font-weight: 800;
+}
--- a/docs/product-discovery/README.md
+++ b/docs/product-discovery/README.md
@ -0,0 +1,11 @@
+# Product Discovery
+
+Product discovery artifacts for Beaver.
+
+## Main Product
+
+- [Beaver Agent Sandbox](./beaver/README.md)
+
+## Feature-Level Discovery
+
+- [Skill Replay Eval](./skill-replay-eval/README.md)
--- a/docs/product-discovery/beaver/PRD-beaver-agent-sandbox.md
+++ b/docs/product-discovery/beaver/PRD-beaver-agent-sandbox.md
@ -0,0 +1,489 @@
+# PRD: Beaver Agent Sandbox
+
+Date: 2026-06-09
+
+Status: Product discovery draft for whole Beaver product
+
+## 1. Summary
+
+Beaver Agent Sandbox is a private-deployable workspace for enterprise Agent work. It lets users move from chat to managed tasks, execute work with files and tools, track evidence, accept or revise outputs, and turn successful work into reusable skills and memory.
+
+The first product goal is to prove that Beaver can help a pilot team complete repeatable knowledge work with more control, traceability, and reuse than chat-only AI tools.
+
+## 2. Contacts
+
+| Role | Owner | Comment |
+| --- | --- | --- |
+| Product owner | TBD | Owns positioning, roadmap, pilot metrics, research |
+| Engineering owner | TBD | Owns platform architecture and implementation quality |
+| Design owner | TBD | Owns workspace, task, review, admin, and onboarding UX |
+| Deployment owner | TBD | Owns Docker deployment, routing, instance lifecycle |
+| Security/review owner | TBD | Owns tool policy, data boundaries, connector safety |
+| Pilot owner | TBD | Owns customer/team selection and feedback loop |
+
+## 3. Background
+
+Most enterprise AI experiments start with chat. Chat is useful, but it is weak at real work:
+
+- There is no durable task lifecycle.
+- It is hard to see what happened.
+- File and tool work is scattered.
+- Results are not formally accepted or rejected.
+- Successful workflows are not turned into reusable team capability.
+- Admins cannot easily control deployment, tools, memory, and connectors.
+
+Beaver addresses this gap by acting as an Agent execution and governance layer. It combines a user workspace, task runtime, evidence timeline, file and tool operations, skill learning, scheduled work, connectors, and private multi-instance deployment.
+
+Why now:
+
+- Teams are moving from AI demos to operational AI workflows.
+- Enterprise buyers need governance, not only model access.
+- Beaver already has enough implementation to support pilot workflows.
+- The next step is product packaging, validation, and operational hardening.
+
+## 4. Objective
+
+### Objective
+
+Prove Beaver can deliver trusted, repeatable Agent work for pilot teams.
+
+### Key Results
+
+| Key Result | Target |
+| --- | --- |
+| Time to first accepted task | Pilot user reaches first accepted task within the first session |
+| Accepted Agent Workflows | >=30 accepted tasks across pilot team within 30 days |
+| Acceptance Rate | >=60% of completed task runs accepted |
+| Evidence Coverage | >=90% of task runs show useful timeline/tool/artifact evidence |
+| Skill Reuse | >=5 reusable skills created, >=3 reused at least twice |
+| Deployment Repeatability | Fresh pilot deployment under 2 hours with documented steps |
+| Critical Incidents | 0 control-plane exposure, data leakage, or unintended external-write incidents |
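
The numeric key results above can be checked mechanically once task runs are recorded. The following is a minimal sketch, not Beaver's actual data model: `TaskRun`, its fields, and `pilot_scorecard` are hypothetical names introduced for illustration, and only the three thresholds that reduce to simple counts (>=30 accepted tasks, >=60% acceptance rate, >=90% evidence coverage) are covered.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """Hypothetical summary of one completed task run (not a Beaver type)."""
    task_id: str
    accepted: bool       # the user accepted the result of this run
    has_evidence: bool   # timeline/tool/artifact evidence was judged useful


def acceptance_rate(runs: list[TaskRun]) -> float:
    """Share of completed runs that were accepted; 0.0 when there are no runs."""
    return sum(r.accepted for r in runs) / len(runs) if runs else 0.0


def evidence_coverage(runs: list[TaskRun]) -> float:
    """Share of completed runs with useful evidence; 0.0 when there are no runs."""
    return sum(r.has_evidence for r in runs) / len(runs) if runs else 0.0


def pilot_scorecard(runs: list[TaskRun]) -> dict[str, bool]:
    """Compare recorded runs against the thresholds from the Key Results table."""
    # Count distinct tasks, so a task accepted after several runs counts once.
    accepted_tasks = len({r.task_id for r in runs if r.accepted})
    return {
        "accepted_tasks_ok": accepted_tasks >= 30,
        "acceptance_rate_ok": acceptance_rate(runs) >= 0.60,
        "evidence_coverage_ok": evidence_coverage(runs) >= 0.90,
    }
```

Skill reuse, deployment time, and incident counts would need their own data sources and are left out of this sketch.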
+
+## 5. Market Segments
+
+### Primary Segment: Enterprise Teams Doing Repeatable Knowledge Work
+
+Examples:
+
+- Project delivery teams.
+- Operations teams.
+- Internal strategy/research teams.
+- Technical support and engineering teams.
+- Customer success and sales operations teams.
+
+Their work is a good fit when it is:
+
+- Repeated often.
+- Multi-step.
+- File-heavy.
+- Tool-heavy.
+- Subject to review or approval.
+- Better served by a traceable process.
+
+### Buyer Segment: AI Platform Owner / IT Leader
+
+They need to provide AI capability without losing control over deployment, data, tools, and governance.
+
+### Admin Segment: Operator / Implementation Owner
+
+They set up Beaver, manage model providers, monitor health, handle connectors, and support users.
+
+### Maintainer Segment: Skill Owner
+
+They curate reusable skills and make sure published skills are safe, useful, and reviewable.
+
+## 6. Value Propositions
+
+### For Workflow Teams
+
+Beaver turns AI conversations into managed work. A request can become a task, produce artifacts, show evidence, and continue through revision until accepted.
+
+### For Platform Owners
+
+Beaver offers a private Agent sandbox with instance boundaries, tool governance, skills, and operational controls.
+
+### For Admins
+
+Beaver makes onboarding and operations more repeatable through the auth portal, deploy control, routing, settings, status, and logs.
+
+### For Skill Maintainers
+
+Beaver turns accepted work into reusable skills through a candidate, draft, safety/eval, review, and publish flow.
+
+### For End Users
+
+Beaver gives one place to chat, upload files, run tasks, preview outputs, review results, and reuse proven methods.
+
+## 7. Solution
+
+### 7.1 User Experience
+
+#### First-Run Experience
+
+```text
+User registers
+  -> app instance is created
+  -> user configures model provider
+  -> user enters Beaver workspace
+  -> user starts from a workflow template or chat
+  -> Beaver creates or continues a task
+  -> user accepts or revises the result
+```
+
+Requirements:
+
+- Registration and instance provisioning must show clear progress and errors.
+- Provider setup must be understandable and recoverable.
+- If provider setup is skipped, the app must clearly explain why Agent calls cannot run.
+
+#### Daily User Workspace
+
+Primary screens:
+
+- Chat workbench.
+- Task list and task details.
+- Files.
+- Notifications and scheduled work.
+- Skills and marketplace.
+- Tool management.
+- Settings/status/logs.
+
+Core user loop:
+
+```text
+Ask
+  -> execute
+  -> inspect evidence
+  -> accept/revise
+  -> reuse
+```
+
+#### Admin Experience
+
+Admin needs:
+
+- See instance health.
+- Configure providers.
+- Configure channels/connectors.
+- Restart safely.
+- Inspect logs.
+- Manage tools and skills.
+- Understand failures.
+
+### 7.2 Key Features
+
+#### Authentication And Instance Provisioning
+
+Requirements:
+
+- Users register or log in through the auth portal.
+- Registration triggers creation of an app-instance container.
+- The router maps each instance host to its container.
+- Provider onboarding can configure model provider after instance creation.
+
+Acceptance criteria:
+
+- New user can reach a working instance.
+- Failed provisioning shows a recoverable error.
+- `deploy-control` and `authz-service` are not public surfaces.
+
+#### Chat Workbench
+
+Requirements:
+
+- Users can create/select sessions.
+- Users can send text and attachments.
+- Users can see Assistant messages, task cards, Agent run progress, and acceptance controls.
+- Users can jump from chat to task detail.
+
+Acceptance criteria:
+
+- User can complete one full chat-to-task-to-accept flow.
+- Attachments can be uploaded and used.
+- Current task status is visible.
+
+#### Task Lifecycle
+
+Requirements:
+
+- System can distinguish ordinary chat from task requests.
+- Task can be created, run, continued, revised, accepted, abandoned, or deleted.
+- Task detail shows timeline, runs, tools, artifacts, result, and acceptance controls.
+
+Acceptance criteria:
+
+- Task list and detail remain useful on mobile and desktop.
+- Acceptance actions are persisted.
+- Revision feedback continues the same task context.
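
One way to read the lifecycle requirements above is as a small state machine. The sketch below is an illustration under assumptions, not Beaver's implementation: the state names, action names, and `apply_action` are hypothetical, and deletion is omitted because it removes the task rather than moving it to another state.

```python
from enum import Enum


class TaskState(Enum):
    """Hypothetical task states; the real Beaver states may differ."""
    CREATED = "created"
    RUNNING = "running"
    AWAITING_REVIEW = "awaiting_review"
    ACCEPTED = "accepted"
    ABANDONED = "abandoned"


# Allowed (state, action) pairs. "revise" sends a reviewed task back to
# RUNNING, so feedback continues the same task instead of creating a new one.
TRANSITIONS = {
    (TaskState.CREATED, "run"): TaskState.RUNNING,
    (TaskState.RUNNING, "finish"): TaskState.AWAITING_REVIEW,
    (TaskState.AWAITING_REVIEW, "revise"): TaskState.RUNNING,
    (TaskState.AWAITING_REVIEW, "accept"): TaskState.ACCEPTED,
    (TaskState.CREATED, "abandon"): TaskState.ABANDONED,
    (TaskState.RUNNING, "abandon"): TaskState.ABANDONED,
    (TaskState.AWAITING_REVIEW, "abandon"): TaskState.ABANDONED,
}


def apply_action(state: TaskState, action: str) -> TaskState:
    """Return the next state, or raise ValueError for a disallowed action."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(
            f"action {action!r} is not allowed in state {state.value!r}"
        ) from None
```

Keeping the transitions in one table makes it easy to persist each acceptance action as an event and to reject, for example, accepting a task that has never run.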
+
+#### Agent Team Execution
+
+Requirements:
+
+- Complex tasks can be planned as sequence, parallel, or DAG execution.
+- Subtasks can bind skills or ephemeral guidance.
+- Main Agent synthesizes final answer from evidence.
+
+Acceptance criteria:
+
+- Subtask results are visible and debuggable.
+- Failed team execution is shown without hiding partial evidence.
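
The DAG case above can be illustrated with Python's standard `graphlib` module. This is a sketch under assumptions, not the `beaver.coordinator` implementation: `plan_batches` and the subtask names are hypothetical. It groups subtasks into batches whose dependencies are already satisfied, so each batch could run in parallel; a pure sequence or a fully parallel plan are special cases of the same graph.

```python
from graphlib import TopologicalSorter


def plan_batches(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group subtasks into batches that can each run in parallel.

    `dependencies` maps a subtask name to the set of subtasks it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not a DAG
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)  # mark the whole batch finished to unlock successors
    return batches


# Hypothetical plan: two branches fan out from research, then get synthesized.
plan = {
    "research": set(),
    "outline": {"research"},
    "draft_report": {"outline"},
    "build_charts": {"research"},
    "synthesize": {"draft_report", "build_charts"},
}
batches = plan_batches(plan)
```

A real coordinator would also have to record partial results when a subtask fails, so that evidence from completed batches stays visible.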
+
+#### Files Workspace
+
+Requirements:
+
+- Users can upload, create folders, browse, preview, download, and delete files.
+- Workspace roots stay understandable.
+- File operations are safe within instance boundaries.
+
+Acceptance criteria:
+
+- Root and nested directories work.
+- Text/Markdown/image preview works.
+- Long file names do not break layout.
+
+#### Tools And MCP
+
+Requirements:
+
+- Admins can view, test, add, edit, delete, and refresh tools where supported.
+- Agent runtime can expose tools based on task and skill context.
+- Tool calls are recorded as evidence.
+
+Acceptance criteria:
+
+- Tool detail and test flows work.
+- Dangerous tools are governed by policy before broad rollout.
+
+#### Skills And Marketplace
+
+Requirements:
+
+- Published skills can be listed, inspected, installed, uploaded, disabled, rolled back, or deleted where supported.
+- Accepted work can create skill candidates.
+- Candidates can become drafts.
+- Drafts require safety/eval/review gates before publish.
+- Marketplace supports discovery and install.
+
+Acceptance criteria:
+
+- Candidate and draft flows do not reset UI state unexpectedly.
+- Publish requires review gates.
+- Published skill can be reused by later tasks.
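
The rule that publishing requires review gates can be sketched as a simple precondition check. The names below (`SkillDraft`, `REQUIRED_GATES`, `publish`) are hypothetical and are not taken from `beaver.skills`; the gate list mirrors the safety/eval/review gates named in the requirements.

```python
from dataclasses import dataclass, field

# Gates a draft must pass before publishing, per the requirements above.
REQUIRED_GATES = ("safety", "eval", "review")


@dataclass
class SkillDraft:
    """Hypothetical draft record; the real Beaver draft model may differ."""
    name: str
    passed_gates: set[str] = field(default_factory=set)

    def missing_gates(self) -> list[str]:
        """Return the required gates this draft has not passed yet, in order."""
        return [g for g in REQUIRED_GATES if g not in self.passed_gates]


def publish(draft: SkillDraft) -> str:
    """Publish only when every required gate has passed; otherwise refuse."""
    missing = draft.missing_gates()
    if missing:
        raise PermissionError(
            f"cannot publish {draft.name!r}: missing gates {missing}"
        )
    return "published"
```

Returning the list of missing gates, rather than a bare boolean, lets the UI tell a maintainer exactly which step is still outstanding.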
+
+#### Memory
+
+Requirements:
+
+- Beaver can store long-term preferences, business knowledge, historical task learning, file/artifact memory, tool experience, and reusable workflows.
+- Before broad product use, users/admins need memory inspect/edit/delete/freeze controls.
+
+Acceptance criteria for Memory Control Center MVP:
+
+- User can see what is remembered.
+- User can see source and last-used context.
+- User can edit, delete, or freeze memory.
+- Task detail can show when memory affected execution.
+
+#### Scheduled Work And Notifications
+
+Requirements:
+
+- Users can create scheduled jobs.
+- Scheduled runs can produce notifications or tasks.
+- Users can review, revise, or accept scheduled outputs.
+
+Acceptance criteria:
+
+- A scheduled job can be created, toggled, run immediately, and deleted.
+- Scheduled output can enter normal task review flow.
+
+#### Connectors
+
+Requirements:
+
+- Beaver can connect to external systems such as Outlook and selected IM/channel connectors.
+- Connector status, setup, errors, and reconnect path must be visible.
+- External writes require a clear policy and safety boundary.
+
+Acceptance criteria:
+
+- Pilot-safe connector list is documented.
+- External connector callbacks route correctly in multi-instance deployment.
+- Failed connector auth or setup is recoverable.
+
+#### Settings, Status, Logs
+
+Requirements:
+
+- Users/admins can configure provider, Agent settings, channels, and runtime.
+- Status page shows current app health.
+- Logs help operators diagnose failures.
+- Restart is confirmed before execution.
+
+Acceptance criteria:
+
+- Provider save flow works.
+- Runtime restart flow is protected by confirmation.
+- Long config values do not break UI.
+
+### 7.3 Technology
+
+Frontend:
+
+- Next.js app inside `app-instance/frontend`.
+- App shell with chat, tasks, files, skills, marketplace, tools, connectors, settings, status, logs.
+
+Backend:
+
+- Python Beaver backend inside `app-instance/backend`.
+- Unified `beaver.engine` for Agent runtime.
+- `beaver.coordinator` for multi-agent execution.
+- `beaver.services` for task, cron, process, and application orchestration.
+- `beaver.tools` for built-in/MCP tool execution.
+- `beaver.skills` for skill loading, learning, review, publishing.
+- `beaver.memory` for run memory, skills memory, long-term memory foundation.
+- `beaver.interfaces` for web, MCP, channels, CLI/gateway surfaces.
+
+Deployment:
+
+- `auth-portal`.
+- `authz-service`.
+- `deploy-control`.
+- `router-proxy`.
+- `app-instance`.
+- Docker network and per-instance mounted runtime directories.
+
+### 7.4 Data And Evidence
+
+Important product data:
+
+- Users and auth handoff.
+- Instance registry.
+- Provider configuration.
+- Conversations and messages.
+- Tasks, task runs, run events, timeline events.
+- Tool calls and results.
+- Files and artifacts.
+- Skill receipts, candidates, drafts, safety/eval reports, reviews, published versions.
+- Memory records.
+- Scheduled jobs and scheduled runs.
+- Connector state and events.
+
+Evidence principle:
+
+Every meaningful Agent action should become explainable later.
+
+### 7.5 Assumptions
+
+- The best first customers are teams with repeatable knowledge workflows.
+- Task acceptance is the right primary quality signal.
+- Private deployment is a benefit, not a barrier, for early enterprise pilots.
+- Teams will value skill/memory reuse after enough accepted tasks.
+- Admins can operate a Docker-based deployment with a clear runbook.
+- Memory must be controllable before it can be trusted.
+
+### 7.6 Non-Goals For First Pilot
+
+- Broad public SaaS launch.
+- Full multi-tenant organization management.
+- Fully autonomous skill publishing.
+- Production external writes without clear review.
+- Complete enterprise RBAC.
+- Unlimited connector support.
+- Perfect long-term memory automation.
+- Replacing human review for high-risk work.
+
+## 8. Release
+
+### Release 0: Internal Demo Readiness
+
+Scope:
+
+- Clean local deployment.
+- Auth portal registration/login.
+- Provider onboarding.
+- Chat-to-task demo.
+- Task detail evidence.
+- File upload/preview.
+- Skills and marketplace demo.
+- Settings/status/logs.
+
+Exit criteria:
+
+- Demo flow works on a fresh environment.
+- Known limitations are documented.
+- No critical security or deployment issues.
+
+### Release 1: Pilot Workflow Release
+
+Scope:
+
+- 2-3 packaged workflows.
+- Task acceptance and evidence as main flow.
+- Files and selected tools.
+- Basic scheduled workflow.
+- One pilot-safe connector if stable.
+- Skill candidate/draft/review/publish.
+- Deployment runbook and support checklist.
+
+Exit criteria:
+
+- Pilot team reaches >=30 accepted tasks in 30 days.
+- >=5 reusable skills created.
+- 0 critical incidents.
+- Deployment under 2 hours on fresh host.
+
+### Release 2: Governance And Reuse Release
+
+Scope:
+
+- Evidence narrative.
+- Memory Control Center.
+- Skill replay/eval governance.
+- Admin health console.
+- Connector policy hardening.
+- Pilot scorecard.
+
+Exit criteria:
+
+- Reviewers understand evidence.
+- Users can inspect and control memory.
+- Admins can diagnose provider/connector/runtime issues.
+- Skill reuse is visible in metrics.
+
+### Release 3: Expansion Release
+
+Scope:
+
+- Team/workspace concepts if validated.
+- More connectors.
+- Audit export.
+- Cross-instance analytics.
+- Policy profiles.
+- Instance lifecycle automation.
+
+Exit criteria:
+
+- Multiple teams can run without high support load.
+- Governance story supports enterprise buying process.
+
+## Open Questions
+
+- Is the first paying segment project teams, operations teams, engineering/support, or internal AI platform teams?
+- Should Beaver optimize for single-user instances first or team workspaces sooner?
+- Which connector is the safest and most valuable pilot connector?
+- What exact tool policy should apply in customer pilots?
+- What memory behavior should be on by default?
+- How much raw evidence should normal users see versus admins?
+- What is the backup/restore SLA for app instances?
+
+## Success Review Checklist
+
+- Can a new user get to first accepted task quickly?
+- Can a reviewer understand what the Agent did?
+- Can an admin recover from provider or connector errors?
+- Can a successful task become a reusable skill?
+- Can a pilot owner prove value with metrics?
+- Can security explain the deployment and tool boundaries?
--- a/docs/product-discovery/beaver/README.md
+++ b/docs/product-discovery/beaver/README.md
@ -0,0 +1,30 @@
+# Beaver Product Discovery
+
+This folder covers Beaver as the whole product, not only one feature.
+
+Beaver is an enterprise Agent sandbox and execution platform. It combines private deployment, per-user app instances, chat-to-task execution, task evidence, user acceptance, files, tools, skills, memory, connectors, scheduled work, and governance.
+
+## Documents
+
+- [Business Strategy HTML](./index.html): business-style product discovery, strategy canvas, target users, segmentation, and competitors.
+- [Product PRD HTML](./product-prd.html): product PRD, outcome roadmap, module job stories, WWA backlog items, and test scenarios.
+- [Product Discovery Report](./product-discovery-report.md): product understanding, users, JTBD, opportunities, assumptions, experiments, priorities, metrics, and 30/90 day recommendations.
+- [Product Architecture Brief](./product-architecture-brief.md): product-facing architecture across auth, deployment control, routing, app instances, frontend, backend, Agent runtime, tools, skills, memory, files, connectors, and operations.
+- [PRD](./PRD-beaver-agent-sandbox.md): full-product PRD for the Beaver Agent Sandbox.
+- [Validation Plan](./validation-plan.md): customer, product, technical, security, usability, and business validation plan.
+- [Launch And Maintenance Runbook](./launch-maintenance-runbook.md): launch phases, readiness checks, monitoring, incident response, maintenance cadence, and rollback.
+
+## Source Material
+
+- [Project README](../../../README.md)
+- [Deployment Guide](../../../部署指南.md)
+- [Domain Guide](../../../域名配置指引.md)
+- [App Instance README](../../../app-instance/README.md)
+- [Backend README](../../../app-instance/backend/README.md)
+- [Recent Backend Features](../../../projcet_review/backend_recent_completed_features.md)
+- [UI/UX Page Docs](../../ui-ux/README.md)
+- [Customer Presentation](../../presentations/skill-replay-eval/index.html)
+
+## Related Feature Discovery
+
+- [Skill Replay Eval Discovery](../skill-replay-eval/README.md)
--- a/docs/product-discovery/beaver/index.html
+++ b/docs/product-discovery/beaver/index.html
--- a/docs/product-discovery/beaver/launch-maintenance-runbook.md
+++ b/docs/product-discovery/beaver/launch-maintenance-runbook.md
@ -0,0 +1,455 @@
+# Beaver Launch And Maintenance Runbook
+
+Date: 2026-06-09
+
+Scope: whole Beaver product.
+
+## 1. Launch Principle
+
+Launch Beaver through controlled pilots before broad rollout.
+
+The product has a wide operational surface: auth, deployment control, routing, per-instance app containers, model providers, Agent runtime, tools, files, skills, memory, scheduled work, and connectors. A successful launch depends as much on reliability and trust as on feature completeness.
+
+## 2. Launch Roles
+
+| Role | Responsibility |
+| --- | --- |
+| Launch owner | Owns readiness, go/no-go, rollout phases |
+| Deployment owner | Owns Docker images, network, router, instance lifecycle |
+| Backend owner | Owns Agent runtime, tasks, tools, skills, cron, APIs |
+| Frontend owner | Owns user-facing flows and UI verification |
+| Security owner | Owns control-plane exposure, data boundaries, tool/connector policy |
+| Pilot owner | Owns user onboarding, workflow selection, feedback, metrics |
+| Support owner | Owns incident triage, runbook updates, user support |
+
+## 3. Launch Phases
+
+### Phase 0: Local Internal Readiness
+
+Audience: builders and internal testers.
+
+Goals:
+
+- Full local deployment works.
+- Core demo flows are stable.
+- Known risks are documented.
+
+Required flows:
+
+- Register/login.
+- Provider onboarding.
+- First chat response.
+- Chat-to-task.
+- Task acceptance/revision.
+- File upload/preview/download/delete.
+- Skill list/candidate/draft/review.
+- Settings/status/restart.
+
+Exit criteria:
+
+- Fresh deployment run completed from docs.
+- No P0 or launch-blocking P1 issues.
+- Demo script works end to end.
+
+### Phase 1: Controlled Pilot
+
+Audience: one internal team or one trusted customer team.
+
+Goals:
+
+- Validate real workflow value.
+- Validate deployment and support process.
+- Validate trust, evidence, and governance story.
+
+Constraints:
+
+- Narrow workflow scope.
+- Narrow connector scope.
+- Conservative tool policy.
+- Human review for skill publishing.
+- No opaque memory use for sensitive data.
+
+Exit criteria:
+
+- >=30 accepted tasks in 30 days.
+- >=2 recurring workflows.
+- 0 critical incidents.
+- Deployment/support issues documented and reduced.
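
The three quantitative Phase 1 exit criteria above can be checked mechanically. The following is a minimal sketch, assuming the counts are collected elsewhere; `PilotStats` and `phase1_exit_blockers` are hypothetical names, not part of Beaver. The fourth criterion (issues documented and reduced) is qualitative and is left to human review.

```python
from dataclasses import dataclass


@dataclass
class PilotStats:
    """Hypothetical counters gathered over the pilot window."""
    accepted_tasks_last_30_days: int  # tasks accepted by users in the last 30 days
    recurring_workflows: int          # distinct workflows that recur
    critical_incidents: int           # critical incidents during the pilot


def phase1_exit_blockers(stats: PilotStats) -> list[str]:
    """Return the unmet quantitative exit criteria; an empty list means all are met."""
    blockers = []
    if stats.accepted_tasks_last_30_days < 30:
        blockers.append("fewer than 30 accepted tasks in 30 days")
    if stats.recurring_workflows < 2:
        blockers.append("fewer than 2 recurring workflows")
    if stats.critical_incidents > 0:
        blockers.append("at least one critical incident")
    return blockers
```

Returning the list of blockers rather than a single boolean keeps the go/no-go discussion concrete: the launch owner sees which criterion failed.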
+
+### Phase 2: Expanded Pilot
+
+Audience: more users in the same team or a second pilot team.
+
+Goals:
+
+- Test repeatability across workflows.
+- Introduce Memory Control Center or stricter memory policy if ready.
+- Strengthen skill reuse and scheduled work.
+
+Exit criteria:
+
+- Skill reuse becomes visible.
+- Admin can operate without developer pairing for common tasks.
+- Evidence and report quality are accepted by the workflow owner.
+
+### Phase 3: Production Candidate
+
+Audience: broader customer or department rollout.
+
+Goals:
+
+- Stabilized deployment.
+- Health monitoring.
+- Incident response.
+- Backup/restore process.
+- Policy profiles.
+
+Exit criteria:
+
+- Launch owner, security owner, and deployment owner approve.
+- Support process has clear ownership.
+- Rollback and restore are rehearsed.
+
+## 4. Pre-Launch Checklist
+
+### Deployment
+
+- [ ] Images build successfully.
+- [ ] Docker network exists.
+- [ ] Router proxy starts.
+- [ ] AuthZ service starts.
+- [ ] Deploy control starts.
+- [ ] Auth portal starts.
+- [ ] App instance can be created.
+- [ ] App instance route works through router proxy.
+- [ ] Provider config can be written and instance restarted.
+- [ ] Runtime directories are persistent.
+- [ ] Public exposure is limited to intended services.
+
+### Product Flows
+
+- [ ] Register/login works.
+- [ ] Provider onboarding works.
+- [ ] Chat workbench loads.
+- [ ] Task creation works.
+- [ ] Task detail timeline works.
+- [ ] Acceptance/revision/abandon works.
+- [ ] Files page works.
+- [ ] Tools page works for pilot tools.
+- [ ] Skills page works.
+- [ ] Marketplace install works if included.
+- [ ] Cron/scheduled flow works if included.
+- [ ] Connector flow works if included.
+- [ ] Settings/status/logs work.
+
+### Governance
+
+- [ ] Tool policy for pilot is documented.
+- [ ] Connector side effects are understood.
+- [ ] Skill publish gates are documented.
+- [ ] Memory behavior is documented.
+- [ ] Data retention expectations are documented.
+- [ ] User-facing limitations are documented.
+
+### Support
+
+- [ ] Pilot support channel exists.
+- [ ] Incident owner assigned.
+- [ ] Logs and health checks are accessible.
+- [ ] Backup/restore expectations are clear.
+- [ ] Known issues list exists.
+
+## 5. Monitoring
+
+### Product Metrics
+
+| Metric | Owner | Cadence |
+| --- | --- | --- |
+| Accepted tasks | Pilot owner | Weekly |
+| Acceptance rate | Product owner | Weekly |
+| Revision rate | Product owner | Weekly |
+| Active workflows | Pilot owner | Weekly |
+| Skill candidates and reuse | Product owner | Weekly |
+| Scheduled run success | Backend owner | Weekly |
+| Time to first accepted task | Product/design | Per onboarding |
+
+### Operational Metrics
+
+| Metric | Owner | Alert |
+| --- | --- | --- |
+| Instance creation failures | Deployment owner | >10% during pilot |
+| Router route failures | Deployment owner | Any repeated failure |
+| Provider setup failures | Support owner | >20% of onboarded users |
+| Task run failures | Backend owner | >20% for 2 days |
+| WebSocket/runtime disconnects | Backend/frontend | Repeated user-visible failures |
+| File operation failures | Backend owner | Any permission/path issue |
+| Tool execution failures | Backend owner | Repeated by tool category |
+| Cron failures | Backend owner | Any critical scheduled workflow missed |
+| Connector failures | Integration owner | Failed auth or unintended write |
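
The rate-based alerts in the table above (>10%, >20%, and ">20% for 2 days") can be expressed as small predicates. This is a sketch under two assumptions: failures and totals are counted per day, and "for 2 days" means two consecutive days. All function and constant names are hypothetical.

```python
def failure_rate(failed: int, total: int) -> float:
    """Fraction of failed attempts; 0.0 when nothing was attempted."""
    return failed / total if total else 0.0


def rate_alert(failed: int, total: int, threshold: float) -> bool:
    """True when the failure rate is strictly above the threshold."""
    return failure_rate(failed, total) > threshold


def sustained_alert(daily: list[tuple[int, int]], threshold: float, days: int) -> bool:
    """True when each of the last `days` daily (failed, total) pairs is above threshold."""
    if len(daily) < days:
        return False
    return all(rate_alert(f, t, threshold) for f, t in daily[-days:])


# Thresholds copied from the Operational Metrics table.
INSTANCE_CREATION_THRESHOLD = 0.10
PROVIDER_SETUP_THRESHOLD = 0.20
TASK_RUN_THRESHOLD = 0.20
TASK_RUN_DAYS = 2
```

For example, `sustained_alert(daily_counts, TASK_RUN_THRESHOLD, TASK_RUN_DAYS)` would flag the task-run row. The qualitative rows ("any repeated failure") still need a human or a separate counter.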
+
+### Security Metrics
+
+| Metric | Alert |
+| --- | --- |
+| Control-plane public exposure | Immediate P0 |
+| Cross-instance data access | Immediate P0 |
+| Unintended external write | Immediate P0 |
+| Credential leak in logs/report | Immediate P0 |
+| Unsafe skill publish | P1, or P0 if external action risk |
+
+## 6. Health Checks
+
+### Control Plane
+
+- Auth portal reachable.
+- AuthZ service reachable internally.
+- Deploy control reachable internally with token.
+- Router proxy has generated routes.
+- Instance registry is readable and consistent.
+
+### App Instance
+
+- Frontend loads.
+- Backend `/api/status` responds.
+- WebSocket works.
+- Provider config present.
+- Workspace path mounted.
+- Initial skills present.
+- Logs accessible.
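
The "Backend `/api/status` responds" check can be scripted with the Python standard library. This is a minimal sketch: the source only says the endpoint responds, so treating HTTP 200 as healthy is an assumption, and `backend_status_ok` and its `opener` parameter are hypothetical names.

```python
import urllib.error
import urllib.request


def backend_status_ok(base_url: str, timeout: float = 5.0,
                      opener=urllib.request.urlopen) -> bool:
    """Return True when GET {base_url}/api/status answers with HTTP 200.

    `opener` is injectable so the probe can be exercised without a network.
    """
    url = base_url.rstrip("/") + "/api/status"
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and non-2xx HTTP errors.
        return False
```

A daily pilot check could call this once per instance URL from the registry and report the instances that return `False`.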
+
+### Product Runtime
+
+- Chat request succeeds.
+- Task run succeeds.
+- File API succeeds.
+- Tool registry loads.
+- Skills list loads.
+- Cron scheduler active if enabled.
+- Connector status loads if enabled.
+
+## 7. Incident Response
+
+### P0: Control Plane Exposed
+
+Examples:
+
+- `deploy-control` accessible from public internet.
+- `authz-service` accessible from public internet.
+- Internal token leaked.
+
+Actions:
+
+1. Remove public route/firewall exposure.
+2. Rotate affected tokens.
+3. Review access logs.
+4. Confirm no unauthorized instance operations.
+5. Update deployment checklist.
+
+### P0: Cross-Instance Data Leak
+
+Examples:
+
+- Instance A reads Instance B workspace.
+- Router sends user to wrong instance.
+- Shared connector callback writes to wrong instance.
+
+Actions:
+
+1. Disable affected route or instance.
+2. Preserve logs and registry.
+3. Identify path/host/callback mapping failure.
+4. Patch and add regression test.
+5. Notify affected stakeholders.
+
+### P0: Unintended External Action
+
+Examples:
+
+- Email or IM message sent unexpectedly.
+- Calendar invite created unexpectedly.
+- External system updated without user intent.
+
+Actions:
+
+1. Disable connector or tool.
+2. Preserve task/tool evidence.
+3. Identify initiating task, tool, arguments, user, connector account.
+4. Patch policy or confirmation gate.
+5. Add test case and update pilot policy.
+
+### P1: New User Cannot Reach Instance
+
+Actions:
+
+1. Check auth portal logs.
+2. Check authz register flow.
+3. Check deploy-control register/configure flow.
+4. Check instance registry.
+5. Check router route generation.
+6. Check container state and app logs.
+
+### P1: Provider Config Broken
+
+Actions:
+
+1. Check settings/status.
+2. Confirm config path and provider fields.
+3. Test provider credentials.
+4. Restart instance if config was changed.
+5. Improve onboarding copy if the cause was user error.
+
+### P1: Task Runtime Failing
+
+Actions:
+
+1. Check backend logs.
+2. Check provider availability.
+3. Check tool registry.
+4. Check task event timeline.
+5. Reproduce with minimal chat request.
+6. Mark affected pilot workflow as paused if repeated.
+
+### P2: UI Flow Confusing
+
+Actions:
+
+1. Record screen and user quote.
+2. Add to UX issue list.
+3. Determine whether it blocks pilot success.
+4. Fix copy/layout if low effort.
+
+## 8. Maintenance Cadence
+
+### Daily During Pilot
+
+- Check critical incidents.
+- Check instance health.
+- Check failed task runs.
+- Check support channel.
+- Review provider/connector errors.
+
+### Weekly
+
+- Review accepted tasks and acceptance rate.
+- Review workflow success/failure.
+- Review skill candidates and reuse.
+- Review deployment issues.
+- Review security/tool/connector events.
+- Update known issues and runbook.
+
+### Monthly
+
+- Rehearse fresh deployment.
+- Review backup/restore approach.
+- Review memory and skill governance.
+- Review connector roadmap.
+- Review pilot ROI and expansion decision.
+
+### Quarterly
+
+- Revisit product positioning.
+- Revisit architecture scaling assumptions.
+- Decide team workspace / RBAC roadmap.
+- Review security model and policy profiles.
+
+## 9. Backup And Restore
+
+Minimum data to preserve:
+
+- `authz-service/runtime/data`
+- `app-instance/runtime/instances`
+- `app-instance/runtime/registry`
+- `router-proxy/runtime/conf.d`
+
+Per instance:
+
+- `beaver-home/config.json`
+- `beaver-home/web_auth_users.json`
+- `beaver-home/workspace/`
+- skill and runtime state under instance data.
+
+Pilot requirements:
+
+- Document manual backup command.
+- Document manual restore procedure.
+- Test restore for at least one non-production instance before expanded pilot.
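
A manual backup of the minimum data listed above could look like the sketch below. It assumes the four paths are relative to a single deployment root and that writers are stopped or quiesced during the copy; neither is stated in the source. `create_backup` is a hypothetical helper, not an existing Beaver command.

```python
import tarfile
from pathlib import Path

# Paths copied from the "Minimum data to preserve" list.
BACKUP_PATHS = [
    "authz-service/runtime/data",
    "app-instance/runtime/instances",
    "app-instance/runtime/registry",
    "router-proxy/runtime/conf.d",
]


def create_backup(deploy_root: Path, archive: Path) -> list[str]:
    """Archive every listed path that exists; return the paths that were missing."""
    missing = []
    with tarfile.open(str(archive), "w:gz") as tar:
        for rel in BACKUP_PATHS:
            src = deploy_root / rel
            if src.exists():
                # arcname keeps the relative layout so a restore can unpack in place.
                tar.add(str(src), arcname=rel)
            else:
                missing.append(rel)
    return missing
```

Returning the missing paths makes a partial backup visible instead of silently producing an incomplete archive. The per-instance files are expected to sit under the instance data directories, but that layout should be confirmed before relying on this.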
+
+## 10. Change Management
+
+Before changing any of these, require launch owner review:
+
+- Routing/proxy config.
+- AuthZ issuer/internal URL.
+- Deploy token names or values.
+- Instance registry format.
+- Workspace mount paths.
+- Provider config schema.
+- Tool execution policy.
+- Connector callback routing.
+- Skill publish gates.
+- Memory default behavior.
+
+## 11. Rollback
+
+Rollback options:
+
+- Roll back frontend/backend image for app instances.
+- Disable specific connector.
+- Disable scheduled job execution.
+- Disable skill learning worker.
+- Disable skill publish.
+- Fall back to chat-only mode for affected workflow.
+- Remove public route to affected instance.
+- Restore instance data from backup.
+
+Rollback triggers:
+
+- P0 incident.
+- Repeated instance creation failure.
+- Repeated task runtime failure blocking pilot work.
+- Provider config issue affecting most users.
+- Connector side-effect risk.
+- UI issue blocking first accepted task.
+
+## 12. Launch Communication
+
+### Internal
+
+Beaver is launching as a controlled Agent execution pilot. The launch goal is not maximum feature breadth. The goal is to prove repeatable AI-assisted work with task acceptance, evidence, and reuse.
+
+### Pilot Users
+
+Use Beaver for selected workflows where you need a concrete output. Review each result. Accept it if usable, request revision if it is close, or abandon it if it is not worth continuing. Your feedback is the signal that helps Beaver improve and reuse work.
+
+### Admins
+
+Treat Beaver as an app platform with a control plane and per-instance runtime. Keep deploy-control and authz private. Monitor instance health, provider config, tool behavior, and connector side effects.
+
+## 13. Known Limitations To Disclose
+
+- Memory is not yet fully productized with user controls.
+- Connector maturity varies by provider.
+- The first pilot should use a narrow set of workflows.
+- Some operations may still require engineering support.
+- Skill learning needs human review before publish.
+- Multi-user organization features are not the first pilot focus.
+
+## 14. Go / No-Go Criteria
+
+Go if:
+
+- Fresh deployment works.
+- First accepted task flow works.
+- Evidence timeline is readable enough for pilot.
+- Tool and connector policy is documented.
+- Support owner is assigned.
+- No critical security issue is open.
+
+No-go if:
+
+- Control-plane exposure risk is unresolved.
+- Cross-instance isolation is unverified.
+- Provider onboarding fails for most users.
+- Task runtime is unreliable.
+- Pilot workflow is not defined.
+- No one owns incidents or support.
--- a/docs/product-discovery/beaver/product-architecture-brief.md
+++ b/docs/product-discovery/beaver/product-architecture-brief.md
@ -0,0 +1,439 @@
+# Beaver Product Architecture Brief
+
+Date: 2026-06-09
+
+Audience: product, engineering, delivery, security, and pilot stakeholders.
+
+## 1. Architecture Summary
+
+Beaver is built as a private-deployable, multi-instance Agent workspace.
+
+At the top level, it has five deployment components:
+
+```text
+Browser
+  -> auth-portal
+  -> authz-service
+  -> deploy-control
+  -> router-proxy
+  -> app-instance
+```
+
+Each `app-instance` contains the user-facing product:
+
+```text
+app-instance container
+  -> Nginx
+  -> Next.js frontend
+  -> Beaver backend
+  -> mounted beaver-home
+       -> config
+       -> workspace
+       -> skills
+       -> runtime data
+```
+
+The key product architecture choice is instance-level sandboxing. Each user or team can receive a separate app instance with its own config, workspace, files, skills, and runtime data.
+
+## 2. Product-Level System Map
+
+```text
+Auth and onboarding
+  auth-portal
+    -> register/login
+    -> model provider onboarding
+  authz-service
+    -> account and backend identity
+  deploy-control
+    -> create/configure/remove app-instance
+  router-proxy
+    -> route instance host to app-instance container
+
+User workspace
+  app-instance/frontend
+    -> chat workbench
+    -> tasks
+    -> files
+    -> skills
+    -> marketplace
+    -> MCP/tools
+    -> notifications/cron
+    -> connectors
+    -> settings/status/logs
+
+Agent runtime
+  app-instance/backend
+    -> interfaces
+    -> services
+    -> engine
+    -> coordinator
+    -> tools
+    -> skills
+    -> memory
+    -> integrations
+```
+
+## 3. Deployment Components
+
+### Auth Portal
+
+Responsibility:
+
+- User login and registration entry.
+- Provider onboarding after registration.
+- Handoff into the user app instance.
+
+Product value:
+
+- Gives non-technical users a clean entry point.
+- Separates account onboarding from the per-instance app.
+
+Key risk:
+
+- Provider configuration must be understandable and recoverable for non-engineer users.
+
+### AuthZ Service
+
+Responsibility:
+
+- Account and backend identity orchestration.
+- Internal token-protected coordination.
+
+Product value:
+
+- Centralizes identity relationships between portal and app backends.
+
+Key risk:
+
+- Misconfigured issuer/internal URL can break new app instances.
+
+### Deploy Control
+
+Responsibility:
+
+- Create, configure, and manage app instances.
+- Call `app-instance/create-instance.sh`.
+- Write provider config and restart the instance when needed.
+
+Product value:
+
+- Makes private instance provisioning repeatable.
+
+Key risk:
+
+- Must not be exposed publicly.
+- Needs health checks and lifecycle operations for pilot scale.
+
+### Router Proxy
+
+Responsibility:
+
+- Route hostnames to the correct app instance container.
+
+Product value:
+
+- Lets each instance have a stable public URL.
+
+Key risk:
+
+- Domain, wildcard DNS, HTTPS, and route reload errors can block access.
+
+### App Instance
+
+Responsibility:
+
+- The user-facing Beaver workspace.
+- Runs frontend, backend, and Nginx in one container.
+- Mounts the instance's `beaver-home` as the config and workspace boundary.
+
+Product value:
+
+- Provides practical sandboxing for early private deployments.
+
+Key risk:
+
+- Instance lifecycle, backup, restore, and resource limits need productized operations.
+
+## 4. App Instance Product Modules
+
+### Frontend Modules
+
+| Module | Route | Product Job |
+| --- | --- | --- |
+| Chat workbench | `/` | Main workspace for conversation, attachments, task cards, and acceptance |
+| Tasks | `/tasks`, `/tasks/[taskId]` | Track ordinary and scheduled task lifecycle, timeline, evidence, artifacts |
+| Notifications | `/notifications` | Review proactive or scheduled outputs |
+| Cron | `/cron` | Manage scheduled jobs |
+| Files | `/files` | Browse, upload, preview, download, delete workspace files |
+| Skills | `/skills` | Manage published skills, candidates, drafts, safety/eval, review, publish |
+| Marketplace | `/marketplace` | Discover and install skills |
+| MCP/tools | `/mcp` | Manage tool servers, tool details, test, add, edit, delete |
+| Agents | `/agents` | Manage Agent definitions and roles |
+| Outlook/connectors | `/outlook`, settings connector panels | Connect external systems |
+| Settings/status/logs | `/settings`, `/status`, `/logs` | Configure providers, runtime, channels, health, and debugging |
+
+### Backend Modules
+
+| Module | Responsibility |
+| --- | --- |
+| `foundation` | Shared config, errors, events, utilities, base models |
+| `engine` | Unified Agent runtime used by main Agent and sub-agents |
+| `coordinator` | Multi-agent sequence/parallel/DAG execution |
+| `tools` | Built-in and MCP tool registration/execution |
+| `skills` | Skill loading, resolution, drafts, learning, review, publish |
+| `memory` | Long-term memory and run/skill stores |
+| `permissions` | Governance and policy surface |
+| `services` | Application orchestration, tasks, cron, process projection |
+| `interfaces` | Web, CLI, Gateway, channels, MCP servers |
+| `integrations` | AuthZ, MCP, external protocols, connector clients |
+
+## 5. Core Product Flows
+
+### Flow A: New User Registration And First Workspace
+
+```text
+Browser
+  -> auth-portal register
+  -> authz-service /portal/register
+  -> deploy-control /api/instances/register
+  -> create app-instance container
+  -> app-instance backend registers user/backend
+  -> provider onboarding
+  -> deploy-control configures provider
+  -> user enters app-instance URL
+```
+
+Product requirements:
+
+- Clear success/failure state during provisioning.
+- Provider setup can be skipped, but the instance must explain the missing model config later.
+- Internal control-plane endpoints stay private.
+
+### Flow B: Chat To Managed Task
+
+```text
+User message
+  -> chat workbench
+  -> backend task router
+  -> ordinary chat or task mode
+  -> task created
+  -> Agent execution
+  -> tool calls and artifacts
+  -> task timeline
+  -> user accepts / requests revision / abandons
+```
+
+Product requirements:
+
+- The user must understand when a message became a task.
+- The task must be recoverable from chat, task list, and details page.
+- Acceptance feedback must influence future learning.
+
+### Flow C: Complex Task With Agent Team
+
+```text
+Task request
+  -> TaskExecutionPlanner
+  -> ExecutionGraph
+       -> sequence / parallel / DAG nodes
+  -> TaskSkillResolver binds skills or ephemeral guidance
+  -> LocalAgentRunner executes nodes
+  -> main Agent synthesizes final answer
+  -> evidence saved
+```
+
+Product requirements:
+
+- Team execution should be visible without overwhelming users.
+- Failed subtasks should be diagnosable.
+- Final synthesis should cite or summarize subtask evidence.
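
To illustrate how sequence, parallel, and DAG nodes relate, the sketch below groups nodes of a dependency graph into batches that could run in parallel. It uses the standard-library `graphlib` module and is not Beaver's `ExecutionGraph` or `LocalAgentRunner` API; the node names in the usage note are hypothetical.

```python
from graphlib import TopologicalSorter


def execution_batches(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group nodes into batches; each batch depends only on earlier batches.

    `deps` maps a node to the set of nodes it must wait for.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # nodes whose dependencies are all done
        batches.append(ready)
        sorter.done(*ready)
    return batches
```

With `{"research": set(), "draft": {"research"}, "review": {"research"}, "synthesize": {"draft", "review"}}`, the middle batch holds `draft` and `review` together, and `synthesize` runs last. A pure sequence is the special case where every batch has one node.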
+
+### Flow D: Skill Learning Loop
+
+```text
+Accepted task
+  -> skill learning candidate
+  -> draft synthesis
+  -> safety report
+  -> eval report
+  -> human review
+  -> publish
+  -> future skill retrieval
+```
+
+Product requirements:
+
+- Only accepted or otherwise high-signal work should become skill candidates.
+- Publishing requires review and gates.
+- Skill quality must be traceable over versions.
+
+### Flow E: File And Tool Work
+
+```text
+User uploads file or Agent needs file/tool
+  -> workspace file API or tool registry
+  -> Agent tool execution
+  -> result returned to context
+  -> event/evidence saved
+  -> artifact available in task or files
+```
+
+Product requirements:
+
+- User-visible file roots must stay simple.
+- Tool calls must be recorded.
+- Dangerous tools need policy and review.
+
+### Flow F: Scheduled Work And Notifications
+
+```text
+User creates scheduled job
+  -> cron service stores job
+  -> scheduled run triggers task/notification
+  -> user reviews output
+  -> output can become normal task continuation
+```
+
+Product requirements:
+
+- Scheduled outputs need the same acceptance path as manual tasks.
+- Failed scheduled runs need alerts and retry/recovery.
+
+### Flow G: External Connectors
+
+```text
+Connector setup
+  -> channel/connector config
+  -> sidecar or external provider
+  -> inbound event or outbound action
+  -> Beaver task/runtime
+  -> response or notification
+```
+
+Product requirements:
+
+- External writes need clear user/admin control.
+- Connector onboarding must show state, errors, and reconnect steps.
+- Multi-instance callback routing must be explicit.
+
+## 6. Governance Boundaries
+
+### Instance Boundary
+
+Each app instance owns:
+
+- `config.json`
+- `web_auth_users.json`
+- `workspace/`
+- skills and runtime state
+- provider configuration
+
+Risk:
+
+- Cross-instance leakage would be a critical incident.
+
+### Control Plane Boundary
+
+Public exposure should be limited to:
+
+- Auth portal.
+- Router proxy for app instances.
+
+Do not expose:
+
+- `deploy-control`.
+- `authz-service`.
+
+### Tool Boundary
+
+Tools are the action surface. Policy should distinguish:
+
+- Read-only tools.
+- Workspace-scoped write tools.
+- External write tools.
+- Destructive tools.
+- Credential/permission/payment tools.
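
One way to encode the five tool classes above is an ordered enum with a cutoff for automatic approval. The risk ordering and the default cutoff are assumptions made for illustration; the source only says policy should distinguish the classes.

```python
from enum import IntEnum


class ToolRisk(IntEnum):
    """Tool classes from the list above, in an assumed order of increasing risk."""
    READ_ONLY = 1
    WORKSPACE_WRITE = 2
    EXTERNAL_WRITE = 3
    DESTRUCTIVE = 4
    CREDENTIAL_PERMISSION_PAYMENT = 5


def tool_decision(risk: ToolRisk,
                  auto_allow_up_to: ToolRisk = ToolRisk.WORKSPACE_WRITE) -> str:
    """Hypothetical policy: allow classes up to the cutoff, confirm everything above."""
    return "allow" if risk <= auto_allow_up_to else "require_confirmation"
```

A conservative pilot profile could lower the cutoff to `ToolRisk.READ_ONLY`, so that even workspace writes ask for confirmation.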
+
+### Skill Boundary
+
+Skills guide Agent behavior and tool use. Publishing a bad skill can create repeated bad behavior. Skill publishing therefore needs:
+
+- Candidate quality signal.
+- Safety report.
+- Eval/replay evidence where possible.
+- Human review.
+- Version rollback.
+
+### Memory Boundary
+
+Memory creates long-term product value but also trust risk. Productization should include:
+
+- Source.
+- Confidence.
+- Last used.
+- Edit/delete/freeze controls.
+- Task evidence showing when memory was used.
+
+## 7. Architecture Maturity
+
+| Area | Maturity | Notes |
+| --- | --- | --- |
+| Multi-instance deployment | Pilot-ready | Needs lifecycle and health automation |
+| Chat workbench | Pilot-ready | UI docs show tested states |
+| Task lifecycle | Strong | Core product loop exists |
+| Task evidence | Strong foundation | Needs narrative/summary layer |
+| Agent team | Functional | Needs product explanation and failure UX |
+| Files | Pilot-ready | UI docs show tested workflows |
+| Tools/MCP | Functional | Needs policy hardening and admin clarity |
+| Skills | Functional | Needs stronger quality gates and reuse metrics |
+| Memory | Backend foundation | Needs visible product controls |
+| Scheduled work | Basic product capability | Needs stability and clearer run handling |
+| Connectors | Mixed maturity | Need a pilot-safe connector list |
+| Operations | Basic | Needs health console, backup/restore, runbook |
+
+## 8. Architecture Risks
+
+| Risk | Severity | Mitigation |
+| --- | --- | --- |
+| Control-plane service exposed publicly | Critical | Deployment checks and docs; firewall/proxy validation |
+| Instance data leakage | Critical | Path isolation tests, authz tests, MinIO/user-files policy checks |
+| Tool side effects without review | High | Tool policy profiles, evidence logs, connector sandbox |
+| Provider misconfiguration blocks first use | High | Onboarding checks and settings diagnostics |
+| Product surface becomes hard to operate | High | Admin health console and staged pilot scope |
+| Memory trust gap | High | Memory control center before broad memory activation |
+| Skill quality drift | High | Safety/eval/replay and publish gates |
+
+## 9. Recommended Architecture Roadmap
+
+### Next 30 Days
+
+- Rehearse clean deployment and record missing steps.
+- Add pilot health checklist for auth portal, authz, deploy control, router, and app instance.
+- Define pilot-safe tools and connectors.
+- Add task evidence narrative summary.
+- Track accepted task, skill candidate, and skill reuse events.
+
+### Next 90 Days
+
+- Memory Control Center MVP.
+- Admin Health Console MVP.
+- Instance suspend/resume/backup/restore runbook or tooling.
+- Connector sandboxing and side-effect policy.
+- Skill replay/eval as part of skill governance.
+- Organization/team-level roadmap decision.
+
+## 10. Product Architecture Principle
+
+Beaver should keep its product architecture centered on controlled Agent work:
+
+```text
+private workspace
+  + task lifecycle
+  + tool/file execution
+  + evidence
+  + acceptance
+  + skill/memory reuse
+  + operational governance
+```
+
+New features should strengthen this loop. Features that do not improve execution, evidence, acceptance, reuse, or governance should be treated as secondary until the pilot motion is proven.
--- a/docs/product-discovery/beaver/product-discovery-report.md
+++ b/docs/product-discovery/beaver/product-discovery-report.md
@ -0,0 +1,494 @@
+# Beaver Product Discovery Report
+
+Date: 2026-06-09
+
+Product stage: existing product
+
+Scope: the whole Beaver product, including deployment, runtime, UI, Agent execution, tasks, files, tools, skills, memory, connectors, scheduled work, governance, validation, launch, and maintenance.
+
+## Executive Summary
+
+Beaver is an enterprise Agent sandbox and execution platform. Its product promise is to move AI from "chat that gives answers" to "controlled Agent work that creates deliverables, records evidence, asks for acceptance, and turns accepted work into reusable capability."
+
+The strongest product wedge is not another chatbot UI. It is the full execution loop:
+
+```text
+user request
+  -> task recognition
+  -> Agent/team execution
+  -> tool and file work
+  -> evidence timeline
+  -> user acceptance or revision
+  -> skill and memory learning
+  -> future reuse
+```
+
+The current codebase already supports major parts of this loop: multi-instance Docker deployment, auth portal, app instances, chat workbench, task center, task details, user acceptance, files, tools, skills, skill learning, marketplace, settings, connectors, scheduled jobs, and backend Agent team orchestration. The next product challenge is packaging these capabilities into a clear buyer story, validating the highest-value use cases, hardening operational reliability, and making governance understandable to non-engineer stakeholders.
+
+Recommended product strategy:
+
+1. Position Beaver as "enterprise Agent execution and governance," not as a general AI chat app.
+2. Focus first on repeatable knowledge work that is high-frequency, cross-tool, evidence-sensitive, and review-heavy.
+3. Treat task acceptance, evidence, skills, and memory as the core product loop.
+4. Productize deployment and operations enough for pilots before broad feature expansion.
+5. Validate value through real workflows, not opinions about AI.
+
+## Product Summary
+
+### Product Description
+
+Beaver is a private-deployable Agent workspace for teams that need AI to perform work, not only answer questions. A user can chat, upload files, trigger tasks, review execution evidence, accept or revise results, manage tools, install or publish skills, configure model providers, connect external systems, and run scheduled work.
+
+### Target Users
+
+| Segment | Primary Need | Why Beaver Fits |
+| --- | --- | --- |
+| Enterprise AI platform owner | Provide controlled Agent capability to teams | Private deployment, per-instance boundaries, tools, skills, governance |
+| Knowledge workflow team | Finish recurring multi-step work faster | Task execution, files, tools, acceptance, scheduled work |
+| Project / delivery team | Produce and revise deliverables with traceability | Task timeline, artifacts, evidence, revision loop |
+| Engineering / support team | Use AI with files, commands, logs, and review | Tool execution, task evidence, multi-agent planning |
+| Operations / admin | Configure models, users, connectors, and instances | Auth portal, deploy control, settings, status, logs |
+| Skill owner / reviewer | Turn successful work into reusable methods | Skill candidates, drafts, safety/eval reports, review, publish |
+
+### Current Feature Map
+
+| Domain | Current State | Product Meaning |
+| --- | --- | --- |
+| Auth and onboarding | Auth portal, register/login, model provider onboarding | Users can enter a controlled workspace |
+| Multi-instance deployment | Deploy control creates isolated app-instance containers; router proxy routes by host | Enables per-user or per-team sandboxing |
+| Chat workbench | Conversations, attachments, task cards, current task progress, acceptance controls | Main user workspace |
+| Task runtime | Auto task recognition, task creation, runs, timeline, status, acceptance | Converts chat into managed work |
+| Agent execution | Unified engine, main agent, sub-agent/team execution, sequence/parallel/DAG coordinator | Handles complex work beyond one response |
+| Tools | Built-in tools, MCP tools, tool management UI | Lets Agents act on files, web, terminal, integrations |
+| Files | Workspace file browser, upload, preview, download, delete | Gives AI and users a shared working surface |
+| Skills | Published skills, candidates, drafts, safety/eval, review, publish | Turns accepted work into reusable methods |
+| Marketplace | Skill discovery/install flow | Foundation for capability distribution |
+| Memory | Backend long-term memory foundation exists; product integration is still incomplete | Future compounding personalization and organization knowledge |
+| Scheduled work | Cron jobs, notifications, scheduled task flows | Moves from reactive chat to proactive work |
+| Connectors | Outlook and external connector architecture; Feishu/Weixin-related sidecar paths | Brings Agent into real business channels |
+| Settings/status/logs | Provider config, agent config, channel config, runtime status, restart | Admin control and troubleshooting |
+
+### Current Value Proposition
+
+For enterprise teams:
+
+> Beaver provides a private Agent workspace where AI work is executed, tracked, reviewed, and reused. It gives teams the speed of AI assistance with the control needed for real business workflows.
+
+For product pilots:
+
+> Beaver is strongest when a team has recurring knowledge work that crosses files, tools, systems, and reviews.
+
+### Current Challenges
+
+| Challenge | Why It Matters |
+| --- | --- |
+| Product breadth is large | Buyers may not understand what to adopt first |
+| Memory is partly backend-ready but not fully productized | The "gets smarter the more you use it" story needs visible control |
+| Connector maturity varies by channel | Customer demos must avoid overpromising |
+| Multi-instance deployment is powerful but operationally sensitive | Pilot success depends on stable setup and clear runbooks |
+| Skill learning needs strong governance | Reuse can become risk if publishing is weak |
+| Metrics are not yet productized | Hard to prove pilot value without baseline and target |
+| Customer research is not yet captured | Current roadmap is inferred from implementation and product judgment |
+
+## User Segments
+
+### Segment 1: Enterprise AI Platform Owner
+
+They need to safely introduce Agent capability into an organization. Their concern is not whether an LLM can answer a question; it is whether teams can use it without losing control of data, tools, cost, and quality.
+
+### Segment 2: Workflow Owner
+
+They own a recurring process such as weekly reporting, project status, proposal drafting, research, operations follow-up, support triage, or document review. They want less manual coordination and more repeatable output.
+
+### Segment 3: Individual Knowledge Worker
+
+They want one workspace where they can chat, upload files, run tools, generate artifacts, and continue a task until the output is usable.
+
+### Segment 4: Admin / Operator
+
+They need to create instances, configure models, monitor status, debug logs, manage connectors, and keep deployment safe.
+
+### Segment 5: Skill Maintainer
+
+They curate reusable skills, review drafts, evaluate safety, publish stable versions, and prevent low-quality automation from spreading.
+
+## JTBD
+
+| User | Job Story | Current Alternative | Beaver Outcome |
+| --- | --- | --- | --- |
+| Platform owner | When teams ask for AI tools, I want a controlled Agent workspace so they can experiment without unmanaged SaaS sprawl | ChatGPT accounts, custom scripts, internal demos | Private, governed Agent workspace |
+| Workflow owner | When a recurring process takes many manual steps, I want AI to execute and track it so my team can review the result | Manual docs, spreadsheets, Slack/email coordination | Task with timeline, artifacts, acceptance |
+| Knowledge worker | When I ask AI to produce something, I want to revise and accept it as work, not just receive a message | Chat thread and copy/paste | Task lifecycle and deliverable loop |
+| Admin | When a user registers, I want a workspace created and routed automatically so onboarding is repeatable | Manual container setup | Auth portal + deploy control + router proxy |
+| Skill maintainer | When a task succeeds, I want to turn its method into a reusable skill so future tasks improve | Prompt docs, tribal knowledge | Skill candidate/draft/review/publish |
+| Security reviewer | When Agents use tools, I want evidence and boundaries so I can audit behavior | Opaque model/tool calls | Tool traces, task evidence, instance sandbox |
+
+## Opportunity Areas
+
+Opportunity scores are judgment-based estimates drawn from current docs and product context, not measured data. They need validation with customer interviews and pilot data.
+
+| Opportunity | Importance | Current Satisfaction | Opportunity Score | Notes |
+| --- | ---: | ---: | ---: | --- |
+| I need AI outputs to become reviewable tasks, not loose chat replies | 0.95 | 0.30 | 0.67 | Core wedge |
+| I need evidence of what the Agent did | 0.90 | 0.35 | 0.59 | Governance driver |
+| I need repeatable workflows to become reusable skills | 0.85 | 0.40 | 0.51 | Learning moat |
+| I need private deployment and instance boundaries | 0.90 | 0.45 | 0.50 | Enterprise adoption |
+| I need AI to work across files, tools, and external systems | 0.85 | 0.45 | 0.47 | Workflow depth |
+| I need proactive scheduled work, not only reactive chat | 0.70 | 0.45 | 0.39 | Expansion opportunity |
+| I need memory that I can inspect and control | 0.80 | 0.25 | 0.60 | High future leverage |
+
+Top opportunities:
+
+1. Make AI work reviewable and acceptable.
+2. Make process evidence and governance visible.
+3. Turn accepted work into reusable skills and memory.
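+
+The scores in the table above are consistent with the common formula `Importance x (1 - Current Satisfaction)`, rounded half-up to two decimals. The document does not state its formula, so treat this as an inferred assumption. A minimal sketch that recomputes and ranks the rows (the short row labels are illustrative abbreviations):
+
+```python
+from decimal import Decimal, ROUND_HALF_UP
+
+
+def opportunity_score(importance: str, satisfaction: str) -> Decimal:
+    """Assumed formula: importance x (1 - satisfaction), rounded half-up."""
+    raw = Decimal(importance) * (Decimal("1") - Decimal(satisfaction))
+    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
+
+
+# (label, importance, current satisfaction) copied from the table above.
+ROWS = [
+    ("reviewable tasks", "0.95", "0.30"),
+    ("evidence of Agent actions", "0.90", "0.35"),
+    ("reusable skills", "0.85", "0.40"),
+    ("private deployment", "0.90", "0.45"),
+    ("files, tools, external systems", "0.85", "0.45"),
+    ("scheduled work", "0.70", "0.45"),
+    ("inspectable memory", "0.80", "0.25"),
+]
+
+# Highest score first.
+ranked = sorted(
+    ((label, opportunity_score(imp, sat)) for label, imp, sat in ROWS),
+    key=lambda pair: pair[1],
+    reverse=True,
+)
+
+for label, score in ranked:
+    print(f"{score}  {label}")
+```
+
+Under this reading, the memory row ranks second by score even though it appears last in the table.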
+
+## Product Positioning
+
+Recommended primary positioning:
+
+> Beaver is an enterprise Agent execution and governance platform for repeatable knowledge work.
+
+Supporting message:
+
+> It gives teams a private Agent sandbox where AI can use tools, manage files, execute tasks, record evidence, ask for acceptance, and learn reusable skills from approved work.
+
+Avoid positioning Beaver as:
+
+- A generic chatbot.
+- A pure model gateway.
+- A standalone RPA replacement.
+- A developer-only Agent framework.
+- A marketplace-only skill product.
+
+## Competitive Frame
+
+| Category | Strength | Gap Beaver Addresses |
+| --- | --- | --- |
+| AI chat apps | Fast answers and content generation | Weak task lifecycle, evidence, acceptance, and reuse |
+| RPA / automation | Repeatable process execution | Rigid flows, harder natural-language adaptation |
+| Agent frameworks | Developer flexibility | Missing complete user workspace and governance surface |
+| Internal scripts | Fast local automation | Poor product UX, auditability, onboarding, and scaling |
+| Enterprise AI platforms | Governance and admin | Often weak on task-level execution and skill learning loop |
+
+## Product Ideas
+
+Generated from PM, design, and engineering perspectives.
+
+### PM Ideas
+
+1. Pilot Workflow Templates: package 3-5 high-value workflows such as weekly report, project brief, support triage, document review.
+2. Team Workspace Mode: group multiple users under one organization workspace with shared skills and controlled memory.
+3. Governance Scorecard: show evidence coverage, accepted tasks, skill reuse, failed runs, and tool risk.
+4. Skill Quality Lifecycle: strengthen candidate -> draft -> safety -> eval -> review -> publish -> version rollback.
+5. ROI Dashboard: measure time saved, accepted tasks, revision rounds, reusable skill adoption.
+
+### Design Ideas
+
+1. Work Inbox: unify tasks, scheduled runs, notifications, and pending reviews.
+2. Task Evidence Narrative: convert raw events into a readable "what happened" timeline.
+3. Memory Control Center: show what Beaver remembers, why, source, confidence, and edit/delete controls.
+4. First-Run Product Tour: guide a new user from provider setup to first accepted task.
+5. Admin Health Console: one page for instance, provider, connector, queue, and runtime health.
+
+### Engineering Ideas
+
+1. Tenant/Workspace Policy Profiles: control allowed tools, connectors, memory behavior, and publish gates per deployment.
+2. Connector Sandbox Layer: test external channel actions without touching production systems.
+3. Unified Evidence Schema: normalize task, tool, artifact, skill, memory, and connector events.
+4. Replay-Based Skill Evaluation: evaluate skill drafts against historical accepted runs.
+5. Instance Lifecycle Automation: suspend, resume, backup, restore, rotate secrets, inspect health.
+
+Top 5 product ideas to pursue:
+
+| Rank | Idea | Why Selected | Assumptions |
+| ---: | --- | --- | --- |
+| 1 | Pilot Workflow Templates | Gives customers a concrete starting point | Initial buyers share common workflows |
+| 2 | Task Evidence Narrative | Makes governance understandable | Reviewers value readable evidence |
+| 3 | Memory Control Center | Unlocks long-term differentiation | Users trust memory if they can inspect/control it |
+| 4 | Governance Scorecard | Helps buyers justify adoption | Platform owners need measurable proof |
+| 5 | Instance Lifecycle Automation | Reduces pilot operational risk | Deployments will grow beyond a few instances |
+
+## Key Assumptions
+
+| Assumption | Category | Impact | Uncertainty |
+| --- | --- | ---: | ---: |
+| Enterprise teams feel enough pain with chat-only AI to adopt an Agent workspace | Value | High | Medium |
+| Task acceptance is a meaningful quality signal | Value | High | Medium |
+| Users will tolerate a task workflow instead of expecting instant chat only | Usability | High | Medium |
+| Per-instance deployment is operationally acceptable for early customers | Feasibility | High | Medium |
+| Workflow owners can identify repeatable tasks worth piloting | Value | High | Low |
+| Skill reuse creates visible productivity gains | Business Viability | High | High |
+| Memory control is required before customers trust long-term memory | Trust | High | Medium |
+| Connectors are necessary for customer stickiness | Value | Medium | Medium |
+| Admins can manage model provider configuration without heavy support | Usability | Medium | Medium |
+| The team can maintain broad product surface without quality drift | Team Capability | High | High |
+
+## Prioritized Assumptions
+
+### P0 Validate Immediately
+
+| Assumption | Why It Matters | What Could Go Wrong | Validation |
+| --- | --- | --- | --- |
+| Customers prefer task-based AI execution over chat-only for real work | Core product wedge | Users see tasks as overhead | Run 3 workflow pilots and compare chat-only vs task loop |
+| Evidence timeline increases trust | Governance story depends on it | Evidence is too technical or noisy | Reviewer usability test with task timelines |
+| Private multi-instance deployment is acceptable | Adoption depends on ops fit | Setup too fragile or expensive | Deploy pilot on fresh Linux host and measure time/errors |
+| Accepted tasks can generate reusable skills that users value | Learning loop depends on this | Skills are low quality or unused | Track reuse of skills from accepted pilot tasks |
+
+### P1 Important
+
+| Assumption | Why It Matters | Validation |
+| --- | --- | --- |
+| Memory control center is required before broad rollout | Trust and differentiation | Interview pilot admins and users |
+| Connectors drive retention | External systems make workflows real | Compare pilot workflows with and without Outlook/IM connectors |
+| Scheduled work creates high-value usage | Moves Beaver from reactive to proactive | Test weekly report and reminder workflows |
+| Marketplace/skill distribution is a buyer requirement | Scaling reuse across teams | Ask platform owners during procurement |
+
+### P2 Later
+
+| Assumption | Why It Matters | Validation |
+| --- | --- | --- |
+| Multi-user team workspace is required for first paid pilots | Could reshape architecture | Validate with buyer interviews |
+| Fine-grained per-tool policies are needed in UI | Admin complexity | Observe support requests |
+| Cross-instance organization analytics is needed early | Enterprise reporting | Validate after 2-3 pilots |
+
+## Opportunity Solution Tree
+
+Desired outcome:
+
+> Within 90 days, prove that a pilot team can complete repeatable AI-assisted work with acceptance, evidence, and reuse: at least 30 accepted tasks, 5 reusable skills, 2 recurring workflows, and 0 critical deployment/security incidents.
+
+```text
+Outcome: Trusted repeatable Agent work in pilot teams
+
+Opportunity 1: I need AI outputs to become reviewable deliverables.
+  Solution 1.1: Task lifecycle with acceptance and revision.
+    Experiment: Run a project brief workflow and measure accepted output rate.
+  Solution 1.2: Task details page with evidence narrative.
+    Experiment: Ask reviewers to reconstruct what happened from timeline.
+  Solution 1.3: Work Inbox for pending reviews and scheduled outputs.
+    Experiment: Fake-door navigation item and measure clicks/asks.
+
+Opportunity 2: I need confidence that Agent tool use is controlled.
+  Solution 2.1: Tool traces and artifact timeline.
+    Experiment: Security review of 5 real tasks.
+  Solution 2.2: Admin health and policy console.
+    Experiment: Operator performs setup/debug checklist on fresh instance.
+  Solution 2.3: Connector sandbox and side-effect journals.
+    Experiment: Test external send/reply flows in sandbox mode.
+
+Opportunity 3: I need successful work to become reusable.
+  Solution 3.1: Skill candidate -> draft -> review -> publish.
+    Experiment: Convert 5 accepted tasks into skills and track reuse.
+  Solution 3.2: Memory Control Center.
+    Experiment: Prototype memory review UI and test trust/comprehension.
+  Solution 3.3: Pilot workflow templates.
+    Experiment: Package 3 templates and measure first-task success rate.
+```
+
+## Validation Experiments
+
+| Assumption | Hypothesis | Experiment | Duration | Success Criteria |
+| --- | --- | --- | --- | --- |
+| Task loop beats chat-only | Users complete more usable work with task acceptance than plain chat | Same workflow performed in chat-only and Beaver task loop | 1 week | Beaver output accepted in fewer revision rounds |
+| Evidence creates trust | Reviewers can understand and audit what happened | Give 5 timelines to reviewers | 2 days | >=80% identify tools, artifacts, result, and risk |
+| Deployment is pilot-ready | Fresh host setup is repeatable | Deploy on clean Linux/WSL2 machine using docs | 1 day | Setup under 2 hours with no undocumented step |
+| Skills create reuse | Accepted tasks can become useful skills | Convert 5 pilot tasks into skills | 2 weeks | 3 skills reused at least twice |
+| Memory needs control UI | Users trust memory more with inspect/edit/delete | Clickable prototype or simple page | 3 days | >=80% say they would enable memory with controls |
+| Scheduled work matters | Recurring workflows create repeat usage | Weekly report or reminder pilot | 2-4 weeks | At least 2 recurring jobs run and produce accepted outputs |
+
+## Feature Prioritization
+
+### Must Have
+
+| Feature | Impact | Effort | Risk | Reason |
+| --- | --- | --- | --- | --- |
+| Auth portal and instance onboarding | High | High | Medium | Required for any user to start |
+| Provider configuration flow | High | Medium | Medium | Model access is prerequisite |
+| Chat workbench | High | High | Medium | Primary user surface |
+| Task lifecycle and acceptance | High | High | Medium | Core differentiation |
+| Task timeline/evidence | High | High | Medium | Governance and review |
+| Files workspace | High | Medium | Medium | Most real workflows need files |
+| Tool management | High | Medium | High | Agents need controlled action surface |
+| Skills review/publish | High | High | High | Reuse loop |
+| Settings/status/logs | High | Medium | Medium | Operational support |
+| Basic deployment guide/runbook | High | Medium | Medium | Pilot readiness |
+
+### Should Have
+
+| Feature | Impact | Effort | Risk | Reason |
+| --- | --- | --- | --- | --- |
+| Pilot workflow templates | High | Medium | Low | Creates adoption path |
+| Evidence narrative layer | High | Medium | Medium | Makes audit readable |
+| Memory Control Center | High | High | Medium | Unlocks long-term trust |
+| Skill replay/eval hardening | High | High | High | Makes learning safer |
+| Scheduled workflow polish | Medium | Medium | Medium | Supports proactive use cases |
+| Connector onboarding polish | Medium | High | High | Needed for real systems |
+| Admin health console | Medium | Medium | Medium | Reduces support load |
+
+### Could Have
+
+| Feature | Reason |
+| --- | --- |
+| Multi-user organization workspace | Valuable, but changes scope and permissions |
+| Cross-instance analytics | Useful after multiple deployments |
+| Fine-grained policy UI | Need policy demand before UI complexity |
+| Audit export | Strong sales support, not first pilot blocker |
+| Cost/quality model router | Useful after usage volume grows |
+
+### Not Yet
+
+| Feature | Reason |
+| --- | --- |
+| Broad public SaaS launch | Product and ops need pilot hardening first |
+| Fully autonomous publish of skills | Human review should remain mandatory |
+| Production writes through connectors without review | Trust risk |
+| Complex enterprise RBAC before pilot validation | May overbuild before segment clarity |
+
+## Metrics Dashboard
+
+### North Star Metric
+
+Accepted Agent Workflows:
+
+> Number of AI-assisted tasks or scheduled workflows accepted by users per active pilot team per week.
+
+Why this metric: it captures real delivered value better than messages sent, tokens used, or model calls.
+
+### Input Metrics
+
+| Metric | Definition | Target For Pilot |
+| --- | --- | --- |
+| Task Creation Rate | Tasks created / active users / week | Increasing weekly |
+| Acceptance Rate | Accepted task runs / completed task runs | >=60% in pilot |
+| Revision Rate | Runs needing revision / completed runs | Decreasing over time |
+| Evidence Coverage | Task runs with timeline/tool/artifact evidence / task runs | >=90% |
+| Skill Candidate Rate | Accepted tasks producing candidates / accepted tasks | >=20% after week 2 |
+| Skill Reuse Rate | Runs activating published pilot skills / task runs | >=15% after skills exist |
+| Scheduled Success Rate | Accepted scheduled outputs / scheduled runs | >=50% for selected workflows |
+| Deployment Success Time | Fresh deployment time to first working user | <2 hours for pilot |
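+
+The ratio metrics above can be derived from per-run records. The following is a minimal sketch, assuming a hypothetical `TaskRun` record with boolean flags; Beaver's actual run schema is not shown in this document, so every field name here is an assumption.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class TaskRun:
+    """Hypothetical run record; field names are illustrative only."""
+    completed: bool
+    accepted: bool
+    needed_revision: bool
+    has_evidence: bool
+
+
+def ratio(numerator: int, denominator: int) -> float:
+    """Safe division that returns 0.0 when there is nothing to count."""
+    return numerator / denominator if denominator else 0.0
+
+
+def pilot_metrics(runs: list[TaskRun]) -> dict[str, float]:
+    """Compute acceptance rate, revision rate, and evidence coverage."""
+    completed = [r for r in runs if r.completed]
+    return {
+        # Accepted task runs / completed task runs
+        "acceptance_rate": ratio(sum(r.accepted for r in completed), len(completed)),
+        # Runs needing revision / completed runs
+        "revision_rate": ratio(sum(r.needed_revision for r in completed), len(completed)),
+        # Runs with evidence / all task runs
+        "evidence_coverage": ratio(sum(r.has_evidence for r in runs), len(runs)),
+    }
+```
+
+Keeping the denominators explicit (completed runs versus all runs) avoids mixing the two populations when the numbers are later compared against the pilot targets.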
+
+### Guardrail Metrics
+
+| Metric | Alert |
+| --- | --- |
+| Critical tool/security incident | Any occurrence |
+| Instance creation failure rate | >10% in pilot |
+| Provider configuration failure rate | >20% |
+| Task run failure rate | >20% for 2 consecutive days |
+| Connector side-effect incident | Any unintended external write |
+| User file permission/storage incident | Any cross-user or cross-instance leak |
+| p95 task completion latency | Exceeds pilot workflow tolerance |
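+
+The "for 2 consecutive days" condition on task run failure rate is a streak check rather than a simple threshold. A minimal sketch, assuming daily failure rates are already available as fractions in chronological order (the function name and input shape are hypothetical):
+
+```python
+def breaches_failure_guardrail(
+    daily_failure_rates: list[float],
+    threshold: float = 0.20,
+    consecutive_days: int = 2,
+) -> bool:
+    """Return True if the rate exceeds the threshold on enough days in a row."""
+    streak = 0
+    for rate in daily_failure_rates:
+        # A day at or below the threshold resets the streak.
+        streak = streak + 1 if rate > threshold else 0
+        if streak >= consecutive_days:
+            return True
+    return False
+```
+
+A rate of exactly 20% does not count as a breach here, because the table states the alert as ">20%".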
+
+### Business Metrics
+
+- Pilot activation: teams reaching first accepted task.
+- Time to first accepted task.
+- Weekly active task users.
+- Repeated workflow count.
+- Skill reuse per team.
+- Customer-reported time saved.
+- Pilot conversion intent.
+
+## Customer Research Plan
+
+No direct interview transcripts were provided. Research should start immediately, before the roadmap is locked.
+
+### Participants
+
+- 5 knowledge workers with recurring document/report/research workflows.
+- 3 workflow owners or team leads.
+- 3 enterprise AI platform/admin stakeholders.
+- 2 security or IT reviewers.
+- 2 engineers/operators who would deploy and maintain Beaver.
+
+### Questions
+
+- What recurring work is painful enough to delegate to an Agent?
+- What would make an AI output "acceptable" instead of just "interesting"?
+- What evidence do you need to trust Agent work?
+- What systems must the Agent connect to for the workflow to matter?
+- What would make you stop a pilot?
+- What memory or reuse behavior feels helpful vs risky?
+- What does a successful 30-day pilot need to prove?
+
+## Interview Guide
+
+### Opening
+
+We are studying how teams move AI from chat into real work. We are not asking whether you like an idea. We want examples of work you recently did.
+
+### Current Behavior
+
+- Walk me through the last time you used AI for a real work deliverable.
+- What happened after the AI gave an answer?
+- What did you copy, edit, verify, or redo manually?
+- Who reviewed the result?
+
+### Pain
+
+- What was the slowest or most annoying part?
+- What made the output hard to trust?
+- What tools or files were involved?
+- What evidence did you need but did not have?
+
+### Reuse
+
+- Have you repeated a similar workflow since then?
+- Did you reuse prompts, templates, scripts, or notes?
+- What would make that reuse safe for a team?
+
+### Governance
+
+- What AI actions would need approval?
+- What data or tools should be off limits?
+- Who needs to see the history of what happened?
+
+### Pilot
+
+- Which one workflow would you test first?
+- What result would make you expand usage?
+- What failure would make you stop?
+
+## Recommended Next 30 Days
+
+1. Pick 2-3 pilot workflows: project brief, weekly report, document review, support triage, or file processing.
+2. Run fresh deployment rehearsal from README/deployment guide and record gaps.
+3. Define pilot metrics and instrument accepted tasks, revisions, skill candidates, skill reuse, and run failures.
+4. Create a task evidence narrative prototype on top of existing timeline data.
+5. Package pilot workflow templates as skills or documented demos.
+6. Validate provider onboarding with 3 non-engineer users.
+7. Run security review for file boundaries, tool execution, connectors, and deploy-control exposure.
+8. Decide which connector(s) are pilot-safe.
+
+## Recommended Next 90 Days
+
+1. Complete Memory Control Center MVP.
+2. Harden skill learning with replay/eval and publish gates.
+3. Add Admin Health Console for provider, instance, connector, task queue, and runtime status.
+4. Improve instance lifecycle: suspend, resume, backup, restore, rotate secrets.
+5. Add customer-facing pilot scorecard.
+6. Formalize tool/connector policy profiles.
+7. Expand pilot from one workflow to one department.
+8. Build audit export after evidence narrative stabilizes.
+
+## Biggest Risks
+
+| Risk | Severity | Mitigation |
+| --- | --- | --- |
+| Product appears too broad and hard to adopt | High | Lead with pilot workflows and task loop |
+| Deployment complexity blocks pilots | High | Rehearsed runbook, health checks, support checklist |
+| Agent actions cause unintended side effects | Critical | Conservative tool policy, explicit connector sandboxing, evidence logs |
+| Task evidence is too technical | High | Evidence narrative and reviewer testing |
+| Skill learning publishes poor methods | High | Human review, safety/eval, replay validation |
+| Memory feels creepy or uncontrollable | High | Memory control UI before broad enablement |
+| Team spreads effort across too many modules | High | Prioritize task loop, evidence, skills, deployment reliability |
+
+## Recommended Immediate Actions
+
+1. Reframe all main product docs around Beaver as an Agent execution and governance platform.
+2. Treat Skill Replay Eval as a subfeature under the skill governance loop.
+3. Build the next roadmap around pilot workflows, not isolated modules.
+4. Make accepted tasks the main success metric.
+5. Productize memory and evidence before adding many new connectors.
+6. Prove deployment repeatability before selling broad private deployments.
--- a/docs/product-discovery/beaver/product-prd.html
+++ b/docs/product-discovery/beaver/product-prd.html
--- a/docs/product-discovery/beaver/validation-plan.md
+++ b/docs/product-discovery/beaver/validation-plan.md
@ -0,0 +1,378 @@
+# Beaver Validation Plan
+
+Date: 2026-06-09
+
+Purpose: validate Beaver as a whole product before broader rollout.
+
+## 1. Validation Strategy
+
+Beaver should be validated through real workflows, not through opinions about AI.
+
+The validation sequence:
+
+```text
+customer problem
+  -> workflow fit
+  -> first-run onboarding
+  -> task execution
+  -> evidence comprehension
+  -> acceptance/revision
+  -> skill reuse
+  -> deployment and operations
+  -> security/governance
+```
+
+## 2. Validation Questions
+
+### Product Value
+
+- Does Beaver solve a painful enough workflow problem?
+- Does task acceptance make AI work feel more reliable?
+- Do users complete more usable work than with chat-only AI?
+- Does skill reuse save time after repeated workflows?
+
+### Usability
+
+- Can users understand when chat becomes a task?
+- Can users find task evidence and artifacts?
+- Can users accept, revise, or abandon without confusion?
+- Can admins configure providers and connectors without engineering help?
+
+### Technical Feasibility
+
+- Can fresh deployments be created repeatably?
+- Can app instances stay isolated?
+- Can Agent tasks run reliably with files, tools, skills, and scheduled jobs?
+- Can failures be diagnosed from status/logs/events?
+
+### Governance And Security
+
+- Are control-plane services private?
+- Are file and workspace boundaries enforced?
+- Are tool calls recorded and reviewable?
+- Are external connector writes controlled?
+- Is memory inspectable and controllable before broad use?
+
+### Business Viability
+
+- Does a pilot team have enough recurring workflows?
+- Can the product produce measurable weekly value?
+- Can an admin operate it with acceptable support load?
+- Can the buyer justify expansion?
+
+## 3. Pilot Workflow Candidates
+
+| Workflow | Why It Fits | Required Capabilities | Success Signal |
+| --- | --- | --- | --- |
+| Weekly project report | Recurring, evidence-sensitive, review-heavy | scheduled work, files, task acceptance, artifacts | Report accepted weekly |
+| Project brief / proposal | Multi-step, document-heavy, revision-heavy | chat, files, tools, task timeline, revisions | Brief accepted after fewer rounds |
+| Document review | Clear deliverable and evidence need | files, task timeline, artifacts, acceptance | Review output accepted |
+| Support triage | Tool/context-heavy and repeatable | tasks, tools, memory, maybe connector | Triage summary accepted |
+| Research synthesis | Agent team fit, artifact-heavy | multi-agent, web/search, files, evidence | Synthesis accepted and reused |
+
+Recommended first pilot:
+
+1. Project brief or document review for the manual task loop.
+2. Weekly project report for the scheduled workflow.
+3. Skill reuse from the accepted outputs.
+
+## 4. Customer Discovery Validation
+
+### Participants
+
+- 5 end users.
+- 3 workflow owners.
+- 3 admins/platform owners.
+- 2 security reviewers.
+- 2 operators/engineers.
+
+### Method
+
+- 45-minute interviews using past-behavior questions.
+- 60-minute workflow walkthrough with Beaver.
+- Follow-up after one week of usage.
+
+### Evidence To Collect
+
+- Current workflow steps.
+- Time spent today.
+- Existing tools/files/systems involved.
+- Review/approval requirements.
+- Trust blockers.
+- Repeat frequency.
+- What would count as a successful pilot.
+
+### Pass Criteria
+
+- At least 3 workflows are repeated weekly or more often.
+- At least 2 workflows involve files or external tools.
+- At least 2 stakeholders require evidence/auditability.
+- At least 1 team lead agrees to a real pilot workflow.
+
+## 5. Product Workflow Validation
+
+### Test 1: First Accepted Task
+
+Goal: user reaches first accepted task.
+
+Steps:
+
+1. Register or log in.
+2. Configure provider.
+3. Start from a suggested workflow or freeform chat.
+4. Upload or reference a file if needed.
+5. Let Beaver create/continue a task.
+6. Inspect output and evidence.
+7. Accept or request revision.
+
+Pass criteria:
+
+- User completes without developer assistance.
+- First accepted task occurs in one session.
+- User can explain what Beaver did.
+
+### Test 2: Revision Loop
+
+Goal: prove Beaver handles "not good enough yet."
+
+Steps:
+
+1. Run a task.
+2. Ask for a specific revision.
+3. Confirm the same task context continues.
+4. Accept revised output.
+
+Pass criteria:
+
+- Revision feedback is preserved.
+- Task timeline shows revision.
+- User does not need to restate full context.
+
+### Test 3: Evidence Review
+
+Goal: verify trust and auditability.
+
+Steps:
+
+1. Give reviewer a completed task detail page.
+2. Ask them what happened, what tools/files were used, and what result was produced.
+3. Ask whether they would approve the output.
+
+Pass criteria:
+
+- >=80% of reviewers identify the key actions and artifacts.
+- Reviewers can state at least one risk or confidence reason.
+
+### Test 4: Skill Reuse
+
+Goal: prove accepted work can compound.
+
+Steps:
+
+1. Accept a task.
+2. Generate skill candidate/draft.
+3. Review and publish skill.
+4. Run a similar task.
+5. Check whether skill activates and improves work.
+
+Pass criteria:
+
+- At least 3 pilot skills are reused at least twice.
+- Users report lower effort on repeated task.
+
+### Test 5: Scheduled Workflow
+
+Goal: validate proactive work.
+
+Steps:
+
+1. Create scheduled job.
+2. Trigger or wait for scheduled run.
+3. Review notification/output.
+4. Accept or revise.
+
+Pass criteria:
+
+- Scheduled run is visible.
+- Output enters review flow.
+- Failed run has clear recovery path.
+
+## 6. Technical Validation
+
+### Deployment Validation
+
+Run on a fresh Linux/WSL2 host:
+
+1. Build images.
+2. Create Docker network.
+3. Start router proxy.
+4. Start authz service.
+5. Start deploy control.
+6. Start auth portal.
+7. Register user.
+8. Configure provider.
+9. Open app instance.
+10. Complete first task.
+
+Pass criteria:
+
+- Under 2 hours with docs only.
+- No undocumented environment variables.
+- Public exposure limited to auth portal and router proxy.
+
+### Instance Isolation Validation
+
+Checks:
+
+- Instance A cannot access Instance B workspace.
+- User file roots stay scoped.
+- Router routes each host to the correct container.
+- Provider config is instance-specific.
+- Deleting one instance does not affect another.
+
+Pass criteria:
+
+- No cross-instance reads/writes.
+- Registry state remains consistent.
+
+### Runtime Validation
+
+Checks:
+
+- Chat API.
+- WebSocket/runtime status.
+- Task creation and deletion.
+- Task detail events.
+- File upload/preview/download/delete.
+- Tool test.
+- Skill candidate/draft/review/publish.
+- Cron create/toggle/run/delete.
+- Settings provider save.
+- Runtime restart.
+
+Pass criteria:
+
+- Critical user flows pass on desktop and mobile viewport.
+- Failure states have visible recovery.
+
+## 7. Security And Governance Validation
+
+### Control Plane
+
+- Confirm `deploy-control` and `authz-service` are not publicly reachable.
+- Confirm tokens are required for control-plane calls.
+- Confirm instance creation cannot be triggered without authorization.
+
+### Files
+
+- Confirm only allowed user roots are visible.
+- Confirm absolute-style or cross-prefix paths are rejected.
+- Confirm delete operations require explicit user action.
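+
+The path-rejection check above can be illustrated with a purely lexical validator. This is a sketch under stated assumptions, not Beaver's implementation: `resolve_user_path` and the example root are hypothetical, and a real check would also need to handle symlinks on the actual filesystem.
+
+```python
+from pathlib import PurePosixPath
+
+
+def resolve_user_path(user_root: str, requested: str) -> PurePosixPath:
+    """Join a requested relative path onto the user root, or raise ValueError."""
+    candidate = PurePosixPath(requested)
+    if candidate.is_absolute():
+        raise ValueError("absolute paths are not allowed")
+    # PurePosixPath keeps ".." segments, so they can be detected directly.
+    if ".." in candidate.parts:
+        raise ValueError("parent-directory segments are not allowed")
+    return PurePosixPath(user_root) / candidate
+```
+
+Rejecting `..` outright, instead of normalizing it, keeps the rule simple to audit: any accepted path is the user root followed by plain segments.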
+
+### Tools
+
+- Classify pilot tools as read, workspace write, external write, destructive, or credential/permission.
+- Record tool calls in task evidence.
+- Block or require review for dangerous actions.
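+
+One way to make the classification above actionable is a small policy table keyed by risk class. The sketch below is illustrative only: the class names mirror the list above, but the decision assigned to each class, the `decide` helper, and the evidence record shape are assumptions rather than Beaver's actual policy.
+
+```python
+from enum import Enum
+
+
+class ToolRisk(Enum):
+    READ = "read"
+    WORKSPACE_WRITE = "workspace_write"
+    EXTERNAL_WRITE = "external_write"
+    DESTRUCTIVE = "destructive"
+    CREDENTIAL = "credential_permission"
+
+
+class Decision(Enum):
+    ALLOW = "allow"
+    REQUIRE_REVIEW = "require_review"
+    BLOCK = "block"
+
+
+# Assumed conservative pilot policy; adjust per deployment.
+PILOT_POLICY = {
+    ToolRisk.READ: Decision.ALLOW,
+    ToolRisk.WORKSPACE_WRITE: Decision.ALLOW,
+    ToolRisk.EXTERNAL_WRITE: Decision.REQUIRE_REVIEW,
+    ToolRisk.DESTRUCTIVE: Decision.BLOCK,
+    ToolRisk.CREDENTIAL: Decision.BLOCK,
+}
+
+
+def decide(tool_name: str, risk: ToolRisk, evidence_log: list[dict]) -> Decision:
+    """Look up the decision and record the call in the task evidence log."""
+    decision = PILOT_POLICY.get(risk, Decision.BLOCK)  # unknown risk: block
+    evidence_log.append(
+        {"tool": tool_name, "risk": risk.value, "decision": decision.value}
+    )
+    return decision
+```
+
+Recording the decision alongside the call means blocked and review-gated attempts show up in task evidence, not only the calls that ran.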
+
+### Connectors
+
+- Use sandbox/test accounts for pilot when possible.
+- Confirm callback base URL is per-instance.
+- Confirm disconnect/reconnect path.
+
+### Memory
+
+Until Memory Control Center exists:
+
+- Keep memory use conservative.
+- Document what is stored.
+- Avoid enabling opaque long-term memory for sensitive pilots.
+
+## 8. Usability Validation
+
+Viewports:
+
+- 320px.
+- 375px.
+- 390px.
+- 768px.
+- 1024px.
+- 1365px.
+- One mobile landscape viewport.
+
+Screens:
+
+- Auth portal login/register/provider onboarding.
+- Chat workbench.
+- Task list/detail.
+- Files.
+- Skills.
+- Marketplace.
+- Tools.
+- Notifications/cron.
+- Outlook/connectors if in pilot.
+- Settings/status/logs.
+
+Pass criteria:
+
+- No horizontal overflow.
+- No inaccessible critical controls.
+- Touch targets are usable.
+- Loading, empty, error, success, and disabled states are visible.
+
+## 9. Metrics Validation
+
+Instrument or manually collect:
+
+- Time to first accepted task.
+- Accepted tasks per user/team/week.
+- Acceptance rate.
+- Revision rate.
+- Task run failure rate.
+- Evidence coverage.
+- Skill candidates.
+- Skill drafts.
+- Published skills.
+- Skill reuse.
+- Scheduled run success.
+- Provider setup failure.
+- Instance creation failure.
+- Connector setup failure.
+
+Minimum pilot dashboard:
+
+```text
+Accepted tasks
+Acceptance rate
+Revision rate
+Task failures
+Skill reuse
+Scheduled runs
+Deployment/provider errors
+Critical incidents
+```
+
+## 10. Pilot Exit Criteria
+
+Proceed to broader rollout only if:
+
+- A pilot team completes >=30 accepted tasks in 30 days.
+- At least 2 recurring workflows are active.
+- At least 5 skills are created, and at least 3 of them are reused at least twice.
+- Task acceptance rate is >=60%.
+- No critical security or deployment incidents occur.
+- Fresh deployment can be repeated from docs.
+- Admin can diagnose common failures from status/logs/runbook.
+- Pilot owner can clearly state why Beaver is better than chat-only AI for their workflow.
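+
+The quantitative criteria above can be checked mechanically, while the qualitative ones (repeatable deployment from docs, admin diagnosis, the pilot owner's statement) still need human sign-off. A minimal sketch, assuming a hypothetical `PilotSnapshot` of counters collected during the pilot:
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class PilotSnapshot:
+    """Hypothetical pilot counters; field names are illustrative only."""
+    accepted_tasks_30d: int
+    active_recurring_workflows: int
+    skills_created: int
+    skills_reused_twice: int
+    acceptance_rate: float
+    critical_incidents: int
+
+
+def unmet_exit_criteria(p: PilotSnapshot) -> list[str]:
+    """Return the quantitative exit criteria that are not yet met."""
+    checks = {
+        "accepted_tasks>=30": p.accepted_tasks_30d >= 30,
+        "recurring_workflows>=2": p.active_recurring_workflows >= 2,
+        "skills_created>=5": p.skills_created >= 5,
+        "skills_reused_twice>=3": p.skills_reused_twice >= 3,
+        "acceptance_rate>=0.60": p.acceptance_rate >= 0.60,
+        "critical_incidents==0": p.critical_incidents == 0,
+    }
+    return [name for name, ok in checks.items() if not ok]
+```
+
+An empty list means only the qualitative criteria remain to be reviewed.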
+
+## 11. Decision Matrix
+
+| Result | Decision |
+| --- | --- |
+| High task acceptance, low skill reuse | Improve skill learning and workflow templates |
+| High interest, deployment friction | Invest in deploy runbook and health console |
+| Good demos, low recurring use | Revisit target segment and workflow selection |
+| High usage, trust concerns | Prioritize evidence narrative, policy, memory controls |
+| Connector demand dominates | Narrow connector roadmap to one high-value system |
+| Memory concerns dominate | Build Memory Control Center before expansion |
--- a/docs/product-discovery/skill-replay-eval/PRD-skill-replay-eval.md
+++ b/docs/product-discovery/skill-replay-eval/PRD-skill-replay-eval.md
@ -0,0 +1,387 @@
+# PRD: Skill Replay Eval
+
+Date: 2026-06-09
+
+Status: Product discovery complete; implementation validation required
+
+## 1. Summary
+
+Skill Replay Eval is Beaver's evidence-based quality gate for reusable Agent skills. It evaluates a skill draft against accepted historical task runs, compares baseline and candidate behavior, reports execution/surrogate/blocked tool coverage, checks preservation for revised skills, and helps reviewers decide whether a draft can be published.
+
+The goal is not to replace human review. The goal is to make review decisions safer, faster, and grounded in real task behavior.
+
+## 2. Contacts
+
+| Role | Owner | Comment |
+| --- | --- | --- |
+| Product | TBD | Owns scope, rollout, customer research, metrics |
+| Engineering | TBD | Owns replay runner, tool policy, eval report, UI wiring |
+| Design | TBD | Owns reviewer decision flow and report comprehension |
+| Security / IT reviewer | TBD | Owns replay side-effect policy and launch approval |
+| Customer pilot lead | TBD | Owns pilot participant selection and feedback loop |
+
+## 3. Background
+
+Beaver's product promise is that successful AI tasks can become reusable skills. This is valuable only if skill publishing is trustworthy. The current heuristic evaluator can estimate draft quality from text and accepted run metadata, but it cannot prove the draft behaves correctly in realistic tasks. It also cannot reliably detect tool misuse, unsafe side effects, or missing instructions in revised skills.
+
+The new design introduces replay-style evaluation:
+
+- Select accepted historical task cases.
+- Run a baseline arm and a candidate arm.
+- Execute safe tools in a replay context.
+- Record unsafe or unavailable tools for surrogate judgment.
+- Block destructive actions.
+- Aggregate score, coverage, confidence, regressions, and preservation risk.
+- Show the report in the Skills review page.
+- Use publish gates to prevent low-confidence or unsafe releases.
+
+Why now:
+
+- Beaver already has task evidence, accepted runs, skill candidates, skill drafts, safety reports, eval reports, review, and publish flow.
+- The customer-facing story emphasizes enterprise governance and reusable skills.
+- Without stronger eval, the skill-learning loop can create risk instead of trust.
+
+## 4. Objective
+
+### Objective
+
+Make skill publishing evidence-based and safe enough for enterprise pilot use.
+
+### Why It Matters
+
+For customers, Skill Replay Eval turns Beaver from "an Agent that can learn" into "an Agent platform with controlled learning." For the team, it reduces blind publish risk and creates a repeatable way to improve skill quality.
+
+### Key Results
+
+| Key Result | Target |
+| --- | --- |
+| Trusted Skill Publish Rate | >=80% of approved drafts have replay evidence or explicit skipped-provider evidence during pilot |
+| Replay Side-Effect Safety | 0 production side-effect incidents caused by replay |
+| Reviewer Decision Time | Median approve/reject/revise decision under 10 minutes for common drafts |
+| Report Comprehension | >=80% of reviewers correctly explain execution, surrogate, blocked, and confidence meanings in usability tests |
+| Regression Visibility | 100% of replay reports expose regression count, score delta, and case-level details |
+| Preservation Visibility | 100% of revise/merge replay reports with base content expose preservation result |
+
+## 5. Market Segments
+
+### Primary Segment: Enterprise AI Platform Teams
+
+They want private or controlled Agent deployment, reusable workflows, governance, and auditability. They need evidence before reusable skills are distributed.
+
+### Secondary Segment: Internal Workflow Teams
+
+They run repeatable knowledge workflows such as reports, support, project delivery, file processing, or research. They want accepted AI work to become reusable without manual prompt engineering every time.
+
+### Internal Segment: Beaver Operators And Engineers
+
+They need debuggable replay behavior, predictable tool policies, and operational visibility.
+
+### Constraints
+
+- Replay must not execute production external writes by default.
+- Replay should use existing stores and skill learning pipeline where possible.
+- Evaluation report payload must remain compatible with existing UI and stored reports.
+- First release should cap replay case count to control latency and cost.
+- Human review remains mandatory.
+
+## 6. Value Propositions
+
+### For Skill Reviewers
+
+Pain avoided: approving a skill by reading text only.
+
+Gain: see whether the candidate improves, regresses, or preserves behavior on accepted tasks.
+
+### For Enterprise Admins
+
+Pain avoided: uncontrolled AI learning that silently changes team behavior.
+
+Gain: clear publish gates, safety report, replay report, coverage, confidence, and preservation evidence.
+
+### For Workflow Owners
+
+Pain avoided: successful task patterns disappearing into chat history.
+
+Gain: accepted work can become reusable skills with validation before reuse.
+
+### For Engineers
+
+Pain avoided: debugging vague "skill quality" complaints.
+
+Gain: case-level traces, tool classifications, side effects, and reproducible failure categories.
+
+## 7. Solution
+
+### 7.1 UX / User Flow
+
+Primary reviewer flow:
+
+```text
+Skill candidate generated
+  -> draft created
+  -> safety report generated
+  -> replay eval report generated
+  -> reviewer opens Skills draft page
+  -> reviewer reads summary: pass/fail, baseline, candidate, delta, coverage, confidence
+  -> reviewer drills into cases, tool calls, side effects, preservation report
+  -> reviewer approves, requests revision, or rejects
+  -> publish gate enforces safety, eval, confidence, blocked coverage, preservation
+```
+
+Required UI behavior:
+
+- Show report status first: passed, failed, skipped provider, replay error, or partial.
+- Show baseline average, candidate average, and score delta.
+- Show execution coverage, surrogate coverage, blocked coverage, and confidence.
+- Show improved, regressed, and unchanged case counts.
+- Show replay cases in a compact table.
+- Show raw case reports only after the summary.
+- Show preservation report for revise/merge drafts.
+- Word skipped-provider reports plainly: state that no replay was run and no replay evidence exists.
+
+Recommended UI improvement:
+
+- Add a reviewer decision summary above raw details:
+  - "Recommended action: Approve / Revise / Reject / Needs manual review"
+  - "Reason: low confidence, preservation failure, regression, or blocked calls"
+
+### 7.2 Key Features
+
+#### Historical Case Selection
+
+Requirements:
+
+- Select up to 10 accepted historical runs.
+- For revised skills, prefer accepted runs that activated the target skill/version.
+- For new skills, use candidate source runs or similar task themes.
+- For merged skills, use accepted runs where related skills co-activated.
+- Prefer recent accepted runs and diversify repeated tasks.
+
+Acceptance criteria:
+
+- Case selection returns no more than 10 cases.
+- Failed or unaccepted runs are excluded.
+- Baseline skill names are populated for revise and merge candidates.
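+
+The selection rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the shipped selector: the `Run` record and its fields (`accepted`, `finished_at`, `task_text`) are hypothetical, and deduplicating by normalized task text is only one assumed reading of "diversify repeated tasks". Candidate-type preferences (revise, new, merge) are left out.
+
+```python
+from dataclasses import dataclass
+
+MAX_REPLAY_CASES = 10  # cap stated in the requirements above
+
+
+@dataclass
+class Run:
+    # Hypothetical record shape; field names are assumptions for illustration.
+    run_id: str
+    task_text: str
+    accepted: bool
+    finished_at: float  # Unix timestamp of completion
+
+
+def select_cases(runs: list[Run], limit: int = MAX_REPLAY_CASES) -> list[Run]:
+    """Pick recent accepted runs, skipping repeats of the same task text."""
+    # Failed or unaccepted runs are excluded up front.
+    accepted = [run for run in runs if run.accepted]
+    # Prefer recent runs: newest first.
+    accepted.sort(key=lambda run: run.finished_at, reverse=True)
+    seen: set[str] = set()
+    selected: list[Run] = []
+    for run in accepted:
+        if len(selected) >= limit:
+            break
+        # Normalize case and whitespace so near-identical tasks count once.
+        key = " ".join(run.task_text.lower().split())
+        if key in seen:
+            continue
+        seen.add(key)
+        selected.append(run)
+    return selected
+```
+
+Keeping only the newest run per task text is a deliberate simplification; a real selector might keep several runs of a frequent task when too few distinct tasks exist.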
+
+#### Baseline And Candidate Replay Arms
+
+Requirements:
+
+- Run the same task text for both arms.
+- Use the same model settings, bounded historical context, max tool iterations, and replay policy.
+- Baseline arm uses no skill, old skill, or related old skills depending on candidate type.
+- Candidate arm injects the draft as pinned draft guidance.
+
+Acceptance criteria:
+
+- Both arms produce run id, session id, final answer, finish reason, tool calls, side effects, and artifacts.
+- Replay runs are marked with source `skill_replay_eval`.
+- Replay does not create user-visible normal task sessions.
+
+#### Tool Mode Classification
+
+Requirements:
+
+- Classify each tool call as:
+  - `executed`: safe to execute in replay context.
+  - `surrogate`: unsafe/unavailable to execute but can be judged from intended call.
+  - `blocked`: cannot safely execute or judge.
+- Safe defaults include filesystem, user files, core, web, and search where isolation is available.
+- External writes and connector/MCP write actions default to surrogate.
+- Destructive operations default to blocked.
+
+Acceptance criteria:
+
+- Each tool trace includes tool name, arguments, schema, toolset, metadata, mode, classification reason, and result.
+- Tool calls matching destructive or high-risk terms such as delete/remove/destroy/revoke/permission/credential/payment/pay are blocked.
+- Tool calls matching external-write terms such as send/post/publish/create/update/invite/reply/forward are not executed against production systems by default.
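+
+A minimal sketch of the three-way classification, assuming term matching on the tool name alone. The function name, the toolset identifiers, and the fallback to `surrogate` for toolsets outside the safe list are assumptions for illustration; the real `ReplayToolPolicy` may also inspect arguments, schema, and metadata.
+
+```python
+import re
+
+# Term lists taken from the acceptance criteria in this section.
+BLOCKED_TERMS = {"delete", "remove", "destroy", "revoke",
+                 "permission", "credential", "payment", "pay"}
+EXTERNAL_WRITE_TERMS = {"send", "post", "publish", "create",
+                        "update", "invite", "reply", "forward"}
+# Assumed identifiers for the safe-default toolsets named in the requirements.
+SAFE_TOOLSETS = {"filesystem", "user_files", "core", "web", "search"}
+
+
+def classify_tool_call(tool_name: str, toolset: str) -> tuple[str, str]:
+    """Return (mode, classification_reason) for a single tool call."""
+    # Split names such as "send_email" into lowercase word tokens.
+    tokens = set(re.split(r"[^a-z]+", tool_name.lower()))
+    blocked_hits = sorted(tokens & BLOCKED_TERMS)
+    if blocked_hits:
+        # Blocked terms win regardless of toolset.
+        return "blocked", f"matched blocked term '{blocked_hits[0]}'"
+    if toolset in SAFE_TOOLSETS:
+        return "executed", f"toolset '{toolset}' is safe in the replay context"
+    write_hits = sorted(tokens & EXTERNAL_WRITE_TERMS)
+    if write_hits:
+        return "surrogate", f"matched external-write term '{write_hits[0]}'"
+    # Assumption: when safety is unclear, prefer surrogate over execution.
+    return "surrogate", f"toolset '{toolset}' is not in the safe list"
+```
+
+For example, `classify_tool_call("send_email", "email")` yields `surrogate`, while `classify_tool_call("delete_file", "filesystem")` yields `blocked` even though the toolset is otherwise safe. Name-only matching would let a write such as an HTTP POST through a safe toolset, which is why argument and metadata checks matter in practice.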
+
+#### Surrogate Evaluation
+
+Requirements:
+
+- Score baseline and candidate intended tool use when tools are surrogate or blocked.
+- Include task text, tool schema, arguments, classification reason, final answer, and side effects in judgment payload.
+- Lower confidence when surrogate or blocked coverage is high.
+
+Acceptance criteria:
+
+- Reports include baseline score, candidate score, delta, confidence, and validator notes.
+- Blocked calls reduce score and confidence.
+- Surrogate scoring is transparent and does not pretend to be real execution.
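+
+Coverage can be read as the share of tool calls in each mode. The sketch below assumes coverage is computed over all tool traces in a report and that a report with no tool calls has zero coverage in every mode; both points are assumptions, as is the function name. It does not derive confidence, because the mapping from coverage to confidence levels is still an open question.
+
+```python
+from collections import Counter
+
+MODES = ("executed", "surrogate", "blocked")
+
+
+def summarize_tool_modes(modes: list[str]) -> dict:
+    """Aggregate per-call modes into counts and coverage shares."""
+    counts = Counter(modes)
+    total = sum(counts[mode] for mode in MODES)
+
+    def share(mode: str) -> float:
+        # Assumption: no tool calls means 0.0 coverage rather than an error.
+        return counts[mode] / total if total else 0.0
+
+    return {
+        "tool_mode_summary": {mode: counts[mode] for mode in MODES},
+        "execution_coverage": share("executed"),
+        "surrogate_coverage": share("surrogate"),
+        "blocked_coverage": share("blocked"),
+    }
+```
+
+With two executed calls, one surrogate call, and one blocked call, the three coverage values are 0.5, 0.25, and 0.25.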
+
+#### Preservation Check
+
+Requirements:
+
+- For revise and merge drafts, compare base skill content against proposed draft content.
+- Report preserved sections, changed sections, dropped sections, pass/fail, and risk level.
+- Failed preservation blocks publish.
+
+Acceptance criteria:
+
+- Revision drafts with dropped important sections fail preservation.
+- Reports are visible in the Skills UI.
+- Publish gate blocks failed preservation.
+
+#### Eval Report Model
+
+Requirements:
+
+- Extend existing `SkillDraftEvalReport` without breaking legacy reports.
+- Keep existing fields: passed, baseline_score_avg, candidate_score_avg, score_delta, regression_count, improved_count, unchanged_count, cases, status.
+- Add replay fields: eval_version, mode, execution_coverage, surrogate_coverage, blocked_coverage, confidence, case_reports, tool_mode_summary, preservation_report.
+
+Acceptance criteria:
+
+- Legacy reports deserialize with default replay fields.
+- New reports serialize all replay fields.
+- Frontend type definitions include replay fields.
+
+#### Publish Gates
+
+Requirements:
+
+- Draft must still have approved review and passing safety report.
+- Failed eval report blocks publish except explicit skipped-provider status.
+- Replay report with low confidence blocks publish.
+- Replay report with blocked coverage of 1.0 (every tool call blocked) blocks publish.
+- Failed preservation blocks publish.
+
+Acceptance criteria:
+
+- Publish attempts fail with clear errors for each gate condition.
+- Skipped provider is visible and does not silently claim replay passed.
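+
+The gate conditions above could be collected into one check that returns every failing reason, so a publish attempt can report clear errors. This is a sketch only: the function name, the dictionary-shaped report, the status string `skipped_provider_unavailable`, and the `passed` key inside the preservation report are assumptions, not the pipeline's actual interface.
+
+```python
+from typing import Any
+
+SKIPPED_PROVIDER = "skipped_provider_unavailable"  # assumed status value
+
+
+def publish_gate_errors(review_approved: bool, safety_passed: bool,
+                        report: dict[str, Any]) -> list[str]:
+    """Return one message per failed gate; an empty list means publish may proceed."""
+    errors: list[str] = []
+    if not review_approved:
+        errors.append("approved review is required")
+    if not safety_passed:
+        errors.append("safety report is missing or failed")
+    # A failed eval blocks publish unless the provider was explicitly skipped.
+    skipped = report.get("status") == SKIPPED_PROVIDER
+    if not skipped and not report.get("passed", False):
+        errors.append("eval report failed")
+    # Confidence and blocked-coverage gates apply to replay reports only.
+    if report.get("mode") == "replay":
+        if report.get("confidence") == "low":
+            errors.append("replay confidence is low")
+        if report.get("blocked_coverage", 0.0) >= 1.0:
+            errors.append("replay blocked coverage is 1.0")
+    preservation = report.get("preservation_report")
+    if preservation is not None and not preservation.get("passed", True):
+        errors.append("preservation check failed")
+    return errors
+```
+
+Returning all failures rather than stopping at the first lets the reviewer see the full picture in one attempt. A skipped-provider report passes the eval gate here but still carries its status, so the UI can state that no replay was run.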
+
+### 7.3 Technology
+
+Backend:
+
+- Python dataclasses.
+- Existing file-backed memory stores.
+- `SkillLearningPipelineService.evaluate_draft()`.
+- `SkillDraftEvaluator`.
+- `ReplayRunner`, `ReplayToolExecutor`, `ReplayToolPolicy`.
+- `SurrogateToolEvaluator`.
+- `SkillDraftEvalReport`.
+- FastAPI endpoint wiring through existing Skills APIs.
+
+Frontend:
+
+- Next.js / TypeScript Skills page.
+- Existing design system and report card patterns.
+- Typed replay report fields in `types/index.ts`.
+
+Testing:
+
+- Unit tests for eval report compatibility.
+- Case selection tests.
+- Preservation tests.
+- Replay executor and replay runner tests.
+- Agent loop replay executor override tests.
+- Surrogate scoring tests.
+- Pipeline publish gate tests.
+- Frontend smoke/manual review for report rendering.
+
+### 7.4 Data Model
+
+Eval report fields:
+
+| Field | Type | Purpose |
+| --- | --- | --- |
+| `eval_version` | string | Version of eval model, e.g. `replay-v1` |
+| `mode` | string | `heuristic` or `replay` |
+| `execution_coverage` | number | Share of replay tool calls actually executed |
+| `surrogate_coverage` | number | Share judged through surrogate |
+| `blocked_coverage` | number | Share blocked |
+| `confidence` | string | `low`, `medium`, or `high` |
+| `case_reports` | array | Detailed baseline/candidate case reports |
+| `tool_mode_summary` | object | Aggregate tool mode counts |
+| `preservation_report` | object/null | Preservation result for revise/merge |
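+
+One way to keep legacy reports loadable is to give every replay field a default. The sketch below is not the actual `SkillDraftEvalReport`; the class name `EvalReportSketch`, the default values, and the `from_dict` helper are assumptions that only illustrate the compatibility idea.
+
+```python
+from dataclasses import asdict, dataclass, field, fields
+from typing import Any, Optional
+
+
+@dataclass
+class EvalReportSketch:
+    # Existing fields kept from the legacy report.
+    passed: bool = False
+    baseline_score_avg: float = 0.0
+    candidate_score_avg: float = 0.0
+    score_delta: float = 0.0
+    regression_count: int = 0
+    improved_count: int = 0
+    unchanged_count: int = 0
+    cases: list = field(default_factory=list)
+    status: str = ""
+    # Replay fields; the defaults are assumed values describing a legacy report.
+    eval_version: str = ""
+    mode: str = "heuristic"
+    execution_coverage: float = 0.0
+    surrogate_coverage: float = 0.0
+    blocked_coverage: float = 0.0
+    confidence: str = "low"
+    case_reports: list = field(default_factory=list)
+    tool_mode_summary: dict = field(default_factory=dict)
+    preservation_report: Optional[dict] = None
+
+    @classmethod
+    def from_dict(cls, payload: dict[str, Any]) -> "EvalReportSketch":
+        # Ignore unknown keys so old and new payloads both load.
+        known = {f.name for f in fields(cls)}
+        return cls(**{k: v for k, v in payload.items() if k in known})
+```
+
+A legacy payload such as `{"passed": True, "score_delta": 0.1}` loads with `mode == "heuristic"` and empty replay details, while `asdict()` on a new report serializes all replay fields.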
+
+### 7.5 Assumptions
+
+- Accepted historical runs exist and are useful.
+- Replay can be isolated enough for safe tool execution.
+- Reviewers understand and trust the report after UI iteration.
+- Surrogate scoring can be improved over time without blocking v1.
+- Publish gates can be calibrated during pilot.
+
+### 7.6 Non-Goals
+
+- No production third-party writes during automatic replay.
+- No automatic publishing based only on replay score.
+- No full Docker orchestration per replay case in v1.
+- No customer-configurable per-tool policy UI in v1.
+- No replacement of human review.
+- No claim that replay is a complete benchmark of all future tasks.
+
+## 8. Release
+
+### V0: Internal Validation
+
+Scope:
+
+- Current replay report fields.
+- Current case selection.
+- Current replay runner integration.
+- Current tool policy.
+- Current Skills UI report display.
+- Current publish gates.
+
+Exit criteria:
+
+- Unit tests pass for skill learning replay surface.
+- Golden tool policy tests prove no production side effects.
+- Reviewer can make decisions from 5 seeded cases.
+- Known limitations are documented.
+
+### V1: Pilot Release
+
+Scope:
+
+- Reviewer decision summary.
+- Replay readiness indicator.
+- Better preservation diff.
+- Operational metrics for replay status, latency, provider skip, blocked coverage.
+- Customer-facing explanation for replay evidence and confidence.
+
+Exit criteria:
+
+- 0 replay side-effect incidents.
+- >=80% reviewer comprehension in usability test.
+- Median reviewer decision time under 10 minutes.
+- Pilot admins accept report as sufficient review support.
+
+### V2: Enterprise Hardening
+
+Scope:
+
+- LLM surrogate evaluator with human-labeled calibration.
+- Policy profiles by deployment risk tier.
+- Audit export.
+- Skill quality trend across versions.
+- Replay operations dashboard.
+
+Exit criteria:
+
+- Human vs surrogate agreement >=80% on unsafe tool golden set.
+- Clear process for policy changes and incident review.
+- Enterprise pilot customers can use audit export in governance review.
+
+## Open Questions
+
+- What minimum replay case count should be required before a report is considered useful?
+- Should skipped-provider reports block publish in regulated deployments?
+- What exact confidence levels should map to publish gate behavior?
+- Which toolsets are safe in each deployment mode?
+- How should reviewer overrides be recorded when they publish despite weak evidence?
+- What is the long-term storage retention policy for replay traces and artifacts?
+
+## Success Review Checklist
+
+- Product: Does the report answer "should this skill be published?"
+- Design: Can reviewers understand the summary without reading raw JSON?
+- Engineering: Can replay failures be reproduced and diagnosed?
+- Security: Can replay prove no production side effects by default?
+- Customer: Does this strengthen Beaver's enterprise trust story?
--- a/docs/product-discovery/skill-replay-eval/README.md
+++ b/docs/product-discovery/skill-replay-eval/README.md
@ -0,0 +1,13 @@
+# Skill Replay Eval Product Discovery
+
+This folder turns the Skill Replay Eval design into product-facing planning artifacts.
+
+- [Product Discovery Report](./product-discovery-report.md): opportunity, users, assumptions, experiments, feature priority, metrics, and 30/90 day recommendations.
+- [PRD](./PRD-skill-replay-eval.md): product requirements for engineering, design, review, validation, and release scope.
+- [Launch And Maintenance Runbook](./launch-maintenance-runbook.md): rollout, readiness checks, operational ownership, alerting, and maintenance cadence.
+
+Related source material:
+
+- [Skill Replay Eval Design](../../superpowers/specs/2026-06-08-skill-replay-eval-design.md)
+- [Skill Replay Eval Implementation Plan](../../superpowers/plans/2026-06-08-skill-replay-eval.md)
+- [Beaver customer presentation](../../presentations/skill-replay-eval/index.html)
--- a/docs/product-discovery/skill-replay-eval/launch-maintenance-runbook.md
+++ b/docs/product-discovery/skill-replay-eval/launch-maintenance-runbook.md
@ -0,0 +1,356 @@
+# Skill Replay Eval Launch And Maintenance Runbook
+
+Date: 2026-06-09
+
+Purpose: define how to validate, launch, operate, and maintain Skill Replay Eval safely.
+
+## 1. Launch Principle
+
+Ship Skill Replay Eval as a guarded trust feature.
+
+The system may help reviewers approve or reject a skill draft, but it must not create false certainty. When evidence is weak, the product should say so clearly. When tool safety is unclear, replay should prefer surrogate or blocked modes over production execution.
+
+## 2. Ownership
+
+| Area | Owner | Responsibility |
+| --- | --- | --- |
+| Product quality | Product owner | Metrics, pilot feedback, publish threshold decisions |
+| Replay pipeline | Backend engineer | Case selection, replay runner, scoring, report persistence |
+| Tool safety policy | Backend + security reviewer | Tool classification, blocked/surrogate rules, side-effect tests |
+| Skills UI | Frontend/design owner | Report summary, reviewer decision flow, report readability |
+| Operations | Deployment owner | Logs, alerts, provider availability, incident response |
+| Customer pilot | Pilot lead | Participant selection, feedback, rollout communication |
+
+## 3. Pre-Launch Readiness
+
+### Required Code Checks
+
+Run backend tests from `app-instance/backend`:
+
+```bash
+pytest tests/unit/test_skill_learning_eval_report_model.py -v
+pytest tests/unit/test_skill_learning_case_selection.py -v
+pytest tests/unit/test_skill_learning_preservation.py -v
+pytest tests/unit/test_skill_learning_replay.py -v
+pytest tests/unit/test_skill_learning_replay_runner.py -v
+pytest tests/unit/test_agent_loop_replay_executor.py -v
+pytest tests/unit/test_skill_learning_surrogate.py -v
+pytest tests/unit/test_skill_learning_eval.py -v
+pytest tests/unit/test_skill_learning_pipeline.py -v
+```
+
+Run frontend verification from `app-instance/frontend`:
+
+```bash
+npm run lint
+npm run test -- --runInBand
+```
+
+If frontend tests are not configured, perform manual Skills page verification with seeded report payloads.
+
+### Golden Safety Cases
+
+Before pilot launch, create or manually verify a golden set with these cases:
+
+| Case | Expected Result |
+| --- | --- |
+| Safe filesystem read | `executed` |
+| Safe filesystem write to replay workspace | `executed`, no production write |
+| User-file write in replay namespace | `executed` only if isolated, otherwise `surrogate` |
+| Web/search read | `executed` or cached read |
+| Email send | `surrogate` |
+| Calendar invite | `surrogate` |
+| Connector publish/post/reply | `surrogate` |
+| Delete/remove/destroy | `blocked` |
+| Permission/credential/payment action | `blocked` |
+
+Launch blocker:
+
+- Any replay case mutates production workspace, user files, credentials, external accounts, permissions, or payment state.
+
+### Report Readiness Checks
+
+Each replay report must show:
+
+- Eval status.
+- Baseline average.
+- Candidate average.
+- Score delta.
+- Improved/regressed/unchanged counts.
+- Execution coverage.
+- Surrogate coverage.
+- Blocked coverage.
+- Confidence.
+- Replay cases.
+- Case reports.
+- Preservation report when applicable.
+- Raw report for debugging.
+
+### Publish Gate Checks
+
+Publish must fail when:
+
+- No approved review exists.
+- Safety report is missing or failed.
+- Eval report failed, except explicit skipped-provider status.
+- Replay confidence is low.
+- Replay blocked coverage is `1.0`.
+- Preservation report failed.
+
+Publish may proceed with explicit human review when:
+
+- Provider is unavailable and eval status is `skipped_provider_unavailable`.
+- Replay evidence is partial, but reviewer records a rationale and deployment policy allows it.
+
+## 4. Rollout Plan
+
+### Phase 0: Shadow Mode
+
+Audience: internal team only.
+
+Duration: 1 week or 10 draft evaluations, whichever comes first.
+
+Behavior:
+
+- Generate replay reports.
+- Do not change existing publish decisions unless a critical safety issue appears.
+- Compare replay recommendation with human reviewer decision.
+
+Exit criteria:
+
+- No production side effects.
+- No unexplained replay crashes on common drafts.
+- Reviewers can explain report meaning.
+- Product owner reviews gate threshold data.
+
+### Phase 1: Strict Internal Gate
+
+Audience: internal maintainers and trusted reviewers.
+
+Behavior:
+
+- Enforce low-confidence, blocked coverage, failed preservation, failed eval, and failed safety gates.
+- Require manual rationale for skipped-provider publish.
+
+Exit criteria:
+
+- 0 P0 incidents.
+- Publish blockers are actionable and not noisy.
+- Reviewer median decision time under 10 minutes for common drafts.
+
+### Phase 2: Pilot Customer Gate
+
+Audience: selected pilot customer or internal department.
+
+Behavior:
+
+- Keep human review mandatory.
+- Provide customer-facing explanation of replay evidence.
+- Track skipped-provider and low-confidence cases closely.
+
+Exit criteria:
+
+- Pilot admin accepts report as useful governance evidence.
+- No side-effect incidents.
+- Top confusion points are documented and scheduled for UI copy/design improvements.
+
+### Phase 3: General Availability Candidate
+
+Audience: all enabled deployments.
+
+Behavior:
+
+- Replay Eval enabled by default where provider and case data are available.
+- Skipped-provider state remains explicit.
+- Tool policy remains conservative.
+
+Exit criteria:
+
+- Operational dashboard exists.
+- Incident response is rehearsed.
+- Policy change process is documented.
+
+## 5. Monitoring
+
+### Product Metrics
+
+| Metric | Owner | Cadence | Alert |
+| --- | --- | --- | --- |
+| Trusted Skill Publish Rate | Product | Weekly | <60% for 2 weeks |
+| Reviewer Decision Time | Product/design | Weekly | p95 >30 minutes |
+| Replay Regression Rate | Product/engineering | Weekly | >20% of replay reports |
+| Report Comprehension | Product/design | Per research round | <80% explain coverage/confidence correctly |
+
+### Operational Metrics
+
+| Metric | Owner | Cadence | Alert |
+| --- | --- | --- | --- |
+| Replay status counts | Engineering | Daily during pilot | Any spike in `replay_error` or `partial` |
+| Provider unavailable skip rate | Operations | Daily | >25% of evals in pilot |
+| Replay latency p50/p95 | Engineering | Daily | p95 >15 minutes |
+| Blocked coverage | Security/engineering | Weekly | Any report with blocked_coverage=1.0 |
+| Production side-effect incidents | Security/operations | Immediate | Any nonzero event |
+| Failed preservation reports | Product/engineering | Weekly | Spike after synthesizer change |
+
+### Logs To Inspect
+
+- Skill learning candidate events.
+- Draft creation and safety report events.
+- Eval report generation events.
+- Replay arm run ids and source `skill_replay_eval`.
+- Tool traces and classification reasons.
+- Publish gate errors.
+- Provider unavailable errors.
+
+## 6. Incident Response
+
+### P0: Production Side Effect During Replay
+
+Examples:
+
+- Email sent.
+- Calendar invite created.
+- External connector publish/post/reply happened.
+- Production file or credential changed.
+- Permission/payment action executed.
+
+Immediate actions:
+
+1. Disable replay eval generation.
+2. Disable skill publish if policy risk is unclear.
+3. Preserve logs, replay traces, eval reports, and affected tool metadata.
+4. Identify tool name, toolset, metadata, classification reason, arguments, and tenant.
+5. Patch the policy so the affected tool class is blocked or routed to surrogate mode.
+6. Add a regression test to golden safety cases.
+7. Notify pilot/customer owner if customer data or systems were affected.
+
+Restart criteria:
+
+- Root cause documented.
+- Regression test passes.
+- Security owner approves restart.
+
+### P1: False Pass
+
+Definition: draft passed replay and was published, then confirmed to regress a real accepted workflow.
+
+Actions:
+
+1. Unpublish or revert skill version if impact is active.
+2. Add the failed task as a replay case.
+3. Inspect whether case selection missed the scenario or scoring overrated it.
+4. Adjust gate threshold, surrogate scoring, or preservation check.
+5. Record postmortem in skill quality log.
+
+### P1: False Block
+
+Definition: useful draft blocked due to bad replay policy, low-confidence bug, or report construction issue.
+
+Actions:
+
+1. Do not bypass silently; record reviewer rationale.
+2. Identify blocking rule and trace.
+3. Add a regression test if the cause is a policy bug.
+4. Decide whether threshold should change or case should remain blocked.
+
+### P2: Provider Unavailable Spike
+
+Actions:
+
+1. Check provider configuration and model availability.
+2. Confirm fallback status is explicit.
+3. Track how many publish decisions rely on skipped-provider.
+4. Pause broad rollout if skipped-provider exceeds pilot threshold.
+
+## 7. Maintenance Cadence
+
+### Daily During Pilot
+
+- Check replay errors and provider skips.
+- Check blocked_coverage=1.0 reports.
+- Confirm no side-effect incidents.
+- Review new publish gate failures.
+
+### Weekly
+
+- Review metrics dashboard.
+- Calibrate publish gate thresholds.
+- Review 3-5 replay reports for readability.
+- Inspect false pass/false block candidates.
+- Update tool policy based on new tools or connectors.
+
+### Monthly
+
+- Review customer/pilot feedback.
+- Refresh golden safety cases.
+- Sample preservation reports for missed instruction drops.
+- Review storage growth from replay case reports and traces.
+- Decide whether to promote features from Should Have to Must Have.
+
+### Quarterly
+
+- Revisit risk model and tool policy profiles.
+- Review whether LLM surrogate calibration meets quality target.
+- Decide whether to add audit export or per-deployment policy UI.
+- Retire stale replay cases or update case selection logic.
+
+## 8. Data Retention And Privacy
+
+Replay reports may contain task text, tool arguments, schemas, final answers, and side-effect descriptions. Treat them as sensitive operational data.
+
+Recommended policy:
+
+- Store summarized report for normal review.
+- Limit raw case report retention or restrict access to admins.
+- Redact credentials, tokens, secrets, and obvious personal identifiers from tool arguments before display where possible.
+- Do not store results of production external writes; those writes should not execute during replay in the first place.
+- Define tenant-specific retention before enterprise rollout.
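+
+Redaction of tool arguments before display could start from a key-based filter like the sketch below. The key patterns, the placeholder string, and the function name are assumptions; matching on key names does not catch secrets or personal identifiers embedded in free-text values, so this is a starting point rather than a complete control.
+
+```python
+import re
+from typing import Any
+
+# Assumed key patterns; extend to match the deployment's own secret naming.
+SENSITIVE_KEY = re.compile(r"(token|secret|password|credential|api[_-]?key)",
+                           re.IGNORECASE)
+REDACTED = "[REDACTED]"
+
+
+def redact(value: Any, key: str = "") -> Any:
+    """Return a copy of tool arguments with sensitive-looking keys masked."""
+    if key and SENSITIVE_KEY.search(key):
+        return REDACTED
+    if isinstance(value, dict):
+        # Recurse with each key so nested secrets are masked too.
+        return {k: redact(v, str(k)) for k, v in value.items()}
+    if isinstance(value, list):
+        return [redact(item) for item in value]
+    return value
+```
+
+The function builds new containers instead of mutating its input, so the stored raw trace stays intact for admins while the displayed copy is masked.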
+
+## 9. Release Communication
+
+### Internal Message
+
+Skill Replay Eval adds evidence to skill publishing. Reviewers will now see whether a draft improved, regressed, or preserved accepted task behavior. Reports disclose what executed, what was judged by surrogate, what was blocked, and whether revised skills preserved important sections.
+
+### Customer / Pilot Message
+
+Beaver can now evaluate reusable skill drafts against prior accepted work before publication. The report shows both confidence and uncertainty. Unsafe external actions are not executed automatically during replay; they are recorded for review or blocked by policy.
+
+### Known Limitations To Disclose
+
+- Replay quality depends on available accepted historical runs.
+- Surrogate evaluation is not the same as real execution.
+- Low-confidence reports require more human review.
+- Human approval is still required.
+- First release does not include per-tool policy UI or full per-case container orchestration.
+
+## 10. Rollback Plan
+
+Rollback options:
+
+1. Disable replay runner injection and fall back to heuristic eval.
+2. Keep report fields but set mode to `heuristic`.
+3. Keep publish gate requiring safety and human review.
+4. Temporarily treat replay errors as non-blocking only if security owner confirms no side-effect risk.
+5. Preserve failed replay reports for debugging.
+
+Rollback triggers:
+
+- Any P0 side-effect incident.
+- Repeated replay errors that block normal skill review.
+- Provider unavailable spike that makes most reports skipped.
+- Reviewer decision time becomes unacceptable and no quick UI fix exists.
+
+## 11. Launch Checklist
+
+- [ ] Backend replay tests pass.
+- [ ] Frontend report rendering verified.
+- [ ] Golden tool safety cases pass.
+- [ ] No production side-effect path found.
+- [ ] Publish gates tested manually.
+- [ ] Skipped-provider copy is clear.
+- [ ] Reviewer decision summary exists or is tracked as a launch follow-up.
+- [ ] Pilot participants selected.
+- [ ] Metrics dashboard owner assigned.
+- [ ] Incident owner and escalation path assigned.
+- [ ] Rollback path verified.
--- a/docs/product-discovery/skill-replay-eval/product-discovery-report.md
+++ b/docs/product-discovery/skill-replay-eval/product-discovery-report.md
@ -0,0 +1,512 @@
+# Skill Replay Eval Product Discovery Report
+
+Date: 2026-06-09
+
+Product stage: existing product
+
+Primary feature: Skill Replay Eval for Beaver skill learning and publishing
+
+Source context:
+
+- Existing product and deployment: `README.md`, `部署指南.md`
+- Feature design: `docs/superpowers/specs/2026-06-08-skill-replay-eval-design.md`
+- Delivery plan: `docs/superpowers/plans/2026-06-08-skill-replay-eval.md`
+- Current implementation signals: `beaver/skills/learning/{case_selection,preservation,replay,surrogate,eval}.py`, Skills page replay report UI, publish gate checks
+- Customer positioning: `docs/presentations/skill-replay-eval/index.html`
+
+## Executive Summary
+
+Beaver is positioned as an enterprise Agent execution and governance platform. Its core value is not only running tasks, but also making AI work traceable, acceptable, reusable, and governable. Skill Replay Eval is the quality gate that makes the "reusable skill" promise credible: before a skill draft is published, Beaver should test whether it improves or preserves real task behavior.
+
+The current design correctly identifies the product risk: heuristic-only skill scoring is not enough for enterprise trust. A draft skill can look complete in text while causing tool misuse, dropping safety instructions, or regressing accepted workflows. Replay evaluation closes this gap by comparing baseline and candidate behavior on accepted historical tasks, classifying tool calls into executed, surrogate, or blocked modes, and adding a preservation check for revised skills.
+
+The product direction should be: ship replay eval as a staged trust feature, not as a perfect benchmark system. The first release should make evaluation coverage and uncertainty visible, block obvious regressions, and give reviewers enough evidence to approve or reject drafts. The next releases should improve case quality, sandbox isolation, surrogate judgment quality, and operational dashboards.
+
+## Product Summary
+
+### Product Description
+
+Skill Replay Eval is a review and publishing gate for Beaver skills. It evaluates skill drafts against prior accepted task runs and shows whether the draft improves, preserves, or harms real task outcomes. It separates safe tool execution from surrogate evaluation for unsafe or unavailable tools, and it checks whether revised skill drafts preserve important original instructions.
+
+### Target Users
+
+| Segment | Job To Be Done | Success Looks Like |
+| --- | --- | --- |
+| Enterprise AI platform owner | Govern reusable Agent capabilities before they spread across teams | No risky skill is published without evidence, review, and audit trail |
+| Skill reviewer / admin | Decide whether a skill draft is good enough to approve | Replay report explains score, coverage, regressions, and preservation risks |
+| Internal workflow owner | Convert accepted tasks into repeatable team methods | Similar future tasks become faster and more reliable |
+| Engineer / implementer | Build and debug the eval pipeline | Replay failures are reproducible, scoped, and observable |
+| Security / IT reviewer | Understand side effects and tool risk | Production writes are not executed during automatic replay |
+
+### Current Features
+
+Existing Beaver product capabilities relevant to this feature:
+
+- Task lifecycle: route, plan, execute, track, accept, modify, or abandon.
+- Evidence and timeline: tool calls, artifacts, task status, and validation signals.
+- Skill learning: candidates, drafts, safety report, eval report, review, publish.
+- Multi-instance deployment: isolated `app-instance` per user/team via Docker.
+- Tool and connector framework: local tools, MCP tools, external connectors, files, web/search, scheduled tasks.
+
+Current Skill Replay Eval implementation signals:
+
+- `SkillDraftEvalReport` has replay fields: mode, eval version, execution coverage, surrogate coverage, blocked coverage, confidence, case reports, tool mode summary, and preservation report.
+- `select_replay_cases()` selects up to 10 accepted historical runs by candidate type.
+- `ReplayToolExecutor` classifies tool calls as executed, surrogate, or blocked.
+- `ReplayRunner` runs baseline and candidate arms through AgentLoop with a replay tool executor.
+- `SurrogateToolEvaluator` scores non-executed calls through deterministic intended-call heuristics.
+- Publish gates block low-confidence replay reports, fully blocked replay reports, and failed preservation reports.
+- Skills UI exposes execution coverage, surrogate coverage, confidence, replay cases, raw case reports, and preservation reports.
+
+### Current Architecture
+
+```text
+Accepted task runs
+  -> SkillLearningCandidate
+  -> SkillDraft
+  -> case selection
+  -> baseline arm and candidate arm
+  -> replay tool executor
+       -> executed tools for safe toolsets
+       -> surrogate traces for external writes or unsafe integrations
+       -> blocked traces for destructive calls
+  -> surrogate scoring and coverage aggregation
+  -> preservation checker for revise/merge
+  -> SkillDraftEvalReport
+  -> Skills review UI
+  -> publish gate
+```
+
+Product boundary:
+
+- Replay Eval should evaluate skill behavior, not replace human review.
+- Replay Eval should never write to production workspace, user files, external accounts, third-party systems, credentials, permissions, or payments by default.
+- Low confidence should increase review burden instead of creating false certainty.
+
+### Current Value Proposition
+
+For enterprise users, Beaver can say: "Accepted work can become reusable skills, and those skills are checked against real task behavior before they are published." This directly supports Beaver's larger promise of controlled, traceable, reusable Agent execution.
+
+### Current Challenges
+
+| Challenge | Product Impact | Current Risk |
+| --- | --- | --- |
+| Historical accepted runs may be sparse or low quality | Replay evidence can be weak | Medium |
+| Surrogate scoring is currently simple | Unsafe tool calls may be judged with low fidelity | High |
+| Replay environment isolation must be enforceable | Enterprise trust depends on no accidental production side effects | High |
+| Reviewers need clear explanations | Raw case reports can overwhelm non-engineers | Medium |
+| Publish gates may be too strict or too loose | Either slows adoption or lets regressions through | Medium |
+| Skill preservation is section-based | Important instruction changes inside a section may be missed | Medium |
+
+## Missing Information And Ambiguities
+
+- No real customer interview data is provided for skill reviewers, enterprise admins, or workflow owners.
+- No baseline metrics exist for current heuristic eval false positives or false negatives.
+- No defined quality threshold exists for minimum acceptable replay coverage per skill category.
+- No clear operational owner is assigned for replay failures, low confidence reports, or blocked tool classifications.
+- No explicit policy matrix exists per toolset, customer deployment mode, or tenant risk tier.
+- No customer-facing language has been finalized for explaining surrogate evaluation limitations.
+
+## User Segments
+
+### Segment 1: Skill Governance Admin
+
+This user owns skill approval. They need a reliable way to decide whether a skill should be published. Their main pain is that a skill draft can appear well-written but still fail on real tasks.
+
+### Segment 2: Enterprise AI Platform Buyer
+
+This user evaluates Beaver as an internal AI platform. They care about risk, adoption, cost, governance, and operational control. They need to see that reusable Agent capabilities are not published blindly.
+
+### Segment 3: Workflow Owner
+
+This user has repeatable work such as weekly reports, project delivery, technical support, or file processing. They want accepted workflows to become faster and more consistent over time.
+
+### Segment 4: Beaver Engineer / Operator
+
+This user debugs replay failures, expands safe tool coverage, adjusts publish gates, and keeps the eval pipeline reliable.
+
+## JTBD
+
+| User | Job Story | Current Alternative | Desired Outcome |
+| --- | --- | --- | --- |
+| Skill reviewer | When a skill draft is ready, I want to see whether it works on prior accepted tasks, so I can approve it with evidence | Read the draft manually | Approve, reject, or revise with confidence |
+| Admin | When a skill touches tools, I want to know what would execute, what is simulated, and what is blocked, so I can manage risk | Trust reviewer judgment | Clear coverage and side-effect evidence |
+| Workflow owner | When my accepted task becomes a reusable skill, I want it to preserve what made the original task successful | Rewrite prompts manually | Similar future work gets better |
+| Operator | When replay fails, I want to know whether the issue is provider, tool policy, case data, or candidate behavior | Read logs manually | Fast diagnosis and recovery |
+
+## Alternative Product Positioning
+
+| Positioning | Strength | Weakness | Recommendation |
+| --- | --- | --- | --- |
+| "Skill unit tests for Agents" | Easy for engineers to understand | Too narrow; suggests deterministic tests only | Use in engineering docs |
+| "Replay-based skill quality gate" | Accurate and product-relevant | Needs explanation for non-technical buyers | Primary internal positioning |
+| "Enterprise Agent governance evidence" | Strong for buyers | Less precise for builders | Use in sales and customer docs |
+| "A/B testing for skill drafts" | Captures baseline vs candidate | May imply live user traffic experiments | Use carefully |
+
+Recommended positioning:
+
+> Skill Replay Eval is Beaver's evidence-based quality gate for reusable Agent skills. It replays accepted historical tasks, compares baseline and candidate behavior, and exposes execution coverage, surrogate coverage, regressions, and preservation risk before publication.
+
+## Opportunity Areas
+
+| Opportunity | Importance | Current Satisfaction | Opportunity Score | Notes |
+| --- | ---: | ---: | ---: | --- |
+| I need proof that a skill draft improves real task behavior | 0.95 | 0.25 | 0.71 | Core opportunity |
+| I need automatic replay to avoid unsafe side effects | 0.95 | 0.35 | 0.62 | Required for enterprise trust |
+| I need reports that are understandable to reviewers | 0.85 | 0.35 | 0.55 | Key adoption driver |
+| I need preservation of existing skill instructions | 0.80 | 0.45 | 0.44 | Important for revisions |
+| I need replay failures to be diagnosable | 0.75 | 0.40 | 0.45 | Operational maturity |
+| I need configurable policy per deployment | 0.70 | 0.30 | 0.49 | Later enterprise hardening |
+
+Top opportunities:
+
+1. Evidence that a draft improves or preserves accepted task behavior.
+2. Safe replay with explicit executed/surrogate/blocked coverage.
+3. Reviewer-facing explanation that turns raw traces into decisions.
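+
+The report does not state how the Opportunity Score column is derived. The values in the table are consistent with `importance * (1 - current satisfaction)`, so the sketch below reproduces the ranking under that inferred formula. The function names and shortened row labels are illustrative only.
+
+```python
+# Rows copied from the Opportunity Areas table: (label, importance, satisfaction).
+ROWS = [
+    ("Proof that a draft improves real task behavior", 0.95, 0.25),
+    ("Replay avoids unsafe side effects", 0.95, 0.35),
+    ("Reports understandable to reviewers", 0.85, 0.35),
+    ("Preservation of existing skill instructions", 0.80, 0.45),
+    ("Replay failures are diagnosable", 0.75, 0.40),
+    ("Configurable policy per deployment", 0.70, 0.30),
+]
+
+
+def opportunity_score(importance: float, satisfaction: float) -> float:
+    """Inferred formula (an assumption): importance weighted by unmet satisfaction."""
+    return importance * (1.0 - satisfaction)
+
+
+def ranked(rows):
+    """Return (label, score) pairs sorted from highest to lowest score."""
+    scored = [(label, opportunity_score(imp, sat)) for label, imp, sat in rows]
+    return sorted(scored, key=lambda pair: pair[1], reverse=True)
+
+
+if __name__ == "__main__":
+    for label, score in ranked(ROWS):
+        print(f"{score:.2f}  {label}")
+```
+
+Under this formula the three highest-scoring rows are the same three listed as top opportunities.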
+
+## Product Expansion Ideas
+
+Generated from PM, Designer, and Engineer perspectives.
+
+### Product Manager Ideas
+
+1. Replay Readiness Score: show whether a draft has enough historical evidence before eval starts.
+2. Skill Release Gate Levels: allow advisory, strict, and regulated gates per workspace.
+3. Regression Triage Queue: collect failed cases and route them to skill authors.
+4. Customer-facing Audit Export: export replay report as PDF/Markdown for security review.
+5. Skill Quality Trend: show whether a skill improves or degrades across versions.
+
+### Product Designer Ideas
+
+1. Reviewer Decision View: summarize "approve / revise / reject" with reasons before raw JSON.
+2. Coverage Timeline: visualize executed, surrogate, and blocked calls per case.
+3. Preservation Diff: show dropped or changed sections in a readable comparison.
+4. Replay Case Drilldown: task text, expected behavior, baseline output, candidate output, and validator notes.
+5. Confidence Language: translate low/medium/high confidence into concrete reviewer actions.
+
+### Engineer Ideas
+
+1. Pluggable Tool Policy Registry: classify tools by toolset, transport, metadata, and deployment risk.
+2. Deterministic Replay Fixtures: save replay inputs and traces for reproducible debugging.
+3. Sandbox User File Namespace: isolate user-file writes per replay arm.
+4. LLM Surrogate Provider: replace deterministic heuristics with structured model judgment when available.
+5. Replay Telemetry: metrics for replay latency, failure mode, blocked coverage, and provider availability.
+
+Top 5 selected ideas:
+
+| Rank | Idea | Why Selected | Assumptions To Validate |
+| ---: | --- | --- | --- |
+| 1 | Reviewer Decision View | Converts technical eval into action | Reviewers trust summarized recommendations |
+| 2 | Sandbox User File Namespace | Directly addresses production side-effect risk | Existing file tooling can be redirected cleanly |
+| 3 | LLM Surrogate Provider | Improves unsafe tool judgment quality | LLM judgment is consistent enough for review support |
+| 4 | Replay Readiness Score | Prevents weak reports from appearing authoritative | Enough metadata exists to estimate readiness |
+| 5 | Preservation Diff | Makes revision risk visible and actionable | Section and body-level diffs catch meaningful drops |
+
+## Key Assumptions
+
+| Assumption | Category | Impact | Uncertainty |
+| --- | --- | ---: | ---: |
+| Accepted historical runs are representative enough to evaluate future skill behavior | Value | High | High |
+| Reviewers will use replay reports to make better publish decisions | Value | High | Medium |
+| Safe tools can execute in isolation without leaking state or causing production side effects | Feasibility | High | High |
+| Surrogate evaluation can judge unsafe tool calls well enough to support review | Feasibility | High | High |
+| Coverage and confidence are understandable to non-engineer reviewers | Usability | Medium | High |
+| Publish gates will reduce risky releases without blocking too many useful skills | Viability | High | Medium |
+| Skill preservation can be detected with lightweight section checks in v1 | Feasibility | Medium | Medium |
+| Replay latency will be acceptable for review workflows | Usability | Medium | Medium |
+| Customers will value replay eval enough to differentiate Beaver from generic Agent tools | Business Viability | High | Medium |
+| The team can maintain tool policy as tools/connectors grow | Team Capability | High | Medium |
+
+## Prioritized Assumptions
+
+Priority = Impact x Uncertainty.
+
+### P0 Validate Immediately
+
+| Assumption | Why It Matters | What Could Go Wrong | Suggested Validation |
+| --- | --- | --- | --- |
+| Safe replay isolation is real, not only conceptual | One accidental external write can break trust | Replay calls production filesystem, connector, or credential paths | Technical isolation test with destructive and external-write tools |
+| Replay reports help reviewers make better decisions | Product value depends on review decisions changing | Reports are too raw, ignored, or misunderstood | Reviewer usability test with 5 draft decisions |
+| Surrogate evaluation is good enough for unsafe tools | Many enterprise tools cannot execute in replay | It rubber-stamps bad calls or flags good calls | Golden set of unsafe tool scenarios scored by humans vs surrogate |
+| Historical accepted cases are adequate for eval | Weak cases create false confidence | Too few accepted runs or repetitive cases | Analyze real run store coverage across skills |
+
+### P1 Important
+
+| Assumption | Why It Matters | Validation |
+| --- | --- | --- |
+| Publish gate thresholds are calibrated | Prevents both overblocking and underblocking | Run shadow mode for 2 weeks and compare human decisions |
+| Preservation checker catches meaningful draft regressions | Revision safety depends on it | Compare section checker with manual diff review |
+| Replay latency fits review workflow | Slow eval hurts adoption | Measure p50/p95 per case and per draft |
+| Customers understand confidence and coverage language | Trust depends on clear communication | Customer-facing report comprehension test |
+
+### P2 Later
+
+| Assumption | Why It Matters | Validation |
+| --- | --- | --- |
+| Per-tool policy UI is needed | May not be needed in v1 | Observe support/admin requests |
+| Audit export becomes a buying requirement | Useful for enterprise sales | Ask pilot buyers during procurement review |
+| Skill quality trend is a major retention driver | Useful after multiple versions exist | Measure repeat reviewer usage after v1 |
+
+## Opportunity Solution Tree
+
+Desired outcome:
+
+> Increase trusted skill publication: at least 80% of approved skill drafts have replay or explicit skipped-provider evidence, zero known production side effects from replay, and reviewer decision time under 10 minutes for common drafts.
+
+```text
+Outcome: Trusted skill publication
+
+Opportunity 1: I need proof that a skill draft improves real task behavior.
+  Solution 1.1: Baseline vs candidate replay on accepted historical tasks.
+    Experiment: Run replay on 10 recent skill drafts and compare with manual reviewer judgment.
+  Solution 1.2: Replay readiness score before evaluation starts.
+    Experiment: Score existing candidates and check whether low-readiness reports are less useful.
+  Solution 1.3: Regression triage queue.
+    Experiment: Manually label failed cases for two weeks and measure fix rate.
+
+Opportunity 2: I need replay to avoid unsafe side effects.
+  Solution 2.1: Tool mode classification: executed, surrogate, blocked.
+    Experiment: Golden tool policy test set covering filesystem, MCP, connectors, delete, send, publish.
+  Solution 2.2: Isolated workspace and user-file namespace per arm.
+    Experiment: Replay write task and verify no production paths change.
+  Solution 2.3: Side-effect journal in each case report.
+    Experiment: Security reviewer reads 5 reports and identifies all intended side effects.
+
+Opportunity 3: I need reports I can act on.
+  Solution 3.1: Reviewer decision summary with approve/revise/reject guidance.
+    Experiment: First-click and decision-time test with reviewers.
+  Solution 3.2: Coverage and confidence explanation.
+    Experiment: Ask reviewers to explain report meaning after reading it.
+  Solution 3.3: Preservation diff for revisions.
+    Experiment: Seed dropped-instruction drafts and measure detection rate.
+```
+
+## Validation Experiments
+
+| P0 Assumption | Hypothesis | Experiment | Cost | Duration | Success Criteria | Failure Criteria |
+| --- | --- | --- | --- | --- | --- | --- |
+| Safe replay isolation | Replay can execute safe tools without touching production state | Build a replay fixture that writes, reads, sends, deletes, and publishes through classified tools | Medium | 2-4 days | 100% of production paths untouched; destructive calls blocked; external writes routed to surrogate | Any real external write or production path mutation |
+| Reviewer decision value | Replay reports improve approval accuracy and speed | Give 5 reviewers 8 historical drafts with and without replay report | Low | 2 days | Decision accuracy +25%; median decision time under 10 minutes | No improvement or reports misunderstood |
+| Surrogate quality | Surrogate scoring agrees with human reviewer on unsafe tool calls | Create 30 unsafe-tool scenarios and compare human labels vs surrogate output | Medium | 3-5 days | >=80% agreement on pass/fail; all high-risk bad calls flagged | High-risk false pass |
+| Historical case adequacy | Accepted runs provide enough useful replay cases | Audit run store across top 10 skills/candidates | Low | 1 day | >=70% of candidates have >=3 meaningful accepted cases | Most candidates have no usable cases |
+
+## Feature Prioritization
+
+### Must Have
+
+| Feature | Impact | Effort | Risk | Strategic Alignment |
+| --- | --- | --- | --- | --- |
+| Eval report compatibility fields | High | Low | Low | Required foundation |
+| Historical accepted case selection | High | Medium | Medium | Required for behavior evidence |
+| Baseline vs candidate replay arms | High | High | High | Core value |
+| Tool mode classification | High | Medium | High | Core trust boundary |
+| Replay coverage and confidence report | High | Medium | Medium | Reviewer decision support |
+| Publish gates for failed/low-confidence replay | High | Low | Medium | Governance promise |
+| Preservation check for revise/merge drafts | Medium | Medium | Medium | Prevents silent instruction loss |
+| Skills UI report summary | High | Medium | Medium | Adoption requirement |
+
+### Should Have
+
+| Feature | Impact | Effort | Risk | Strategic Alignment |
+| --- | --- | --- | --- | --- |
+| Reviewer decision summary | High | Medium | Medium | Converts evidence to action |
+| Preservation diff view | Medium | Medium | Low | Improves reviewer comprehension |
+| Replay readiness score | Medium | Medium | Medium | Prevents false confidence |
+| Operational metrics dashboard | Medium | Medium | Low | Needed for maintenance |
+| Golden tool policy test suite | High | Medium | Medium | Needed for safety assurance |
+
+### Could Have
+
+| Feature | Impact | Effort | Risk | Strategic Alignment |
+| --- | --- | --- | --- | --- |
+| Audit export | Medium | Medium | Low | Enterprise sales support |
+| Skill quality trend | Medium | Medium | Medium | Useful after version history grows |
+| Per-tool admin policy UI | Medium | High | Medium | Enterprise customization |
+| Replay fixtures download | Low | Medium | Low | Debugging convenience |
+
+### Not Yet
+
+| Feature | Reason |
+| --- | --- |
+| Full Docker orchestration per replay case | Too heavy for first release; design explicitly scopes it out |
+| Production third-party write replay | Violates trust boundary |
+| Removing human review | Replay evidence should support review, not replace it |
+| Fully customizable policy UI | Add after policy needs are observed |
+
+Features to cut from v1:
+
+- Per-tool policy UI.
+- Audit export.
+- Skill quality trend.
+- Full Docker-per-case orchestration.
+
+Features likely over-engineered for v1:
+
+- Customer-configurable replay policies before default policy is proven.
+- Complex statistical scoring before case quality and surrogate accuracy are validated.
+- Automatic publish for high-scoring drafts.
+
+## Metrics Dashboard
+
+### North Star Metric
+
+Trusted Skill Publish Rate:
+
+> Approved skill drafts with usable eval evidence and no post-publish regression reports / total approved skill drafts, measured weekly.
+
+Target for v1 pilot: >=80%.
+
+### Input Metrics
+
+| Metric | Definition | Data Source | Visualization | Target | Alert Threshold |
+| --- | --- | --- | --- | --- | --- |
+| Replay Evidence Coverage | Draft eval reports with mode `replay` or explicit skipped-provider status / all eval reports | Skill eval store | Weekly line | >=80% | <60% for 2 weeks |
+| Executed Tool Coverage | Executed tool calls / all replay tool calls | Case reports | Stacked bar | >=50% for safe-tool skills | <25% for safe-tool skills |
+| Surrogate Coverage | Surrogate tool calls / all replay tool calls | Case reports | Stacked bar | Transparent, not necessarily low | Sudden +30% week over week |
+| Blocked Coverage | Blocked tool calls / all replay tool calls | Case reports | Stacked bar | <10% | >=25% or any blocked_coverage=1.0 |
+| Reviewer Decision Time | Time from eval report created to approve/reject/revise | Review events | Median and p95 | Median <10 min | p95 >30 min |
+| Replay Regression Rate | Reports with regression_count > 0 / replay reports | Eval store | Weekly line | Investigate, not zero-forced | >20% |
+
+### Leading Indicators
+
+- Number of accepted runs eligible for replay per skill.
+- Percentage of candidates with at least 3 replay cases.
+- Provider unavailable skip rate.
+- Replay error or partial status rate.
+- Preservation failures per revised skill draft.
+
+### Guardrail Metrics
+
+| Guardrail | Definition | Alert |
+| --- | --- | --- |
+| Production Side Effect Incidents | Any replay-caused write to production workspace, user files, credentials, or external systems | Immediate P0 |
+| False Pass Incidents | Published draft later confirmed to regress an accepted workflow despite passing replay | Weekly review; P1 if repeated |
+| False Block Incidents | Useful draft blocked due to bad policy or low-confidence bug | Weekly review |
+| Replay Latency | p95 replay completion time per draft | Alert if p95 >15 minutes in pilot |
+| Report Comprehension | Reviewers correctly explain coverage/confidence in usability tests | Rework UI copy if <80% |
+
+### Review Cadence
+
+- Daily during pilot: replay errors, side-effect alerts, provider skips.
+- Weekly: publish outcomes, regression rate, reviewer decision time, blocked/surrogate coverage.
+- Monthly: threshold calibration and customer feedback.
+- Quarterly: policy model, scoring model, and roadmap review.
+
+## Customer Research Plan
+
+No customer interviews or support tickets were provided. Run research before treating demand and usability assumptions as validated.
+
+### Research Participants
+
+- 3-5 internal skill reviewers or admins.
+- 3 workflow owners who want accepted tasks converted into reusable skills.
+- 2 enterprise/security stakeholders who review AI governance.
+- 2 engineers/operators responsible for deployment and incident response.
+
+### Research Questions
+
+- What evidence do reviewers need before approving a reusable skill?
+- Which replay report fields are meaningful, and which are noise?
+- Do users understand executed vs surrogate vs blocked coverage?
+- What level of uncertainty is acceptable for publishing?
+- What customer-facing proof is needed for enterprise pilots?
+- Which tool categories must never execute during replay?
+
+### Recommended Actions
+
+- Run a moderated reviewer test with the current Skills page report.
+- Create 5 seeded draft cases: clear improvement, clear regression, unsafe external write, preservation drop, provider unavailable.
+- Ask participants to approve/revise/reject each case and explain why.
+- Compare their decisions with current publish gate behavior.
+
+## Interview Guide
+
+### Objectives
+
+- Validate whether replay evidence changes approval behavior.
+- Identify confusing report language.
+- Understand risk tolerance for surrogate and blocked calls.
+- Learn what artifacts enterprise buyers need for adoption.
+
+### Warm-Up
+
+- Tell me about the last time you reviewed or approved reusable AI guidance, prompts, tools, or workflows.
+- What made the approval easy or hard?
+- What happened after it was approved?
+
+### JTBD Questions
+
+- Walk me through the last time an AI workflow worked well enough that you wanted to reuse it.
+- What evidence did you have that it would work again?
+- What would make you hesitate to publish it for others?
+- What does "safe to publish" mean in your environment?
+
+### Behavioral Questions
+
+- Show me how you would decide whether this draft should be approved.
+- Which part of this report would you read first?
+- What would you ignore?
+- What would you ask an engineer to explain?
+
+### Risk Validation Questions
+
+- If a replay report says 70% executed and 30% surrogate, what decision would you make?
+- If all important external writes were surrogate-evaluated, is that enough for review?
+- Which tools should always be blocked in your environment?
+- What kind of failure would make you disable replay eval?
+
+### Note Template
+
+```text
+Participant:
+Role:
+Date:
+Last relevant review:
+Decision evidence needed:
+Confusing report fields:
+Risk tolerance:
+Must-block tool categories:
+Minimum publish evidence:
+Unexpected insight:
+Follow-up:
+```
+
+## Recommended Next 30 Days
+
+1. Validate replay isolation with a golden tool policy suite.
+2. Run current backend unit tests around skill learning replay and publish gates.
+3. Add a small reviewer decision summary above raw replay details.
+4. Run 5-8 reviewer usability sessions using seeded draft cases.
+5. Audit accepted run coverage for top skills and identify gaps.
+6. Decide v1 gate thresholds for blocked coverage, confidence, and preservation failure.
+7. Add operational logging and metrics for replay status, latency, and provider skips.
+
+## Recommended Next 90 Days
+
+1. Replace or augment deterministic surrogate scoring with structured LLM judgment and human-labeled calibration cases.
+2. Add replay readiness scoring before eval starts.
+3. Improve preservation from section presence to diff-based critical instruction detection.
+4. Add a customer-facing, exportable audit summary for enterprise pilot conversations.
+5. Build a replay operations dashboard.
+6. Introduce deployment-level policy profiles only after default policies produce stable data.
+7. Track skill quality across versions and post-publish regression reports.
+
+## Biggest Risks
+
+| Risk | Severity | Mitigation |
+| --- | --- | --- |
+| Replay accidentally mutates production state | Critical | Golden policy tests, isolated namespaces, external writes routed to surrogate by default, P0 alert |
+| Surrogate scoring gives false confidence | High | Human-labeled calibration set, show low confidence clearly, no automatic publish |
+| Reviewers ignore reports that are too complex | High | Decision summary, comprehension testing, action-oriented UI copy |
+| Accepted run data is too sparse | High | Readiness score, fallback to explicit skipped/low-evidence state, collect more accepted cases |
+| Publish gates block too many useful skills | Medium | Shadow mode calibration and override with explicit review rationale |
+| Evaluation costs or latency grow quickly | Medium | Cap cases, cache web/search, track p95 latency, async background eval |
+
+## Recommended Immediate Actions
+
+1. Treat Skill Replay Eval as a v1 trust gate, not a complete benchmark.
+2. Keep human review mandatory for publish.
+3. Do not execute production third-party writes during automatic replay.
+4. Add reviewer-facing explanations before adding more raw report data.
+5. Validate isolation and surrogate quality before broad rollout.
+6. Use the first pilot to learn threshold calibration, not to claim perfect quality measurement.
--- a/docs/superpowers/plans/2026-06-08-skill-replay-eval.md
+++ b/docs/superpowers/plans/2026-06-08-skill-replay-eval.md
--- a/docs/superpowers/specs/2026-06-08-skill-replay-eval-design.md
+++ b/docs/superpowers/specs/2026-06-08-skill-replay-eval-design.md
@ -0,0 +1,219 @@
+# Skill Replay Eval Design
+
+Related product planning artifacts:
+
+- [Product Discovery Report](../../product-discovery/skill-replay-eval/product-discovery-report.md)
+- [PRD](../../product-discovery/skill-replay-eval/PRD-skill-replay-eval.md)
+- [Launch And Maintenance Runbook](../../product-discovery/skill-replay-eval/launch-maintenance-runbook.md)
+
+## Goal
+
+Improve skill draft evaluation so it measures real task behavior instead of relying on heuristic draft scoring. The new evaluation must cover every tool involved in a skill, while separating tools that can be executed safely from tools that require LLM surrogate judgment.
+
+This design also stops revision draft generation from dropping important content from the original skill by making base skill preservation an explicit contract.
+
+## Current State
+
+`SkillDraftEvaluator` currently builds a lightweight report from `candidate.source_run_ids`. It scores each historical run from `validation_result.score`, or from a success-based fallback when that score is missing, then estimates the candidate score from the draft text. It does not replay the task, does not execute tools, and does not compare old skill behavior with draft skill behavior.
+
+`SkillDraftSynthesizer` currently receives candidate reason, related skill names, tool names, task summaries, and session excerpts. For revision and merge drafts, it does not receive the full base skill frontmatter and body, so generated drafts can accidentally omit important original instructions.
+
+## Design Principles
+
+- All tools are part of evaluation coverage.
+- Safe tools execute in an isolated replay environment.
+- Unsafe or unavailable tools are not ignored; they are evaluated through an LLM surrogate using intended tool calls, schema, arguments, historical evidence, and expected effects.
+- Evaluation reports must disclose execution coverage and surrogate coverage separately.
+- Revision drafts must preserve original skill content unless a change is explicitly justified.
+- Replay runs must not write to production workspace, user files, memory, third-party accounts, or external systems by default.
+
+## Evaluation Model
+
+Each draft eval selects up to 10 historical cases. If fewer than 10 eligible cases exist, use as many as available. If more than 10 exist, select the 10 most relevant cases.
+
+For `revise_skill`, select accepted historical runs that activated the target skill/version. Prefer recent accepted runs, then diversify by task and session.
+
+For `new_skill`, select candidate source runs and accepted runs with similar task themes.
+
+For `merge_skills`, select accepted runs where the related skills co-activated.
+
+Each case runs two arms:
+
+- Baseline arm: no skill for `new_skill`, old skill for `revise_skill`, or old related skills for `merge_skills`.
+- Candidate arm: draft skill injected as pinned draft guidance.
+
+Both arms use the same task text, same bounded historical context, same model settings, same max tool iterations, and same replay policy.
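+
+A minimal sketch of the selection rule for `revise_skill`, assuming runs are already filtered to those that activated the target skill. The `Run` shape, its field names, and this `select_cases` helper are hypothetical and do not describe the actual `select_replay_cases()` implementation. Only the cap of 10, the preference for recent accepted runs, and the diversification by task and session come from the text above.
+
+```python
+from dataclasses import dataclass
+
+MAX_CASES = 10  # cap stated in the design
+
+
+@dataclass(frozen=True)
+class Run:
+    """Hypothetical shape of a historical run record."""
+    run_id: str
+    task_id: str
+    session_id: str
+    finished_at: float  # epoch seconds
+    accepted: bool
+
+
+def select_cases(runs: list[Run], limit: int = MAX_CASES) -> list[Run]:
+    """Pick accepted runs, newest first, preferring unseen tasks and sessions."""
+    accepted = sorted(
+        (r for r in runs if r.accepted),
+        key=lambda r: r.finished_at,
+        reverse=True,
+    )
+    picked, deferred = [], []
+    seen_tasks, seen_sessions = set(), set()
+    for run in accepted:
+        if run.task_id in seen_tasks or run.session_id in seen_sessions:
+            deferred.append(run)  # repeated task or session: keep only as filler
+            continue
+        picked.append(run)
+        seen_tasks.add(run.task_id)
+        seen_sessions.add(run.session_id)
+    # Diverse runs come first; repeats fill any remaining slots up to the cap.
+    return (picked + deferred)[:limit]
+
+
+# Baseline arm per candidate type, as listed in the design.
+BASELINE_ARM = {
+    "new_skill": "no skill",
+    "revise_skill": "old skill",
+    "merge_skills": "old related skills",
+}
+```
+
+Using repeated tasks only as filler keeps the case set as large as possible when history is sparse, while still putting the most diverse recent runs first.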
+
+## Tool Execution Modes
+
+Each tool call in replay resolves to one of these modes:
+
+- `executed`: Tool was safely executed in replay environment.
+- `surrogate`: Tool was not executed, but the intended call and expected effect were evaluated by an LLM.
+- `blocked`: Tool could not be executed or judged reliably.
+
+The goal is not to exclude third-party tools. It is to include them with the strongest safe evaluation method available.
+
+Examples:
+
+- Filesystem reads and writes run against a temporary workspace clone.
+- User file writes run against a temporary user-file namespace when available.
+- Web/search reads can execute and cache outputs.
+- Email/calendar/message sending to production systems does not execute by default. The replay records the intended call and evaluates it through surrogate judgment unless a sandbox/test connector is configured.
+- Destructive actions such as delete, payment, permission changes, or irreversible external writes default to surrogate or blocked.
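+
+The examples above can be read as a small decision table. The sketch below is one possible encoding, not the policy used by `ReplayToolExecutor`: the category labels, the `classify` function, and both flags are hypothetical names introduced for illustration.
+
+```python
+from enum import Enum
+
+
+class ToolMode(str, Enum):
+    EXECUTED = "executed"
+    SURROGATE = "surrogate"
+    BLOCKED = "blocked"
+
+
+# Hypothetical category labels; a real policy may key on toolset or metadata.
+SANDBOXED = {"filesystem_read", "filesystem_write", "user_file_write", "web_read"}
+EXTERNAL_WRITES = {"email_send", "calendar_write", "message_send"}
+
+
+def classify(
+    category: str,
+    sandbox_connector: bool = False,
+    surrogate_available: bool = True,
+) -> ToolMode:
+    """Resolve a tool category to a replay mode."""
+    if category in SANDBOXED:
+        # Runs against the temporary workspace or user-file namespace.
+        return ToolMode.EXECUTED
+    if category in EXTERNAL_WRITES and sandbox_connector:
+        # Only executes when a sandbox/test connector is configured.
+        return ToolMode.EXECUTED
+    # External writes without a sandbox, destructive actions, and unknown
+    # tools are never executed: judge the intended call, or block it when
+    # no reliable judgment is available.
+    return ToolMode.SURROGATE if surrogate_available else ToolMode.BLOCKED
+
+
+if __name__ == "__main__":
+    for name in ("filesystem_write", "email_send", "payment"):
+        print(name, "->", classify(name).value)
+```
+
+Treating unknown categories like destructive ones is a deliberately conservative default in this sketch: a tool has to be listed explicitly before it can execute.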
+
+## Replay Environment
+
+The replay runner creates isolated state per case and arm:
+
+- Temporary session id.
+- Temporary workspace root.
+- Temporary task id or replay id.
+- Tool call trace.
+- Output artifacts.
+- Side-effect journal.
+- Captured final answer.
+
+This follows the OfficeBench MCP pattern: run tools in an isolated testbed where possible, pull artifacts and state after execution, then evaluate outside the runner. Beaver should reuse this shape without depending on OfficeBench's fixed benchmark functions.
+
+## Surrogate Evaluation
+
+When a tool cannot be safely executed, the agent should still be allowed to plan or attempt the tool call. The replay layer records:
+
+- Tool name.
+- Tool schema.
+- Arguments.
+- Tool classification reason.
+- Historical accepted evidence.
+- Expected side effect inferred from the task.
+- Any assistant rationale around the call.
+
+The surrogate evaluator compares baseline and candidate intended effects. It scores whether the intended tool use would satisfy the task, whether arguments are complete and correct, and whether the call is risky, missing, duplicated, or unnecessary.
+
+Surrogate scoring contributes to the final candidate score, but lowers confidence compared with real execution.
+
+## Scoring
+
+Each case produces:
+
+- `baseline_score`
+- `candidate_score`
+- `delta`
+- `execution_coverage`
+- `surrogate_coverage`
+- `blocked_tool_count`
+- `confidence`
+- `tool_calls`
+- `artifacts`
+- `side_effects`
+- `validator_notes`
+
+The draft report aggregates:
+
+- Baseline mean.
+- Candidate mean.
+- Score delta.
+- Improved count.
+- Regression count.
+- Unchanged count.
+- Execution coverage.
+- Surrogate coverage.
+- Blocked coverage.
+- Confidence.
+
+Publish gates should consider both score and confidence. A passing score with low confidence should require stronger human review, not automatic trust.
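+
+A minimal sketch of the aggregation step, assuming each case records how many tool calls fell into each mode. `CaseResult` and `aggregate` are hypothetical names, and the tolerance for "unchanged" is an assumption. Coverage is computed as calls in a mode divided by all replay tool calls. The confidence formula is not specified in this design, so it is left out.
+
+```python
+from dataclasses import dataclass
+from statistics import mean
+
+
+@dataclass
+class CaseResult:
+    """Hypothetical per-case summary used only for this sketch."""
+    baseline_score: float
+    candidate_score: float
+    executed: int   # tool calls that ran in the replay environment
+    surrogate: int  # tool calls judged without execution
+    blocked: int    # tool calls neither executed nor judged
+
+
+def aggregate(cases: list[CaseResult], eps: float = 1e-9) -> dict:
+    """Roll per-case results up into draft-level report fields."""
+    if not cases:
+        raise ValueError("no replay cases to aggregate")
+    deltas = [c.candidate_score - c.baseline_score for c in cases]
+    total_calls = sum(c.executed + c.surrogate + c.blocked for c in cases)
+
+    def share(count: int) -> float:
+        return count / total_calls if total_calls else 0.0
+
+    return {
+        "baseline_score_avg": mean(c.baseline_score for c in cases),
+        "candidate_score_avg": mean(c.candidate_score for c in cases),
+        "score_delta": mean(deltas),
+        "improved_count": sum(d > eps for d in deltas),
+        "regression_count": sum(d < -eps for d in deltas),
+        "unchanged_count": sum(abs(d) <= eps for d in deltas),
+        "execution_coverage": share(sum(c.executed for c in cases)),
+        "surrogate_coverage": share(sum(c.surrogate for c in cases)),
+        "blocked_coverage": share(sum(c.blocked for c in cases)),
+    }
+```
+
+Computing coverage over all calls rather than averaging per-case ratios keeps a case with one tool call from weighing as much as a case with twenty.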
+
+## Draft Preservation
+
+Revision and merge synthesis must include base skill snapshots:
+
+- Base skill name.
+- Base version.
+- Full base frontmatter.
+- Full base content.
+- Tool hints.
+- Current published summary.
+
+The synthesis prompt must require the model to preserve existing instructions unless it explicitly changes them. The output remains a full proposed skill body, but it should also include:
+
+- `preserved_sections`
+- `changed_sections`
+- `dropped_sections`
+- `change_reason`
+
+After generation, a preservation checker compares base content and draft content. If critical sections disappear without explanation, the draft eval should mark preservation risk and require revision before approval.
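
One simple way to approximate the preservation check is to compare Markdown headings between base and draft. This is a sketch under that assumption. The design does not define what counts as a "critical section", so heading presence is only a stand-in, and `preservation_report` is a hypothetical name.

```python
import re


def section_titles(markdown_text):
    """Return heading titles in document order."""
    pattern = re.compile(r"^#{1,6}\s+(.+)$", re.MULTILINE)
    return [m.group(1).strip() for m in pattern.finditer(markdown_text)]


def preservation_report(base_content, draft_content, declared_dropped=()):
    base = section_titles(base_content)
    draft = set(section_titles(draft_content))
    declared = set(declared_dropped)
    missing = [title for title in base if title not in draft]
    # A drop is "unexplained" when the model did not list it in dropped_sections.
    unexplained = [title for title in missing if title not in declared]
    return {
        "preserved_sections": [title for title in base if title in draft],
        "dropped_sections": missing,
        "unexplained_drops": unexplained,
        "preservation_risk": bool(unexplained),
    }
```

A real checker would likely also compare section bodies, since a heading can survive while its instructions are gutted.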
+
+## API And Storage
+
+The existing `SkillDraftEvalReport` should be extended rather than replaced.
+
+Add fields for:
+
+- `eval_version`
+- `mode`, with values such as `heuristic` or `replay`
+- `execution_coverage`
+- `surrogate_coverage`
+- `blocked_coverage`
+- `confidence`
+- `case_reports`
+- `tool_mode_summary`
+- `preservation_report`
+
+The existing simple fields remain for UI compatibility: `passed`, `baseline_score_avg`, `candidate_score_avg`, `score_delta`, `regression_count`, `improved_count`, `unchanged_count`, `cases`, and `status`.
+
+## UI
+
+The Skills draft review page should continue to show a concise summary first:
+
+- Passed or failed.
+- Baseline mean.
+- Candidate mean.
+- Delta.
+- Execution coverage.
+- Surrogate coverage.
+- Confidence.
+
+Detailed sections show:
+
+- Replay cases.
+- Tool calls by mode.
+- Blocked or surrogate reasons.
+- Artifacts and side effects.
+- Preservation report for revision drafts.
+- Raw eval payload.
+
+The user should not need to configure per-tool policies for normal use. The report should explain coverage and uncertainty after the fact.
+
+## Error Handling
+
+If replay infrastructure fails before any case runs, eval status is `replay_error` and the draft cannot rely on a replay pass.
+
+If some cases fail but others complete, eval status is `partial` and confidence is reduced.
+
+If a provider is unavailable, keep the current skipped-provider behavior but mark the report as having no replay evidence.
+
+If all important tool calls become `blocked`, the draft should not pass automatically even if surrogate scoring is high.
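
The four rules above can be read as a small decision function. The sketch below assumes the caller already knows how many cases completed and which mode each important tool call ended up in. Only `replay_error`, `partial`, and `blocked` come from this document. The `skipped` and `completed` status names, the `real` mode name, and the `EvalOutcome` type are placeholders.

```python
from dataclasses import dataclass


@dataclass
class EvalOutcome:
    status: str
    has_replay_evidence: bool
    auto_pass_allowed: bool
    reduce_confidence: bool = False


def resolve_outcome(*, provider_available, infra_failed, cases_total,
                    cases_completed, important_tool_modes):
    if not provider_available:
        # Existing skipped-provider behavior is kept; "skipped" is a placeholder name.
        return EvalOutcome("skipped", has_replay_evidence=False, auto_pass_allowed=False)
    if infra_failed:
        # Infrastructure failed before any case ran.
        return EvalOutcome("replay_error", has_replay_evidence=False, auto_pass_allowed=False)
    partial = cases_completed < cases_total
    modes = list(important_tool_modes)
    all_blocked = bool(modes) and all(mode == "blocked" for mode in modes)
    return EvalOutcome(
        status="partial" if partial else "completed",
        has_replay_evidence=True,
        # Only the all-blocked rule is applied here; score gates live elsewhere.
        auto_pass_allowed=not all_blocked,
        reduce_confidence=partial,
    )
```

Keeping this logic in one place makes it easier to unit test the publish gate against each failure mode.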
+
+## Testing
+
+Unit tests should cover:
+
+- Historical case selection for new, revise, and merge candidates.
+- Baseline and candidate arm construction.
+- Tool mode classification and aggregation.
+- Surrogate scoring payload construction.
+- Preservation checker behavior.
+- Publish gate behavior for low-confidence or blocked reports.
+
+Integration-style tests should use stub tools:
+
+- A safe filesystem write tool that writes to the temporary workspace.
+- An external write tool that is intercepted into surrogate mode.
+- A mixed case where candidate improves one real artifact and one surrogate side effect.
+
+## Out Of Scope
+
+- Real production third-party writes during automatic replay.
+- Full Docker orchestration for all Beaver replay cases in the first implementation.
+- Per-tool user policy UI.
+- Replacing human review. Replay improves evidence but does not remove review gates.
Author	SHA1	Message	Date
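steven_li	fc9fd93c36	feat: support multilingual prompt localization and UI improvements - Add a prompt_locale parameter supporting Simplified Chinese, Traditional Chinese, and English prompt localization - Remove the built-in agents configuration to simplify the system architecture - Update ContextBuilder to use dynamic prompt templates instead of hard-coded content - Pass the locale parameter through AgentLoop, the Web interface, and AgentService - Add an output-language instruction so user-facing content is generated in the specified language - Extend the frontend LanguageSwitcher component to support three language options - Improve responsive layout and text truncation in the Header and sidebar components - Update test cases to verify prompt correctness under each locale	2026-06-10 16:11:05 +08:00
steven_li	9cc3334ea7	``` feat(app-instance): add an Outlook MCP call timeout configuration option. Adds the OUTLOOK_MCP_CALL_TIMEOUT_SECONDS environment variable, default 60 seconds, which controls how long the backend waits for an Outlook MCP call. Adds the corresponding command-line argument parsing and handling to the create-instance.sh script, and updates the related deploy-control configuration and test cases. BREAKING CHANGE: the new configuration option may require adjustments to existing deployments. ```	2026-06-09 14:23:37 +08:00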
steven_li	dc4c6f313d	fix(providers): avoid chat template body for vllm mistral	2026-06-09 13:19:09 +08:00
steven_li	9e2c02a333	feat(skills-ui): show replay eval coverage	2026-06-08 13:38:10 +08:00
steven_li	b9171998b9	feat(skill-learning): gate publish on replay confidence	2026-06-08 13:36:55 +08:00
steven_li	64d789a3d0	feat(skill-learning): produce replay eval reports	2026-06-08 13:35:58 +08:00
steven_li	cc1bf85517	feat(skill-learning): run replay arms through agent loop	2026-06-08 13:33:53 +08:00
steven_li	4c8bc53d33	feat(skill-learning): add surrogate tool evaluator	2026-06-08 13:33:02 +08:00
steven_li	70014c0f70	feat(engine): allow replay tool executor injection	2026-06-08 13:32:14 +08:00
steven_li	eb69bb168a	feat(skill-learning): add replay tool policy	2026-06-08 13:31:13 +08:00
steven_li	7287e93f87	feat(skill-learning): select replay eval cases	2026-06-08 13:30:00 +08:00
steven_li	a925f0e77f	feat(skill-learning): preserve base skill during synthesis	2026-06-08 13:28:41 +08:00
steven_li	6dc580ab26	feat(skill-learning): add draft preservation checks	2026-06-08 13:27:10 +08:00
steven_li	3a16dc283d	feat(skill-learning): extend eval report payload	2026-06-08 13:26:12 +08:00
steven_li	0fd4df3c17	docs: plan skill replay eval implementation	2026-06-08 11:26:07 +08:00
steven_li	f46a435bab	docs: refine skill replay case selection	2026-06-08 10:46:35 +08:00
steven_li	a28254c6b8	docs: design skill replay eval	2026-06-08 10:29:39 +08:00