# Sample Data Spec

## Goals

This document defines the mock data format that the SOC Memory POC uses while no real data is available. The mock data is used to:

- validate the ingestion pipeline
- validate the normalization scripts
- validate context retrieval
- validate the case summary and memory commit flow

Only two scenario types are covered for now:

- Phishing email
- O365 anomalous login / suspected account compromise

## Directory Layout

```text
evaluation/datasets/
├── mock_cases/
│   ├── phishing/
│   └── o365_suspicious_login/
└── mock_kb/
    ├── playbooks/
    ├── kb/
    └── reports/
```

## Raw Mock Case Format

Each case is stored as one JSON file. Suggested file name:

```text
.json
```

### Field Definitions

| Field | Type | Required | Description |
|---|---|---:|---|
| `case_id` | string | Yes | Unique case ID |
| `title` | string | Yes | Short title |
| `scenario` | string | Yes | `phishing` or `o365_suspicious_login` |
| `alert_type` | string | Yes | Alert type |
| `severity` | string | Yes | `low` / `medium` / `high` / `critical` |
| `status` | string | Yes | `confirmed` / `false_positive` / `pending` |
| `time_window` | object | Yes | Start and end time |
| `summary` | string | Yes | One-sentence summary |
| `alert_source` | string | Yes | System that raised the alert |
| `entities` | object | Yes | Key entities |
| `observables` | object | No | IOCs / observables |
| `evidence` | array | Yes | List of key evidence |
| `investigation_steps` | array | Yes | Key investigation steps |
| `conclusion` | object | Yes | Triage conclusion |
| `related_refs` | object | No | Related KB / playbook / case |
| `lessons_learned` | array | No | Reusable lessons |
| `tags` | array | No | Tags |

### Example Skeleton

```json
{
  "case_id": "CASE-2026-0001",
  "title": "Potential phishing email targeting finance user",
  "scenario": "phishing",
  "alert_type": "mail_suspicious_attachment",
  "severity": "high",
  "status": "confirmed",
  "time_window": {
    "start": "2026-04-01T09:10:00+08:00",
    "end": "2026-04-01T11:30:00+08:00"
  },
  "summary": "Finance user received an invoice-themed phishing email with a malicious HTML attachment.",
  "alert_source": "Secure Email Gateway",
  "entities": {
    "users": ["alice@corp.example"],
    "hosts": ["FIN-LAPTOP-12"],
    "mailboxes": ["alice@corp.example"]
  },
  "observables": {
    "sender_emails": ["billing@vendor-payments.com"],
    "domains": ["vendor-payments.com"],
    "urls": ["https://vendor-payments-login.com/review"],
    "hashes": ["sha256:..."],
    "ips": ["198.51.100.20"]
  },
  "evidence": [
    "The sender domain was newly observed and failed DMARC.",
    "The attachment redirected the user to a credential harvesting page."
  ],
  "investigation_steps": [
    "Validate sender reputation and authentication results.",
    "Detonate attachment in sandbox.",
    "Check click telemetry and account sign-in logs."
  ],
  "conclusion": {
    "verdict": "true_positive",
    "reason": "Multiple aligned phishing indicators and confirmed click behavior.",
    "recommended_actions": [
      "Reset the impacted account password.",
      "Block the sender domain and landing URL."
    ]
  },
  "related_refs": {
    "playbooks": ["PB-PHISH-001"],
    "kb": ["KB-PHISH-HEADER-CHECK"],
    "cases": []
  },
  "lessons_learned": [
    "Invoice-themed phishing remains effective against finance users."
  ],
  "tags": ["phishing", "email", "credential-harvest"]
}
```

## Raw Mock KB / Playbook Format

Each knowledge entry is stored as one JSON file. Suggested file name:

```text
.json
```

### Field Definitions

| Field | Type | Required | Description |
|---|---|---:|---|
| `doc_id` | string | Yes | Unique document ID |
| `doc_type` | string | Yes | `kb` / `playbook` / `report_summary` |
| `title` | string | Yes | Title |
| `scenario` | string | Yes | Applicable scenario |
| `summary` | string | Yes | Core summary |
| `applicability` | array | No | Conditions under which the entry applies |
| `key_points` | array | Yes | Core knowledge points |
| `investigation_guidance` | array | No | Investigation guidance |
| `decision_points` | array | No | Key decision points |
| `related_entities` | object | No | Related entities / TTPs / IOCs |
| `related_refs` | object | No | Related documents |
| `tags` | array | No | Tags |
| `updated_at` | string | No | Last updated time |

## Normalized Output Targets

### Normalized Case Structure

Suggested output fields for the normalization script:

- `id`
- `memory_type` = `case`
- `scenario`
- `title`
- `abstract`
- `verdict`
- `severity`
- `entities`
- `observables`
- `evidence`
- `patterns`
- `related_refs`
- `source_path`
- `tags`

### Normalized KB Structure

Suggested output fields for the normalization script:

- `id`
- `memory_type` = `knowledge`
- `doc_type`
- `scenario`
- `title`
- `abstract`
- `key_points`
- `investigation_guidance`
- `decision_points`
- `related_refs`
- `source_path`
- `tags`

## Retrieval Test Suggestions

During the mock data phase, validate the following first:

- Whether a phishing case recalls the phishing playbook and similar phishing cases
- Whether an O365 anomalous-login case recalls the anomalous-login KB and similar cases
- Whether true-positive and false-positive cases can be distinguished and keep distinct patterns
- Whether the recalled results include the key evidence / decision points
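
The required fields and enumerated values in the mock case field table could be checked before ingestion with a small validator. The following is a minimal sketch, not part of the POC itself: the function name `validate_case` and the constant names are hypothetical, and the spec does not say how strictly types should be enforced.

```python
# Required top-level fields and their expected JSON types, copied from the
# mock case field definition table (optional fields are not checked here).
REQUIRED_CASE_FIELDS = {
    "case_id": str,
    "title": str,
    "scenario": str,
    "alert_type": str,
    "severity": str,
    "status": str,
    "time_window": dict,
    "summary": str,
    "alert_source": str,
    "entities": dict,
    "evidence": list,
    "investigation_steps": list,
    "conclusion": dict,
}

# Enumerated values listed in the same table.
ALLOWED_VALUES = {
    "scenario": {"phishing", "o365_suspicious_login"},
    "severity": {"low", "medium", "high", "critical"},
    "status": {"confirmed", "false_positive", "pending"},
}


def validate_case(case: dict) -> list:
    """Return a list of problems found; an empty list means the case passed."""
    problems = []
    for field, expected_type in REQUIRED_CASE_FIELDS.items():
        if field not in case:
            problems.append(f"missing required field: {field}")
        elif not isinstance(case[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    for field, allowed in ALLOWED_VALUES.items():
        value = case.get(field)
        if isinstance(value, str) and value not in allowed:
            problems.append(f"{field} has unexpected value: {value!r}")
    return problems
```

A loader would typically call `json.load` on each file under `mock_cases/` and pass the resulting dict to `validate_case`, rejecting or logging any file whose problem list is non-empty.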
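
To make the normalized case structure more concrete, here is one possible mapping from a raw mock case to the suggested output fields. This is a sketch under stated assumptions: the spec lists the target fields but does not define how each one is derived, so the mappings `abstract` from `summary`, `verdict` from `conclusion.verdict`, and `patterns` from `lessons_learned` are guesses, and `normalize_case` is a hypothetical name.

```python
def normalize_case(raw: dict, source_path: str) -> dict:
    """Map a raw mock case dict to the suggested normalized case fields."""
    conclusion = raw.get("conclusion", {})
    return {
        "id": raw["case_id"],
        "memory_type": "case",  # fixed value given in the spec
        "scenario": raw["scenario"],
        "title": raw["title"],
        "abstract": raw["summary"],  # assumption: abstract reuses summary
        "verdict": conclusion.get("verdict"),  # assumption: taken from conclusion
        "severity": raw["severity"],
        "entities": raw["entities"],
        "observables": raw.get("observables", {}),  # optional in the raw format
        "evidence": raw["evidence"],
        "patterns": raw.get("lessons_learned", []),  # assumption: see lead-in
        "related_refs": raw.get("related_refs", {}),
        "source_path": source_path,
        "tags": raw.get("tags", []),
    }
```

Optional raw fields fall back to empty containers so that downstream retrieval code can iterate over them without extra checks. A KB normalizer could follow the same shape with `memory_type` set to `knowledge`.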