Sample Data Spec

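Goal

This document defines the mock data format that the SOC Memory POC uses while no real data is available. The mock data is used to:

Validate the ingestion pipeline
Validate the normalization scripts
Validate context retrieval
Validate the case summary and memory commit flow

Only two scenarios are currently covered:

Phishing email
O365 anomalous login / suspected account compromise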

Directory Conventions

evaluation/datasets/
├── mock_cases/
│   ├── phishing/
│   └── o365_suspicious_login/
└── mock_kb/
    ├── playbooks/
    ├── kb/
    └── reports/
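
As a rough illustration of how an ingestion step might walk this layout, the sketch below loads every JSON file under the case and KB sub-directories. The function name `load_json_files` and the two directory tuples are hypothetical helpers, not part of the POC; only the sub-directory names come from the tree above.

```python
import json
from pathlib import Path

# Sub-directory names are taken from the layout above; the root is passed in by the caller.
CASE_DIRS = ("mock_cases/phishing", "mock_cases/o365_suspicious_login")
KB_DIRS = ("mock_kb/playbooks", "mock_kb/kb", "mock_kb/reports")


def load_json_files(root, subdirs):
    """Return (path, parsed_document) pairs for every *.json file in the given sub-directories."""
    loaded = []
    for subdir in subdirs:
        directory = Path(root) / subdir
        if not directory.is_dir():
            continue  # tolerate a partially populated dataset
        for path in sorted(directory.glob("*.json")):
            with path.open(encoding="utf-8") as handle:
                loaded.append((path, json.load(handle)))
    return loaded
```

A caller could then run `load_json_files("evaluation/datasets", CASE_DIRS)` to collect all mock cases, assuming the working directory is the repository root.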

Mock Case Raw Format

Each case is stored in one JSON file. Suggested file name:

<case_id>.json

Field Definitions

Field	Type	Required	Description
`case_id`	string	Yes	Unique case ID
`title`	string	Yes	Short title
`scenario`	string	Yes	`phishing` or `o365_suspicious_login`
`alert_type`	string	Yes	Alert type
`severity`	string	Yes	`low` / `medium` / `high` / `critical`
`status`	string	Yes	`confirmed` / `false_positive` / `pending`
`time_window`	object	Yes	Start and end time (`start`, `end`)
`summary`	string	Yes	One-sentence summary
`alert_source`	string	Yes	Source system that raised the alert
`entities`	object	Yes	Key entities
`observables`	object	No	IOCs / observables
`evidence`	array	Yes	List of key evidence
`investigation_steps`	array	Yes	Key investigation steps
`conclusion`	object	Yes	Triage conclusion
`related_refs`	object	No	Related KB / playbook / case
`lessons_learned`	array	No	Reusable lessons
`tags`	array	No	Tags
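
The table above can be turned into a lightweight check that runs before ingestion. The following is a minimal sketch, assuming JSON types map to Python as string to `str`, object to `dict`, and array to `list`; `validate_case` and the three constant names are hypothetical.

```python
REQUIRED_CASE_FIELDS = {
    "case_id": str, "title": str, "scenario": str, "alert_type": str,
    "severity": str, "status": str, "time_window": dict, "summary": str,
    "alert_source": str, "entities": dict, "evidence": list,
    "investigation_steps": list, "conclusion": dict,
}
OPTIONAL_CASE_FIELDS = {
    "observables": dict, "related_refs": dict, "lessons_learned": list, "tags": list,
}
# Enumerated values listed in the field table.
ALLOWED_VALUES = {
    "scenario": {"phishing", "o365_suspicious_login"},
    "severity": {"low", "medium", "high", "critical"},
    "status": {"confirmed", "false_positive", "pending"},
}


def validate_case(case):
    """Return a list of human-readable problems; an empty list means the case passed."""
    problems = []
    for field, expected in REQUIRED_CASE_FIELDS.items():
        if field not in case:
            problems.append(f"missing required field: {field}")
        elif not isinstance(case[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    for field, expected in OPTIONAL_CASE_FIELDS.items():
        if field in case and not isinstance(case[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    for field, allowed in ALLOWED_VALUES.items():
        value = case.get(field)
        if isinstance(value, str) and value not in allowed:
            problems.append(f"{field} has unexpected value: {value}")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to fix a hand-written mock file in a single pass.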

Example Skeleton

{
  "case_id": "CASE-2026-0001",
  "title": "Potential phishing email targeting finance user",
  "scenario": "phishing",
  "alert_type": "mail_suspicious_attachment",
  "severity": "high",
  "status": "confirmed",
  "time_window": {
    "start": "2026-04-01T09:10:00+08:00",
    "end": "2026-04-01T11:30:00+08:00"
  },
  "summary": "Finance user received an invoice-themed phishing email with a malicious HTML attachment.",
  "alert_source": "Secure Email Gateway",
  "entities": {
    "users": ["alice@corp.example"],
    "hosts": ["FIN-LAPTOP-12"],
    "mailboxes": ["alice@corp.example"]
  },
  "observables": {
    "sender_emails": ["billing@vendor-payments.com"],
    "domains": ["vendor-payments.com"],
    "urls": ["https://vendor-payments-login.com/review"],
    "hashes": ["sha256:..."],
    "ips": ["198.51.100.20"]
  },
  "evidence": [
    "The sender domain was newly observed and failed DMARC.",
    "The attachment redirected the user to a credential harvesting page."
  ],
  "investigation_steps": [
    "Validate sender reputation and authentication results.",
    "Detonate attachment in sandbox.",
    "Check click telemetry and account sign-in logs."
  ],
  "conclusion": {
    "verdict": "true_positive",
    "reason": "Multiple aligned phishing indicators and confirmed click behavior.",
    "recommended_actions": [
      "Reset the impacted account password.",
      "Block the sender domain and landing URL."
    ]
  },
  "related_refs": {
    "playbooks": ["PB-PHISH-001"],
    "kb": ["KB-PHISH-HEADER-CHECK"],
    "cases": []
  },
  "lessons_learned": [
    "Invoice-themed phishing remains effective against finance users."
  ],
  "tags": ["phishing", "email", "credential-harvest"]
}

Mock KB / Playbook Raw Format

Each knowledge entry is stored in one JSON file. Suggested file name:

<doc_id>.json

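Field Definitions

Field	Type	Required	Description
`doc_id`	string	Yes	Unique document ID
`doc_type`	string	Yes	`kb` / `playbook` / `report_summary`
`title`	string	Yes	Title
`scenario`	string	Yes	Applicable scenario
`summary`	string	Yes	Core summary
`applicability`	array	No	Conditions under which the document applies
`key_points`	array	Yes	Core knowledge points
`investigation_guidance`	array	No	Investigation guidance
`decision_points`	array	No	Key decision points
`related_entities`	object	No	Related entities / TTPs / IOCs
`related_refs`	object	No	Related documents
`tags`	array	No	Tags
`updated_at`	string	No	Last update time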
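
Since this section has no example skeleton, the sketch below shows what a minimal entry could look like, together with a required-field check. The entry is purely illustrative: only the `doc_id` value is borrowed from the case example's `related_refs`, all other text is made up, and `check_doc` is a hypothetical helper.

```python
import json

REQUIRED_DOC_FIELDS = ("doc_id", "doc_type", "title", "scenario", "summary", "key_points")
DOC_TYPES = {"kb", "playbook", "report_summary"}

# Hypothetical entry: the wording below is illustrative, not real playbook content.
sample_doc = {
    "doc_id": "PB-PHISH-001",
    "doc_type": "playbook",
    "title": "Phishing email triage",
    "scenario": "phishing",
    "summary": "Steps for triaging a reported or detected phishing email.",
    "key_points": ["Check sender authentication results before trusting the sender domain."],
    "tags": ["phishing", "email"],
}


def check_doc(doc):
    """Return a list of problems found in a KB / playbook entry."""
    problems = [f"missing required field: {name}" for name in REQUIRED_DOC_FIELDS if name not in doc]
    doc_type = doc.get("doc_type")
    if doc_type is not None and (not isinstance(doc_type, str) or doc_type not in DOC_TYPES):
        problems.append(f"unknown doc_type: {doc_type}")
    return problems


if __name__ == "__main__":
    print(check_doc(sample_doc))
    print(json.dumps(sample_doc, indent=2))
```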

Normalization Output Targets

Normalized Case Structure

Suggested output fields for the normalization script:

id
memory_type = case
scenario
title
abstract
verdict
severity
entities
observables
evidence
patterns
related_refs
source_path
tags
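
One possible mapping from the raw case format to these fields is sketched below. The spec does not state where `abstract`, `verdict`, or `patterns` come from, so the choices here (`summary`, `conclusion.verdict`, and `lessons_learned` respectively) are assumptions, and `normalize_case` is a hypothetical function name.

```python
def normalize_case(raw, source_path):
    """Map a raw mock case onto the suggested normalized case fields."""
    conclusion = raw.get("conclusion", {})
    return {
        "id": raw["case_id"],
        "memory_type": "case",
        "scenario": raw["scenario"],
        "title": raw["title"],
        "abstract": raw["summary"],            # assumption: abstract <- summary
        "verdict": conclusion.get("verdict"),  # assumption: verdict <- conclusion.verdict
        "severity": raw["severity"],
        "entities": raw["entities"],
        "observables": raw.get("observables", {}),  # optional in the raw format
        "evidence": raw["evidence"],
        "patterns": list(raw.get("lessons_learned", [])),  # assumption: patterns <- lessons_learned
        "related_refs": raw.get("related_refs", {}),
        "source_path": str(source_path),
        "tags": raw.get("tags", []),
    }
```

Required raw fields are read with direct indexing so that a malformed case fails loudly, while optional ones fall back to empty containers.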

Normalized KB Structure

Suggested output fields for the normalization script:

id
memory_type = knowledge
doc_type
scenario
title
abstract
key_points
investigation_guidance
decision_points
related_refs
source_path
tags
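
The KB side could be handled the same way. In the sketch below, `normalize_doc` is a hypothetical name, and mapping `id` from `doc_id` and `abstract` from `summary` is an assumption rather than something the spec defines.

```python
def normalize_doc(raw, source_path):
    """Map a raw KB / playbook entry onto the suggested normalized knowledge fields."""
    return {
        "id": raw["doc_id"],         # assumption: id <- doc_id
        "memory_type": "knowledge",
        "doc_type": raw["doc_type"],
        "scenario": raw["scenario"],
        "title": raw["title"],
        "abstract": raw["summary"],  # assumption: abstract <- summary
        "key_points": raw["key_points"],
        # The three fields below are optional in the raw format.
        "investigation_guidance": raw.get("investigation_guidance", []),
        "decision_points": raw.get("decision_points", []),
        "related_refs": raw.get("related_refs", {}),
        "source_path": str(source_path),
        "tags": raw.get("tags", []),
    }
```

Note that the raw fields `applicability`, `related_entities`, and `updated_at` have no counterpart in the suggested output list, so this sketch drops them.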

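Retrieval Testing Recommendations

During the mock data phase, prioritize verifying:

Whether a phishing case recalls the phishing playbook and similar phishing cases
Whether an O365 anomalous login case recalls the login anomaly KB and similar cases
Whether true-positive and false-positive cases can be told apart and keep distinct patterns
Whether the recalled results contain the key evidence / decision points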
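
The first two checks can be expressed as a small assertion helper over normalized records. The sketch below uses a deliberately naive stand-in retriever (tag overlap plus a same-scenario bonus) only so that the example runs on its own; it is not the POC's retrieval method. Both `toy_retrieve` and `recall_check` are hypothetical names.

```python
def toy_retrieve(query, records, k=3):
    """Stand-in retriever: rank records by shared tags, with a bonus for the same scenario."""
    def score(record):
        shared = len(set(query.get("tags", [])) & set(record.get("tags", [])))
        return shared + (2 if record.get("scenario") == query.get("scenario") else 0)

    # Never return the query record itself.
    candidates = [r for r in records if r["id"] != query["id"]]
    return sorted(candidates, key=score, reverse=True)[:k]


def recall_check(query, results):
    """Report whether the results contain a same-scenario playbook and a same-scenario case."""
    same = [r for r in results if r.get("scenario") == query.get("scenario")]
    return {
        "playbook_recalled": any(r.get("doc_type") == "playbook" for r in same),
        "similar_case_recalled": any(r.get("memory_type") == "case" for r in same),
    }
```

In a real run, `toy_retrieve` would be replaced by whatever retrieval component the POC exposes, while `recall_check` could stay as the pass/fail criterion for each query case.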
