Sample Data Spec

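Goal

This document defines the mock data formats that the SOC Memory POC uses while no real data is available. The mock data is used to:

  • Validate the ingestion pipeline
  • Validate the normalization scripts
  • Validate context retrieval
  • Validate the case summary and memory commit flow

Only two scenarios are covered for now:

  • Phishing email
  • O365 anomalous sign-in / suspected account compromise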

Directory conventions

evaluation/datasets/
├── mock_cases/
│   ├── phishing/
│   └── o365_suspicious_login/
└── mock_kb/
    ├── playbooks/
    ├── kb/
    └── reports/
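
As a minimal sketch of how a script might set up and scan this layout: the subdirectory names come from the tree above, while the helper names `ensure_layout` and `list_json_files` and the choice of root path are hypothetical.

```python
from __future__ import annotations

from pathlib import Path

# Subdirectories taken from the tree above, relative to evaluation/datasets/.
LAYOUT = [
    "mock_cases/phishing",
    "mock_cases/o365_suspicious_login",
    "mock_kb/playbooks",
    "mock_kb/kb",
    "mock_kb/reports",
]


def ensure_layout(root: Path) -> list[Path]:
    """Create any missing dataset directories and return all of them."""
    paths = []
    for rel in LAYOUT:
        path = root / rel
        path.mkdir(parents=True, exist_ok=True)
        paths.append(path)
    return paths


def list_json_files(root: Path, subdir: str) -> list[Path]:
    """Return the sorted *.json files directly under root/subdir."""
    return sorted((root / subdir).glob("*.json"))
```

For example, `list_json_files(Path("evaluation/datasets"), "mock_cases/phishing")` would return every phishing case file in a stable order.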

Mock Case raw format

Each case is stored as one JSON file. Suggested filename:

<case_id>.json

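Field definitions

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| case_id | string | | Unique case ID |
| title | string | | Short title |
| scenario | string | | phishing / o365_suspicious_login |
| alert_type | string | | Alert type |
| severity | string | | low / medium / high / critical |
| status | string | | confirmed / false_positive / pending |
| time_window | object | | Start and end time |
| summary | string | | One-sentence summary |
| alert_source | string | | System that raised the alert |
| entities | object | | Key entities |
| observables | object | | IOCs / observables |
| evidence | array | | Key evidence items |
| investigation_steps | array | | Key investigation steps |
| conclusion | object | | Triage conclusion |
| related_refs | object | | Related KB / playbook / case |
| lessons_learned | array | | Reusable lessons |
| tags | array | | Tags |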
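
The field table can be turned into a simple structural check. The sketch below is one possible approach, not the project's validator: `validate_case` is a hypothetical name, and because the Required column carries no values here, the code assumes every field is required.

```python
from __future__ import annotations

# Expected JSON type per field, from the table above
# (string -> str, object -> dict, array -> list).
CASE_FIELD_TYPES = {
    "case_id": str, "title": str, "scenario": str, "alert_type": str,
    "severity": str, "status": str, "time_window": dict, "summary": str,
    "alert_source": str, "entities": dict, "observables": dict,
    "evidence": list, "investigation_steps": list, "conclusion": dict,
    "related_refs": dict, "lessons_learned": list, "tags": list,
}

# Allowed values listed in the table's description column.
CASE_ENUMS = {
    "scenario": {"phishing", "o365_suspicious_login"},
    "severity": {"low", "medium", "high", "critical"},
    "status": {"confirmed", "false_positive", "pending"},
}


def validate_case(case: dict) -> list[str]:
    """Return a list of problems; an empty list means the case passed."""
    problems = []
    for field, expected in CASE_FIELD_TYPES.items():
        if field not in case:
            # Assumption: all fields are treated as required.
            problems.append(f"missing field: {field}")
        elif not isinstance(case[field], expected):
            problems.append(f"{field}: expected {expected.__name__}")
    for field, allowed in CASE_ENUMS.items():
        value = case.get(field)
        if isinstance(value, str) and value not in allowed:
            problems.append(f"{field}: unexpected value {value!r}")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to fix a hand-written mock file in a single pass.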

Example skeleton

{
  "case_id": "CASE-2026-0001",
  "title": "Potential phishing email targeting finance user",
  "scenario": "phishing",
  "alert_type": "mail_suspicious_attachment",
  "severity": "high",
  "status": "confirmed",
  "time_window": {
    "start": "2026-04-01T09:10:00+08:00",
    "end": "2026-04-01T11:30:00+08:00"
  },
  "summary": "Finance user received an invoice-themed phishing email with a malicious HTML attachment.",
  "alert_source": "Secure Email Gateway",
  "entities": {
    "users": ["alice@corp.example"],
    "hosts": ["FIN-LAPTOP-12"],
    "mailboxes": ["alice@corp.example"]
  },
  "observables": {
    "sender_emails": ["billing@vendor-payments.com"],
    "domains": ["vendor-payments.com"],
    "urls": ["https://vendor-payments-login.com/review"],
    "hashes": ["sha256:..."],
    "ips": ["198.51.100.20"]
  },
  "evidence": [
    "The sender domain was newly observed and failed DMARC.",
    "The attachment redirected the user to a credential harvesting page."
  ],
  "investigation_steps": [
    "Validate sender reputation and authentication results.",
    "Detonate attachment in sandbox.",
    "Check click telemetry and account sign-in logs."
  ],
  "conclusion": {
    "verdict": "true_positive",
    "reason": "Multiple aligned phishing indicators and confirmed click behavior.",
    "recommended_actions": [
      "Reset the impacted account password.",
      "Block the sender domain and landing URL."
    ]
  },
  "related_refs": {
    "playbooks": ["PB-PHISH-001"],
    "kb": ["KB-PHISH-HEADER-CHECK"],
    "cases": []
  },
  "lessons_learned": [
    "Invoice-themed phishing remains effective against finance users."
  ],
  "tags": ["phishing", "email", "credential-harvest"]
}
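
Two properties of the skeleton are easy to check mechanically: the `time_window` timestamps are ISO 8601 strings with a UTC offset, and the suggested filename derives from `case_id`. The helpers below are a hedged sketch; `window_duration` and `expected_filename` are hypothetical names, and rejecting a window whose end precedes its start is an assumption rather than a stated rule.

```python
from datetime import datetime, timedelta


def window_duration(case: dict) -> timedelta:
    """Parse time_window and return end - start.

    Raises ValueError if a timestamp is not ISO 8601 or if the window
    ends before it starts (the ordering rule is an assumption).
    """
    window = case["time_window"]
    start = datetime.fromisoformat(window["start"])
    end = datetime.fromisoformat(window["end"])
    if end < start:
        raise ValueError("time_window.end is earlier than time_window.start")
    return end - start


def expected_filename(case: dict) -> str:
    """Build the suggested <case_id>.json filename for a case."""
    return f"{case['case_id']}.json"
```

Both timestamps should carry an offset, as in the skeleton; comparing an offset-aware timestamp with a naive one raises `TypeError` in Python.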

Mock KB / Playbook raw format

Each knowledge entry is stored as one JSON file. Suggested filename:

<doc_id>.json

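Field definitions

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| doc_id | string | | Unique document ID |
| doc_type | string | | kb / playbook / report_summary |
| title | string | | Title |
| scenario | string | | Applicable scenario |
| summary | string | | Core summary |
| applicability | array | | Conditions under which the entry applies |
| key_points | array | | Core knowledge points |
| investigation_guidance | array | | Investigation guidance |
| decision_points | array | | Key decision points |
| related_entities | object | | Related entities / TTPs / IOCs |
| related_refs | object | | Related documents |
| tags | array | | Tags |
| updated_at | string | | Last update time |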
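
Since this section has no example skeleton, a small generator for an empty KB entry may help when authoring mock files by hand. This is only a sketch: `empty_kb_doc` is a hypothetical helper, and the empty defaults (`""`, `[]`, `{}`) are placeholders to be filled in, not meaningful values.

```python
# Field order and types follow the table above
# (string -> str, array -> list, object -> dict).
KB_FIELD_TYPES = {
    "doc_id": str, "doc_type": str, "title": str, "scenario": str,
    "summary": str, "applicability": list, "key_points": list,
    "investigation_guidance": list, "decision_points": list,
    "related_entities": dict, "related_refs": dict, "tags": list,
    "updated_at": str,
}

KB_DOC_TYPES = {"kb", "playbook", "report_summary"}


def empty_kb_doc(doc_id: str, doc_type: str) -> dict:
    """Return a KB entry with every field present and empty."""
    if doc_type not in KB_DOC_TYPES:
        raise ValueError(f"unknown doc_type: {doc_type!r}")
    # Calling the type builds its empty value: str() -> "", list() -> [].
    doc = {name: factory() for name, factory in KB_FIELD_TYPES.items()}
    doc["doc_id"] = doc_id
    doc["doc_type"] = doc_type
    return doc
```

Serializing the result with `json.dumps(doc, indent=2)` gives a file that already has every key in table order.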

Normalized output targets

Normalized Case structure

Suggested output fields for the normalization script:

  • id
  • memory_type = case
  • scenario
  • title
  • abstract
  • verdict
  • severity
  • entities
  • observables
  • evidence
  • patterns
  • related_refs
  • source_path
  • tags
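
One way the raw case could map onto these fields is sketched below. The spec does not define the mapping, so several choices are assumptions: `abstract` is taken from `summary`, `verdict` from `conclusion.verdict`, and `patterns` from `lessons_learned`. The function name `normalize_case` is hypothetical.

```python
def normalize_case(raw: dict, source_path: str) -> dict:
    """Map a raw mock case onto the suggested normalized fields."""
    conclusion = raw.get("conclusion", {})
    return {
        "id": raw["case_id"],
        "memory_type": "case",
        "scenario": raw["scenario"],
        "title": raw["title"],
        "abstract": raw.get("summary", ""),          # assumption: summary -> abstract
        "verdict": conclusion.get("verdict", ""),    # assumption: from conclusion
        "severity": raw.get("severity", ""),
        "entities": raw.get("entities", {}),
        "observables": raw.get("observables", {}),
        "evidence": raw.get("evidence", []),
        "patterns": raw.get("lessons_learned", []),  # assumption: lessons -> patterns
        "related_refs": raw.get("related_refs", {}),
        "source_path": source_path,
        "tags": raw.get("tags", []),
    }
```

Fields that have no normalized counterpart in the list above (for example `investigation_steps` and `time_window`) are dropped in this sketch.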

Normalized KB structure

Suggested output fields for the normalization script:

  • id
  • memory_type = knowledge
  • doc_type
  • scenario
  • title
  • abstract
  • key_points
  • investigation_guidance
  • decision_points
  • related_refs
  • source_path
  • tags
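
The KB mapping is mostly a pass-through. In the sketch below, `normalize_kb` is a hypothetical name and taking `abstract` from `summary` is an assumption; `applicability`, `related_entities`, and `updated_at` are omitted only because they do not appear in the list above.

```python
def normalize_kb(raw: dict, source_path: str) -> dict:
    """Map a raw KB / playbook entry onto the suggested normalized fields."""
    return {
        "id": raw["doc_id"],
        "memory_type": "knowledge",
        "doc_type": raw["doc_type"],
        "scenario": raw.get("scenario", ""),
        "title": raw["title"],
        "abstract": raw.get("summary", ""),  # assumption: summary -> abstract
        "key_points": raw.get("key_points", []),
        "investigation_guidance": raw.get("investigation_guidance", []),
        "decision_points": raw.get("decision_points", []),
        "related_refs": raw.get("related_refs", {}),
        "source_path": source_path,
        "tags": raw.get("tags", []),
    }
```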

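Retrieval test recommendations

During the mock data stage, prioritize verifying:

  • Whether a phishing case recalls the phishing playbook and similar phishing cases
  • Whether an O365 anomalous sign-in case recalls the sign-in anomaly KB and similar cases
  • Whether true-positive and false-positive cases can be told apart and keep distinct patterns
  • Whether recalled results include the key evidence / decision points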
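
The first two checks can be approximated with a recall-at-k measurement. The sketch below treats a case's `related_refs` as the expected results, which is an assumption about how ground truth would be defined. The helper names are hypothetical, and the retrieval call itself is out of scope: the code takes an already ranked list of IDs.

```python
from __future__ import annotations


def expected_ref_ids(case: dict) -> set[str]:
    """Flatten related_refs (playbooks, kb, cases) into one set of IDs."""
    refs = case.get("related_refs", {})
    return {ref_id for ids in refs.values() for ref_id in ids}


def recall_at_k(retrieved_ids: list[str], expected: set[str], k: int) -> float:
    """Fraction of expected IDs found in the top-k retrieved IDs."""
    if not expected:
        # Assumption: a case with no expected refs counts as fully recalled.
        return 1.0
    hits = expected & set(retrieved_ids[:k])
    return len(hits) / len(expected)
```

For the example case above, the expected set is `{"PB-PHISH-001", "KB-PHISH-HEADER-CHECK"}`; a ranked result that returns only the playbook within the top k would score 0.5.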