memory-gateway/docs/sample-data-spec.md

# Sample Data Spec

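## Goals

This document defines the mock data format used by the SOC Memory POC while no real data is available. The mock data is used to:

- Validate the ingestion pipeline
- Validate the normalization scripts
- Validate context retrieval
- Validate the case summary and memory commit flow

Only two scenarios are covered for now:

- Phishing email
- O365 suspicious login / suspected account compromise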

## Directory Conventions

```text
evaluation/datasets/
├── mock_cases/
│   ├── phishing/
│   └── o365_suspicious_login/
└── mock_kb/
    ├── playbooks/
    ├── kb/
    └── reports/
```
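As a minimal sketch of how a script might walk this layout, the helper below yields every JSON file grouped by its top-level folder and sub-directory. The function name `iter_mock_files` and the choice to skip missing folders silently are assumptions made for illustration; they are not part of this spec.

```python
from pathlib import Path
from typing import Iterator, Tuple

# Sub-directories copied from the layout above.
LAYOUT = {
    "mock_cases": ("phishing", "o365_suspicious_login"),
    "mock_kb": ("playbooks", "kb", "reports"),
}


def iter_mock_files(root: Path) -> Iterator[Tuple[str, str, Path]]:
    """Yield (group, subdir, path) for each JSON file under the dataset root.

    Hypothetical helper: `root` is expected to point at evaluation/datasets/.
    """
    for group, subdirs in LAYOUT.items():
        for subdir in subdirs:
            folder = root / group / subdir
            if not folder.is_dir():
                continue  # assumption: a partially populated dataset is fine
            for path in sorted(folder.glob("*.json")):
                yield group, subdir, path
```

Sorting the paths keeps the iteration order stable across runs, which helps when comparing ingestion results.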

## Mock Case Raw Format

Each case is stored as a single JSON file. Suggested file name:

```text
<case_id>.json
```

### Field Definitions

| Field | Type | Required | Description |
|---|---|---:|---|
| `case_id` | string | Yes | Unique case ID |
| `title` | string | Yes | Short title |
| `scenario` | string | Yes | `phishing` or `o365_suspicious_login` |
| `alert_type` | string | Yes | Alert type |
| `severity` | string | Yes | `low` / `medium` / `high` / `critical` |
| `status` | string | Yes | `confirmed` / `false_positive` / `pending` |
| `time_window` | object | Yes | Start and end time |
| `summary` | string | Yes | One-sentence summary |
| `alert_source` | string | Yes | System that raised the alert |
| `entities` | object | Yes | Key entities |
| `observables` | object | No | IOCs / observables |
| `evidence` | array | Yes | List of key evidence |
| `investigation_steps` | array | Yes | Key investigation steps |
| `conclusion` | object | Yes | Triage conclusion |
| `related_refs` | object | No | Related KB / playbook / case |
| `lessons_learned` | array | No | Reusable lessons |
| `tags` | array | No | Tags |

### Example Skeleton

```json
{
  "case_id": "CASE-2026-0001",
  "title": "Potential phishing email targeting finance user",
  "scenario": "phishing",
  "alert_type": "mail_suspicious_attachment",
  "severity": "high",
  "status": "confirmed",
  "time_window": {
    "start": "2026-04-01T09:10:00+08:00",
    "end": "2026-04-01T11:30:00+08:00"
  },
  "summary": "Finance user received an invoice-themed phishing email with a malicious HTML attachment.",
  "alert_source": "Secure Email Gateway",
  "entities": {
    "users": ["alice@corp.example"],
    "hosts": ["FIN-LAPTOP-12"],
    "mailboxes": ["alice@corp.example"]
  },
  "observables": {
    "sender_emails": ["billing@vendor-payments.com"],
    "domains": ["vendor-payments.com"],
    "urls": ["https://vendor-payments-login.com/review"],
    "hashes": ["sha256:..."],
    "ips": ["198.51.100.20"]
  },
  "evidence": [
    "The sender domain was newly observed and failed DMARC.",
    "The attachment redirected the user to a credential harvesting page."
  ],
  "investigation_steps": [
    "Validate sender reputation and authentication results.",
    "Detonate attachment in sandbox.",
    "Check click telemetry and account sign-in logs."
  ],
  "conclusion": {
    "verdict": "true_positive",
    "reason": "Multiple aligned phishing indicators and confirmed click behavior.",
    "recommended_actions": [
      "Reset the impacted account password.",
      "Block the sender domain and landing URL."
    ]
  },
  "related_refs": {
    "playbooks": ["PB-PHISH-001"],
    "kb": ["KB-PHISH-HEADER-CHECK"],
    "cases": []
  },
  "lessons_learned": [
    "Invoice-themed phishing remains effective against finance users."
  ],
  "tags": ["phishing", "email", "credential-harvest"]
}
```
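A mock case can be checked against the field table before ingestion. The following is a minimal sketch, not the project's actual validator: `validate_case` is a hypothetical name, and the check that `time_window` holds ISO 8601 `start` / `end` strings is an assumption inferred from the example skeleton rather than stated in the table.

```python
from datetime import datetime
from typing import Any, Dict, List

# Required and optional fields with their JSON types, copied from the field table.
REQUIRED_FIELDS = {
    "case_id": str, "title": str, "scenario": str, "alert_type": str,
    "severity": str, "status": str, "time_window": dict, "summary": str,
    "alert_source": str, "entities": dict, "evidence": list,
    "investigation_steps": list, "conclusion": dict,
}
OPTIONAL_FIELDS = {
    "observables": dict, "related_refs": dict,
    "lessons_learned": list, "tags": list,
}
ENUMS = {
    "scenario": ("phishing", "o365_suspicious_login"),
    "severity": ("low", "medium", "high", "critical"),
    "status": ("confirmed", "false_positive", "pending"),
}


def validate_case(case: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the case passes."""
    errors: List[str] = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in case:
            errors.append(f"missing required field: {name}")
        elif not isinstance(case[name], expected):
            errors.append(f"{name}: expected {expected.__name__}")
    for name, expected in OPTIONAL_FIELDS.items():
        if name in case and not isinstance(case[name], expected):
            errors.append(f"{name}: expected {expected.__name__}")
    for name, allowed in ENUMS.items():
        value = case.get(name)
        if isinstance(value, str) and value not in allowed:
            errors.append(f"{name}: unexpected value {value!r}")
    # Assumption: time_window carries ISO 8601 "start"/"end", as in the example.
    window = case.get("time_window")
    if isinstance(window, dict):
        try:
            start = datetime.fromisoformat(window["start"])
            end = datetime.fromisoformat(window["end"])
            if start > end:
                errors.append("time_window: start is after end")
        except (KeyError, TypeError, ValueError):
            errors.append("time_window: needs ISO 8601 'start' and 'end'")
    return errors
```

Returning a list of messages instead of raising on the first problem lets a batch run report every issue in a mock file at once.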

## Mock KB / Playbook Raw Format

Each knowledge entry is stored as a single JSON file. Suggested file name:

```text
<doc_id>.json
```

### Field Definitions

| Field | Type | Required | Description |
|---|---|---:|---|
| `doc_id` | string | Yes | Unique document ID |
| `doc_type` | string | Yes | `kb` / `playbook` / `report_summary` |
| `title` | string | Yes | Title |
| `scenario` | string | Yes | Applicable scenario |
| `summary` | string | Yes | Core summary |
| `applicability` | array | No | Conditions under which the document applies |
| `key_points` | array | Yes | Core knowledge points |
| `investigation_guidance` | array | No | Investigation guidance |
| `decision_points` | array | No | Key decision points |
| `related_entities` | object | No | Related entities / TTPs / IOCs |
| `related_refs` | object | No | Related documents |
| `tags` | array | No | Tags |
| `updated_at` | string | No | Last update time |
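To make the table concrete, here is a hedged sketch of what one entry and a basic check might look like. Every value in `EXAMPLE_PLAYBOOK` is an illustrative placeholder (only the IDs `PB-PHISH-001` and `KB-PHISH-HEADER-CHECK` are borrowed from the case example), and `validate_doc` is a hypothetical helper that covers only the required fields and the `doc_type` values.

```python
from typing import Any, Dict, List

# Required fields and allowed doc_type values, copied from the field table.
REQUIRED_DOC_FIELDS = ("doc_id", "doc_type", "title", "scenario", "summary", "key_points")
DOC_TYPES = ("kb", "playbook", "report_summary")

# Hypothetical entry: all text values are placeholders, not real KB content.
EXAMPLE_PLAYBOOK: Dict[str, Any] = {
    "doc_id": "PB-PHISH-001",
    "doc_type": "playbook",
    "title": "Phishing email triage (illustrative)",
    "scenario": "phishing",
    "summary": "Placeholder summary for a phishing triage playbook.",
    "key_points": ["Placeholder key point."],
    "decision_points": ["Placeholder decision point."],
    "related_refs": {"kb": ["KB-PHISH-HEADER-CHECK"]},
    "tags": ["phishing"],
}


def validate_doc(doc: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the entry passes."""
    errors = [
        f"missing required field: {name}"
        for name in REQUIRED_DOC_FIELDS
        if name not in doc
    ]
    if "doc_type" in doc and doc["doc_type"] not in DOC_TYPES:
        errors.append(f"doc_type: unexpected value {doc['doc_type']!r}")
    if "key_points" in doc and not isinstance(doc["key_points"], list):
        errors.append("key_points: expected array")
    return errors
```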

## Normalization Output Targets

### Normalized Case Structure

Suggested output fields for the normalization script:

- `id`
- `memory_type` = `case`
- `scenario`
- `title`
- `abstract`
- `verdict`
- `severity`
- `entities`
- `observables`
- `evidence`
- `patterns`
- `related_refs`
- `source_path`
- `tags`
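The field list above could be produced from a raw mock case roughly as follows. This is a sketch under stated assumptions: the spec does not say where `abstract`, `verdict`, or `patterns` come from, so the mappings from `summary`, `conclusion.verdict`, and `lessons_learned` are guesses, and `normalize_case` is a hypothetical function name.

```python
from typing import Any, Dict


def normalize_case(raw: Dict[str, Any], source_path: str) -> Dict[str, Any]:
    """Map a raw mock case onto the suggested normalized fields."""
    return {
        "id": raw["case_id"],
        "memory_type": "case",
        "scenario": raw["scenario"],
        "title": raw["title"],
        # Assumption: the one-sentence summary doubles as the abstract.
        "abstract": raw["summary"],
        # Assumption: verdict is taken from conclusion.verdict.
        "verdict": raw["conclusion"].get("verdict"),
        "severity": raw["severity"],
        "entities": raw["entities"],
        "observables": raw.get("observables", {}),
        "evidence": list(raw["evidence"]),
        # Assumption: lessons_learned is the closest source for patterns.
        "patterns": list(raw.get("lessons_learned", [])),
        "related_refs": raw.get("related_refs", {}),
        "source_path": source_path,
        "tags": list(raw.get("tags", [])),
    }
```

Optional raw fields fall back to empty containers so that downstream code can rely on every normalized key being present.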

### Normalized KB Structure

Suggested output fields for the normalization script:

- `id`
- `memory_type` = `knowledge`
- `doc_type`
- `scenario`
- `title`
- `abstract`
- `key_points`
- `investigation_guidance`
- `decision_points`
- `related_refs`
- `source_path`
- `tags`
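A matching sketch for knowledge entries is shown below. As with cases, `normalize_kb` is a hypothetical name and the `summary` to `abstract` mapping is an assumption; fields that the raw format defines but the list above omits (`applicability`, `related_entities`, `updated_at`) are simply dropped here.

```python
from typing import Any, Dict


def normalize_kb(raw: Dict[str, Any], source_path: str) -> Dict[str, Any]:
    """Map a raw KB / playbook entry onto the suggested normalized fields."""
    return {
        "id": raw["doc_id"],
        "memory_type": "knowledge",
        "doc_type": raw["doc_type"],
        "scenario": raw["scenario"],
        "title": raw["title"],
        # Assumption: the core summary doubles as the abstract.
        "abstract": raw["summary"],
        "key_points": list(raw["key_points"]),
        # Optional raw fields default to empty containers.
        "investigation_guidance": list(raw.get("investigation_guidance", [])),
        "decision_points": list(raw.get("decision_points", [])),
        "related_refs": dict(raw.get("related_refs", {})),
        "source_path": source_path,
        "tags": list(raw.get("tags", [])),
    }
```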

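## Retrieval Test Suggestions

During the mock data phase, prioritize verifying:

- Whether a phishing case recalls the phishing playbook and similar phishing cases
- Whether an O365 suspicious login case recalls the suspicious-login KB and similar cases
- Whether true-positive and false-positive cases can be told apart, each retaining its own distinct patterns
- Whether the recalled results contain the key evidence / decision points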
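
One way to turn the first two checks into a number is recall at k, using a case's `related_refs` as the expected set. This is only a sketch: the spec does not prescribe a metric, the helper names `expected_refs` and `recall_at_k` are hypothetical, and the retrieved ID list is assumed to come from whatever retrieval API the POC exposes.

```python
from typing import Dict, List, Set


def expected_refs(related_refs: Dict[str, List[str]]) -> Set[str]:
    """Flatten a related_refs object (playbooks / kb / cases) into one ID set."""
    return {ref for refs in related_refs.values() for ref in refs}


def recall_at_k(retrieved_ids: List[str], expected_ids: Set[str], k: int) -> float:
    """Fraction of expected IDs that appear in the top-k retrieved IDs."""
    if not expected_ids:
        return 1.0  # assumption: nothing expected counts as a pass
    hits = expected_ids.intersection(retrieved_ids[:k])
    return len(hits) / len(expected_ids)


# Usage with the related_refs from the example case; the retrieved list is
# made up for illustration, including the hypothetical ID CASE-2026-0007.
refs = {"playbooks": ["PB-PHISH-001"], "kb": ["KB-PHISH-HEADER-CHECK"], "cases": []}
retrieved = ["PB-PHISH-001", "CASE-2026-0007", "KB-PHISH-HEADER-CHECK"]
score = recall_at_k(retrieved, expected_refs(refs), k=2)  # 0.5
```

Recall only covers whether the right documents come back; checking that the results carry the key evidence and decision points still needs an inspection of the returned content.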