# Sample Data Spec
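## Goals
This document defines the mock data format that the SOC Memory POC uses while no real data is available. The mock data is used to:
- validate the ingestion pipeline
- validate the normalization scripts
- validate context retrieval
- validate the case summary and memory commit flow

Only two scenarios are covered for now:
- phishing email
- O365 anomalous login / suspected account compromise
## Directory Conventions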
```text
evaluation/datasets/
├── mock_cases/
│   ├── phishing/
│   └── o365_suspicious_login/
└── mock_kb/
    ├── playbooks/
    ├── kb/
    └── reports/
```
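The layout above can be walked with a few lines of Python. This is a minimal sketch, assuming every mock case and KB entry is a `.json` file somewhere under `mock_cases/` or `mock_kb/`; the function name `iter_mock_files` is hypothetical and not part of the POC.

```python
import json
from pathlib import Path


def iter_mock_files(root):
    """Yield (group, path, payload) for each JSON file under mock_cases/ and mock_kb/."""
    for top in ("mock_cases", "mock_kb"):
        base = Path(root) / top
        if not base.is_dir():
            continue  # tolerate a partially populated dataset
        for path in sorted(base.rglob("*.json")):
            # The first path component below mock_cases/ or mock_kb/ names the
            # group, e.g. "phishing" or "playbooks". For a file placed directly
            # under the top-level folder this is the file name itself.
            group = path.relative_to(base).parts[0]
            with path.open(encoding="utf-8") as fh:
                yield group, path, json.load(fh)
```

Called as `iter_mock_files("evaluation/datasets")`, it returns a generator, so large datasets are read one file at a time.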
## Mock Case Raw Format
Each case is stored in its own JSON file. Suggested file name:
```text
<case_id>.json
```
### Field Definitions
| Field | Type | Required | Description |
|---|---|---:|---|
| `case_id` | string | Yes | Unique case ID |
| `title` | string | Yes | Short title |
| `scenario` | string | Yes | `phishing` / `o365_suspicious_login` |
| `alert_type` | string | Yes | Alert type |
| `severity` | string | Yes | `low` / `medium` / `high` / `critical` |
| `status` | string | Yes | `confirmed` / `false_positive` / `pending` |
| `time_window` | object | Yes | Start and end time |
| `summary` | string | Yes | One-sentence summary |
| `alert_source` | string | Yes | System that raised the alert |
| `entities` | object | Yes | Key entities |
| `observables` | object | No | IOCs / observables |
| `evidence` | array | Yes | List of key evidence |
| `investigation_steps` | array | Yes | Key investigation steps |
| `conclusion` | object | Yes | Analyst conclusion |
| `related_refs` | object | No | Related KB / playbook / case |
| `lessons_learned` | array | No | Reusable lessons |
| `tags` | array | No | Tags |
### Example Skeleton
```json
{
  "case_id": "CASE-2026-0001",
  "title": "Potential phishing email targeting finance user",
  "scenario": "phishing",
  "alert_type": "mail_suspicious_attachment",
  "severity": "high",
  "status": "confirmed",
  "time_window": {
    "start": "2026-04-01T09:10:00+08:00",
    "end": "2026-04-01T11:30:00+08:00"
  },
  "summary": "Finance user received an invoice-themed phishing email with a malicious HTML attachment.",
  "alert_source": "Secure Email Gateway",
  "entities": {
    "users": ["alice@corp.example"],
    "hosts": ["FIN-LAPTOP-12"],
    "mailboxes": ["alice@corp.example"]
  },
  "observables": {
    "sender_emails": ["billing@vendor-payments.com"],
    "domains": ["vendor-payments.com"],
    "urls": ["https://vendor-payments-login.com/review"],
    "hashes": ["sha256:..."],
    "ips": ["198.51.100.20"]
  },
  "evidence": [
    "The sender domain was newly observed and failed DMARC.",
    "The attachment redirected the user to a credential harvesting page."
  ],
  "investigation_steps": [
    "Validate sender reputation and authentication results.",
    "Detonate attachment in sandbox.",
    "Check click telemetry and account sign-in logs."
  ],
  "conclusion": {
    "verdict": "true_positive",
    "reason": "Multiple aligned phishing indicators and confirmed click behavior.",
    "recommended_actions": [
      "Reset the impacted account password.",
      "Block the sender domain and landing URL."
    ]
  },
  "related_refs": {
    "playbooks": ["PB-PHISH-001"],
    "kb": ["KB-PHISH-HEADER-CHECK"],
    "cases": []
  },
  "lessons_learned": [
    "Invoice-themed phishing remains effective against finance users."
  ],
  "tags": ["phishing", "email", "credential-harvest"]
}
```
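A raw case file can be checked against the field table before it enters the ingestion pipeline. The sketch below is one possible validator, not the POC's actual implementation: `validate_case` and the constant names are hypothetical, and the `time_window` check assumes ISO 8601 `start` / `end` keys as they appear in the example skeleton.

```python
from datetime import datetime

# Required and optional fields with their expected Python types (from the field table).
REQUIRED_CASE_FIELDS = {
    "case_id": str, "title": str, "scenario": str, "alert_type": str,
    "severity": str, "status": str, "time_window": dict, "summary": str,
    "alert_source": str, "entities": dict, "evidence": list,
    "investigation_steps": list, "conclusion": dict,
}
OPTIONAL_CASE_FIELDS = {
    "observables": dict, "related_refs": dict,
    "lessons_learned": list, "tags": list,
}
# Enumerated values listed in the field table.
ALLOWED_VALUES = {
    "scenario": {"phishing", "o365_suspicious_login"},
    "severity": {"low", "medium", "high", "critical"},
    "status": {"confirmed", "false_positive", "pending"},
}


def validate_case(case):
    """Return a list of problems; an empty list means the case passed."""
    errors = []
    for name, expected in REQUIRED_CASE_FIELDS.items():
        if name not in case:
            errors.append(f"missing required field: {name}")
        elif not isinstance(case[name], expected):
            errors.append(f"{name}: expected {expected.__name__}")
    for name, expected in OPTIONAL_CASE_FIELDS.items():
        if name in case and not isinstance(case[name], expected):
            errors.append(f"{name}: expected {expected.__name__}")
    for name, allowed in ALLOWED_VALUES.items():
        value = case.get(name)
        if isinstance(value, str) and value not in allowed:
            errors.append(f"{name}: unexpected value {value!r}")
    # Assumption: time_window holds ISO 8601 "start" and "end" strings.
    window = case.get("time_window")
    if isinstance(window, dict):
        try:
            start = datetime.fromisoformat(window["start"])
            end = datetime.fromisoformat(window["end"])
            if start > end:
                errors.append("time_window: start is after end")
        except (KeyError, TypeError, ValueError):
            errors.append("time_window: need ISO 8601 'start' and 'end'")
    return errors
```

Returning a list of messages instead of raising on the first problem makes it easier to report every issue in a hand-written mock file in one pass.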
## Mock KB / Playbook Raw Format
Each knowledge entry is stored in its own JSON file. Suggested file name:
```text
<doc_id>.json
```
### Field Definitions
| Field | Type | Required | Description |
|---|---|---:|---|
| `doc_id` | string | Yes | Unique document ID |
| `doc_type` | string | Yes | `kb` / `playbook` / `report_summary` |
| `title` | string | Yes | Title |
| `scenario` | string | Yes | Applicable scenario |
| `summary` | string | Yes | Core summary |
| `applicability` | array | No | Conditions under which the entry applies |
| `key_points` | array | Yes | Core knowledge points |
| `investigation_guidance` | array | No | Investigation guidance |
| `decision_points` | array | No | Key decision points |
| `related_entities` | object | No | Related entities / TTPs / IOCs |
| `related_refs` | object | No | Related documents |
| `tags` | array | No | Tags |
| `updated_at` | string | No | Last update time |
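The table can also be written down as a typed structure, which keeps the required/optional split visible in code. This is a sketch under assumptions: the names `KBDoc` and `missing_kb_fields` are hypothetical, array items are assumed to be strings, the object fields are assumed to map names to string lists, and every value in the sample entry is illustrative placeholder data (only the ID `PB-PHISH-001` is borrowed from the case example). It relies on `TypedDict.__required_keys__`, available in Python 3.9 and later.

```python
from typing import Dict, List, TypedDict


class _KBRequired(TypedDict):
    # Fields marked as required in the table above.
    doc_id: str
    doc_type: str  # "kb" / "playbook" / "report_summary"
    title: str
    scenario: str
    summary: str
    key_points: List[str]


class KBDoc(_KBRequired, total=False):
    # Fields marked as optional in the table above.
    applicability: List[str]
    investigation_guidance: List[str]
    decision_points: List[str]
    related_entities: Dict[str, List[str]]
    related_refs: Dict[str, List[str]]
    tags: List[str]
    updated_at: str


def missing_kb_fields(doc):
    """Return the required KB fields that are absent from doc, sorted by name."""
    return [name for name in sorted(KBDoc.__required_keys__) if name not in doc]


# Hypothetical sample entry; the values are placeholders, not real KB content.
sample: KBDoc = {
    "doc_id": "PB-PHISH-001",
    "doc_type": "playbook",
    "title": "Phishing triage playbook",
    "scenario": "phishing",
    "summary": "Placeholder summary of the triage flow.",
    "key_points": ["Check sender authentication results."],
    "tags": ["phishing"],
}
```

`missing_kb_fields(sample)` returns an empty list, while a document that only carries `doc_id` would report the other five required fields.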
## Normalization Output Targets
### Normalized Case Structure
Suggested output fields for the normalization script:
- `id`
- `memory_type` = `case`
- `scenario`
- `title`
- `abstract`
- `verdict`
- `severity`
- `entities`
- `observables`
- `evidence`
- `patterns`
- `related_refs`
- `source_path`
- `tags`
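One way to produce these fields from a raw mock case is sketched below. The spec does not say how raw fields map to normalized ones, so the mapping is an assumption throughout: `abstract` is taken from `summary`, `verdict` from `conclusion.verdict`, and `patterns` from `lessons_learned`. The function name `normalize_case` is hypothetical.

```python
def normalize_case(raw, source_path):
    """Map one raw mock case onto the suggested normalized case fields."""
    return {
        "id": raw["case_id"],
        "memory_type": "case",
        "scenario": raw["scenario"],
        "title": raw["title"],
        # Assumption: the one-sentence summary doubles as the abstract.
        "abstract": raw["summary"],
        # Assumption: the verdict lives under conclusion.verdict, as in the example.
        "verdict": raw["conclusion"].get("verdict"),
        "severity": raw["severity"],
        "entities": raw["entities"],
        # Optional raw fields fall back to empty containers.
        "observables": raw.get("observables", {}),
        "evidence": list(raw["evidence"]),
        # Assumption: reusable lessons are the closest raw source for patterns.
        "patterns": list(raw.get("lessons_learned", [])),
        "related_refs": raw.get("related_refs", {}),
        "source_path": str(source_path),
        "tags": list(raw.get("tags", [])),
    }
```

Required raw fields are read with `raw[...]` so that a malformed case fails loudly with a `KeyError` instead of producing a partially filled record.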
### Normalized KB Structure
Suggested output fields for the normalization script:
- `id`
- `memory_type` = `knowledge`
- `doc_type`
- `scenario`
- `title`
- `abstract`
- `key_points`
- `investigation_guidance`
- `decision_points`
- `related_refs`
- `source_path`
- `tags`
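The KB side can be normalized in the same spirit. Again a sketch with a hypothetical function name (`normalize_kb`); it assumes `abstract` is taken from the raw `summary` and that missing optional arrays become empty lists so downstream code can iterate without checking for absence.

```python
def normalize_kb(raw, source_path):
    """Map one raw KB / playbook entry onto the suggested normalized KB fields."""
    record = {
        "id": raw["doc_id"],
        "memory_type": "knowledge",
        "doc_type": raw["doc_type"],
        "scenario": raw["scenario"],
        "title": raw["title"],
        # Assumption: the raw summary is reused as the abstract.
        "abstract": raw["summary"],
        "key_points": list(raw["key_points"]),
    }
    # Optional arrays default to empty lists.
    for name in ("investigation_guidance", "decision_points"):
        record[name] = list(raw.get(name, []))
    record["related_refs"] = raw.get("related_refs", {})
    record["source_path"] = str(source_path)
    record["tags"] = list(raw.get("tags", []))
    return record
```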
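## Retrieval Test Suggestions
During the mock data phase, validate the following first:
- whether a phishing case recalls the phishing playbook and similar phishing cases
- whether an O365 anomalous-login case recalls the login-anomaly KB and similar cases
- whether true-positive and false-positive cases can be told apart while keeping their distinct patterns
- whether recalled results include the key evidence / decision points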
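These checks can be turned into a small report over a list of retrieved records. The sketch below assumes that both the query and the results are normalized records with the fields suggested earlier, and that "similar" can be approximated by "same scenario"; `check_recall` and the report keys are hypothetical names, and the retrieval call itself is out of scope.

```python
def check_recall(query, results):
    """Summarize which of the suggested retrieval checks a result list satisfies."""
    same_scenario = [r for r in results if r.get("scenario") == query["scenario"]]
    return {
        # Document types of recalled knowledge entries for this scenario,
        # e.g. expect "playbook" for phishing and "kb" for O365 login anomalies.
        "knowledge_doc_types": sorted(
            {r["doc_type"] for r in same_scenario
             if r.get("memory_type") == "knowledge" and r.get("doc_type")}
        ),
        # At least one other case from the same scenario was recalled.
        "has_similar_case": any(
            r.get("memory_type") == "case" and r.get("id") != query["id"]
            for r in same_scenario
        ),
        # Some result carries non-empty evidence or decision points.
        "has_evidence_or_decision_points": any(
            bool(r.get("evidence") or r.get("decision_points")) for r in results
        ),
        # Distinct verdicts among the results, to see whether true and false
        # positives both surface and stay distinguishable.
        "verdicts_seen": sorted({r["verdict"] for r in results if r.get("verdict")}),
    }
```

A test for a phishing query could then assert that `"playbook"` appears in `knowledge_doc_types` and that `has_similar_case` is true.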