# Email DLP
An Email Data Loss Prevention (DLP) proof-of-concept tool that scans `.eml` files for policy violations using local or remote Large Language Models (LLMs). It extracts email content, parses and converts attachments, and evaluates the consolidated data against security policies to determine data risk levels and remediation actions (`BLOCK`, `ALERT`, or `PASS`).
## Features
- **Parse & Extract**: Parses raw MIME `.eml` files to extract metadata (Sender, Recipient, Subject, Date) and body text.
- **Deep Attachment Conversion**: Automatically extracts attachments and converts them into text representations, including image extraction from **PDF** and **Office** documents and support for **.zip** / **.7z** archives.
- **Heuristic Policy Review**: A fast, deterministic local engine that evaluates content against signals derived directly from `policy.py` (PII, Financial, Source Code, etc.) using keyword matching and context boosts.
- **Preview Mode**: Inspect the parsed structure, metadata, and extracted text of emails and attachments without making any LLM calls.
- **Simulation Mode**: Run a fast, deterministic local simulation of the DLP analysis without hitting an external LLM, which is extremely useful for local testing or CI environments.
- **LLM-Powered Analysis**: Conduct full DLP analysis by passing the extracted content and a configured policy prompt to an OpenAI-compatible LLM endpoint (defaulting to local vLLM).
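The parse-and-extract step can be sketched with Python's standard-library `email` package; the field names below are illustrative, not necessarily the tool's actual schema:

```python
import email
from email import policy

RAW_EML = b"""\
From: alice@example.com
To: bob@example.com
Subject: Quarterly figures
Date: Mon, 01 Jan 2024 09:00:00 +0000
Content-Type: text/plain

Please find the numbers attached.
"""

def parse_eml(raw: bytes) -> dict:
    """Extract the metadata and body text the DLP pipeline consumes."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain", "html"))
    return {
        "sender": msg["From"],
        "recipient": msg["To"],
        "subject": msg["Subject"],
        "date": msg["Date"],
        "body": body.get_content() if body is not None else "",
    }

meta = parse_eml(RAW_EML)
print(meta["subject"])  # Quarterly figures
```

Real `.eml` files are usually multipart; `policy.default` gives the modern `EmailMessage` API, whose `get_body` walks the MIME tree to pick the best body part.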
## Installation
This project requires **Python 3.11+** and relies on [`uv`](https://github.com/astral-sh/uv) for dependency management.
```bash
# Install dependencies
uv sync
```
## Configuration
You can use a `.env` file to manage your LLM credentials and endpoint. Copy the provided example to get started:
```bash
cp .env.example .env
```
The supported environment variables are:
- `OPENAI_BASE_URL`: The API endpoint (e.g., `http://localhost:8000/v1`).
- `OPENAI_API_KEY`: Your API key (use `not-needed` for local vLLM).
- `MODEL_NAME`: The model name to use for analysis (e.g. `Qwen/Qwen3.5-35B-A3B`).
If these variables are set in `.env`, they will be used as defaults, although you can still override them using CLI flags like `--endpoint` and `--model`.
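Assuming a local vLLM server, a filled-in `.env` might look like this (values illustrative):

```bash
OPENAI_BASE_URL=http://localhost:8000/v1
OPENAI_API_KEY=not-needed
MODEL_NAME=Qwen/Qwen3.5-35B-A3B
```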
## Usage
The CLI is built with Typer and provides three main commands: `preview`, `simulate`, and `analyze`.
You can view the main help menu via:
```bash
uv run email-dlp --help
```
### 1. Preview
Preview the parsed email content, converted attachment text, and the system prompt that would be sent to the LLM.
```bash
uv run email-dlp preview \
--input data \
--output preview-output \
--include-system-prompt
```
### 2. Simulate
Execute a local simulation of the DLP parsing and analysis pipeline. This skips the LLM call entirely and produces mock evaluation results, written to timestamped directories such as `output/<timestamp>`.
```bash
uv run email-dlp simulate \
--input data \
--output output/simulated \
--summary
```
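The deterministic local review that simulation mode relies on can be approximated as keyword matching with context boosts. The signal lists, weights, and thresholds below are invented for illustration; the real engine derives its signals from `policy.py`:

```python
# Hypothetical signal definitions; the actual tool derives these from policy.py.
SIGNALS = {
    "PII": {
        "keywords": {"ssn", "passport", "date of birth"},
        "boosts": {"employee", "customer"},
    },
    "FINANCIAL": {
        "keywords": {"iban", "invoice", "account number"},
        "boosts": {"confidential"},
    },
}

def heuristic_review(text: str) -> dict:
    """Score each signal by keyword hits, plus a bonus for contextual terms."""
    lowered = text.lower()
    scores = {}
    for name, sig in SIGNALS.items():
        hits = sum(kw in lowered for kw in sig["keywords"])
        boost = sum(b in lowered for b in sig["boosts"])
        scores[name] = hits + 0.5 * boost
    # Map the top score onto a coarse action, mirroring BLOCK/ALERT/PASS.
    top = max(scores.values(), default=0)
    action = "BLOCK" if top >= 2 else "ALERT" if top >= 1 else "PASS"
    return {"scores": scores, "action": action}

result = heuristic_review("Confidential: customer SSN and passport copies attached")
print(result["action"])  # BLOCK
```

Because there is no model call, the same input always yields the same verdict, which is what makes this mode suitable for CI.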
### 3. Analyze
Perform real DLP evaluation using an LLM. By default, it targets a local OpenAI-compatible endpoint. It calculates risk scores, identifies specific violation types, and determines policy actions.
```bash
uv run email-dlp analyze \
--input data \
--output output/llm-preds \
--endpoint "http://localhost:8000/v1" \
--model "Qwen/Qwen3.5-35B-A3B" \
--summary
```
## Outputs
Both analysis and simulation runs create one `.json` file per analyzed `.eml` file in the specified output directory.
Additionally, a `batch_summary.json` will be generated, containing:
- Total number of emails processed
- Breakdown of actions taken (`BLOCK`, `ALERT`, `PASS`)
- Distribution of identified risk levels (`CRITICAL`, `HIGH`, `MEDIUM`, `LOW`)
- An array of high-level per-email summaries, including the policy hits detected
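As a rough sketch of how such a summary could be assembled from per-email results (the keys here are illustrative, not the tool's exact schema):

```python
from collections import Counter

# Hypothetical per-email results, shaped like the fields described above.
results = [
    {"file": "a.eml", "action": "BLOCK", "risk_level": "CRITICAL"},
    {"file": "b.eml", "action": "PASS", "risk_level": "LOW"},
    {"file": "c.eml", "action": "ALERT", "risk_level": "MEDIUM"},
]

batch_summary = {
    "total_emails": len(results),
    "actions": dict(Counter(r["action"] for r in results)),
    "risk_levels": dict(Counter(r["risk_level"] for r in results)),
    "emails": [{"file": r["file"], "action": r["action"]} for r in results],
}
print(batch_summary["actions"])  # {'BLOCK': 1, 'PASS': 1, 'ALERT': 1}
```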
## Architecture Highlights
- Uses `markitdown`, `py7zr`, and `PyMuPDF` for deep file-parsing, archive unpacking, and image extraction.
- Validation and schema enforcement are handled via `pydantic`.
- Rich console outputs (spinners, progress bars, and formatted tables) using `rich`.
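The project uses `pydantic` for schema enforcement; as a dependency-free approximation, the kind of schema it validates can be sketched with a stdlib dataclass (field names and allowed values here are illustrative):

```python
from dataclasses import dataclass

ALLOWED_ACTIONS = {"BLOCK", "ALERT", "PASS"}
ALLOWED_RISKS = {"CRITICAL", "HIGH", "MEDIUM", "LOW"}

@dataclass(frozen=True)
class EvaluationResult:
    """Validated shape of a single DLP evaluation (illustrative fields)."""
    action: str
    risk_level: str
    violations: tuple[str, ...] = ()

    def __post_init__(self) -> None:
        # Reject values outside the policy vocabulary, as a pydantic model would.
        if self.action not in ALLOWED_ACTIONS:
            raise ValueError(f"invalid action: {self.action!r}")
        if self.risk_level not in ALLOWED_RISKS:
            raise ValueError(f"invalid risk level: {self.risk_level!r}")

ok = EvaluationResult(action="ALERT", risk_level="MEDIUM", violations=("PII",))
print(ok.action)  # ALERT
```

Validating at the boundary like this means a malformed LLM response fails loudly at parse time rather than silently producing a bogus `batch_summary.json`.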