# Email DLP

An Email Data Loss Prevention (DLP) proof-of-concept tool that scans `.eml` files for policy violations using local or remote Large Language Models (LLMs). It extracts email content, parses and converts attachments, and evaluates the consolidated data against security policies to determine data risk levels and remediation actions (`BLOCK`, `ALERT`, or `PASS`).
## Features

- Parse & Extract: Parses raw MIME `.eml` files to extract metadata (Sender, Recipient, Subject, Date) and body text.
- Deep Attachment Conversion: Automatically extracts and converts attachments into text representations. Now includes extraction of images from PDF and Office documents, and supports `.zip` / `.7z` archives.
- Heuristic Policy Review: A fast, deterministic local engine that evaluates content against signals derived directly from `policy.py` (PII, Financial, Source Code, etc.) using keyword matching and context boosts.
- Preview Mode: Inspect the parsed structure, metadata, and extracted text of emails and attachments without making any LLM calls.
- Simulation Mode: Run a fast, deterministic local simulation of the DLP analysis without hitting an external LLM, which is especially useful for local testing and CI environments.
- LLM-Powered Analysis: Conduct full DLP analysis by passing the extracted content and a configured policy prompt to an OpenAI-compatible LLM endpoint (defaulting to local vLLM).
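The heuristic policy review described above can be sketched as a simple keyword scorer. This is a minimal illustration of keyword matching with context boosts; the signal names, keywords, and weights are hypothetical, not the actual contents of `policy.py`:

```python
# Minimal sketch of a heuristic policy check: keyword hits plus context boosts.
# Categories, keywords, and weights here are illustrative assumptions.
SIGNALS = {
    "PII": {"keywords": ["ssn", "passport", "date of birth"], "boosts": ["confidential"]},
    "FINANCIAL": {"keywords": ["iban", "routing number", "invoice"], "boosts": ["wire transfer"]},
}

def score_content(text: str) -> dict[str, int]:
    """Return a per-category score: 1 point per keyword hit, +2 per context boost."""
    lower = text.lower()
    scores: dict[str, int] = {}
    for category, signal in SIGNALS.items():
        hits = sum(1 for kw in signal["keywords"] if kw in lower)
        if hits:
            # Boost terms only matter when at least one keyword already matched.
            hits += sum(2 for b in signal["boosts"] if b in lower)
        scores[category] = hits
    return scores
```

Because the scoring is deterministic, the same input always yields the same result, which is what makes this engine usable in CI without an LLM.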
## Installation

This project requires Python 3.11+ and relies on `uv` for dependency management.

```shell
# Install dependencies
uv sync
```
## Configuration

You can use a `.env` file to manage your LLM credentials and endpoint. Copy the provided example to get started:

```shell
cp .env.example .env
```

The supported environment variables are:

- `OPENAI_BASE_URL`: The API endpoint (e.g. `http://localhost:8000/v1`).
- `OPENAI_API_KEY`: Your API key (use `not-needed` for local vLLM).
- `MODEL_NAME`: The model name to use for analysis (e.g. `Qwen/Qwen3.5-35B-A3B`).

If these variables are set in `.env`, they are used as defaults, although you can still override them using CLI flags like `--endpoint` and `--model`.
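Putting the three variables together, a `.env` for a local vLLM server might look like this (values taken from the examples above):

```dotenv
OPENAI_BASE_URL=http://localhost:8000/v1
OPENAI_API_KEY=not-needed
MODEL_NAME=Qwen/Qwen3.5-35B-A3B
```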
## Usage

The CLI is built with Typer and provides three main commands: `preview`, `simulate`, and `analyze`.

You can view the main help menu via:

```shell
uv run email-dlp --help
```
### 1. Preview

Preview the parsed email content, converted attachment text, and the system prompt that would be sent to the LLM.

```shell
uv run email-dlp preview \
  --input data \
  --output preview-output \
  --include-system-prompt
```
### 2. Simulate

Execute a local simulation of the DLP parsing and analysis pipeline. This skips the LLM call entirely and returns mock evaluation results, writing to directories like `output/<timestamp>`.

```shell
uv run email-dlp simulate \
  --input data \
  --output output/simulated \
  --summary
```
### 3. Analyze

Perform a real DLP evaluation using an LLM. By default, it targets a local OpenAI-compatible endpoint. It calculates risk scores, identifies specific violation types, and determines policy actions.

```shell
uv run email-dlp analyze \
  --input data \
  --output output/llm-preds \
  --endpoint "http://localhost:8000/v1" \
  --model "Qwen/Qwen3.5-35B-A3B" \
  --summary
```
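Conceptually, the analyze step assembles the extracted email text and the policy prompt into a standard chat-completion request. The sketch below shows one plausible shape for that payload; the prompt wording, field choices, and use of `response_format` are assumptions for illustration, not the tool's actual request:

```python
# Illustrative sketch of a chat-completion payload for an OpenAI-compatible
# endpoint. The prompt text and structure are assumptions, not the real payload.
def build_dlp_request(model: str, policy_prompt: str, email_text: str) -> dict:
    """Assemble a request asking the model for a structured DLP verdict."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": policy_prompt},  # the configured policy prompt
            {"role": "user", "content": email_text},       # parsed email + attachment text
        ],
        # Ask for a JSON object so the verdict can be parsed deterministically.
        "response_format": {"type": "json_object"},
    }

payload = build_dlp_request(
    model="Qwen/Qwen3.5-35B-A3B",
    policy_prompt="You are a DLP reviewer. Return JSON with risk_level and action.",
    email_text="Subject: payroll\n\nAttached: employee bank details.",
)
```

Keeping request construction separate from the network call is also what makes the simulate command possible: the same pipeline runs, only the final call is replaced with a mock.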
## Outputs

Analysis and simulation runs will create `.json` files in your specified output directory for every analyzed `.eml` file.

Additionally, a `batch_summary.json` will be generated, containing:

- Total number of emails processed
- Breakdown of actions taken (`BLOCK`, `ALERT`, `PASS`)
- Distribution of identified risk levels (`CRITICAL`, `HIGH`, `MEDIUM`, `LOW`)
- Array of high-level email summaries with hit properties
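Based on the fields listed above, a `batch_summary.json` might look roughly like the following (the exact key names and counts are illustrative, not the tool's actual schema):

```json
{
  "total_emails": 25,
  "actions": {"BLOCK": 3, "ALERT": 7, "PASS": 15},
  "risk_levels": {"CRITICAL": 1, "HIGH": 4, "MEDIUM": 6, "LOW": 14},
  "emails": [
    {"file": "example.eml", "action": "ALERT", "risk_level": "MEDIUM", "hits": ["PII"]}
  ]
}
```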
## Architecture Highlights

- Uses `markitdown`, `py7zr`, and `PyMuPDF` for deep file parsing, archive unpacking, and image extraction.
- Validation and schema enforcement are handled via `pydantic`.
- Rich console output (spinners, progress bars, and formatted tables) using `rich`.
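As a sketch of how `pydantic` can enforce the verdict schema, the model below validates the action and risk-level values mentioned earlier. The class and field names are hypothetical, not the project's actual models:

```python
# Hypothetical pydantic models enforcing the DLP verdict schema; names are
# illustrative, not the project's actual code.
from enum import Enum
from pydantic import BaseModel

class Action(str, Enum):
    BLOCK = "BLOCK"
    ALERT = "ALERT"
    PASS = "PASS"

class RiskLevel(str, Enum):
    CRITICAL = "CRITICAL"
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"

class EmailVerdict(BaseModel):
    """Validated DLP verdict for a single .eml file."""
    file: str
    action: Action
    risk_level: RiskLevel
    violations: list[str] = []

# An out-of-range action or risk level raises a ValidationError instead of
# silently propagating a malformed LLM response.
verdict = EmailVerdict(file="example.eml", action="ALERT", risk_level="MEDIUM")
```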