# Email DLP

An Email Data Loss Prevention (DLP) proof-of-concept tool that scans `.eml` files for policy violations using local or remote Large Language Models (LLMs). It extracts email content, parses and converts attachments, and evaluates the consolidated data against security policies to determine data risk levels and remediation actions (`BLOCK`, `ALERT`, or `PASS`).
## Features

- Parse & Extract: Parses raw MIME `.eml` files to extract metadata (Sender, Recipient, Subject, Date) and body text.
- Deep Attachment Conversion: Automatically extracts and converts attachments into text representations. Now includes extraction of images from PDF and Office documents, and supports `.zip` / `.7z` archives.
- Heuristic Policy Review: A fast, deterministic local engine that evaluates content against signals derived directly from `policy.py` (PII, Financial, Source Code, etc.) using keyword matching and context boosts.
- Preview Mode: Inspect the parsed structure, metadata, and extracted text of emails and attachments without making any LLM calls.
- Simulation Mode: Run a fast, deterministic local simulation of the DLP analysis without hitting an external LLM, which is especially useful for local testing and CI environments.
- LLM-Powered Analysis: Conduct full DLP analysis by passing the extracted content and a configured policy prompt to an OpenAI-compatible LLM endpoint (defaulting to local vLLM).
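The heuristic policy review described above can be sketched as a simple keyword scorer. This is a minimal illustration of keyword matching with context boosts; the signal names, keywords, and weights are hypothetical, not the actual contents of `policy.py`:

```python
# Minimal sketch of a heuristic policy check: keyword hits plus context boosts.
# Categories, keywords, and weights here are illustrative assumptions.
SIGNALS = {
    "PII": {"keywords": ["ssn", "passport", "date of birth"], "boosts": ["confidential"]},
    "FINANCIAL": {"keywords": ["iban", "routing number", "invoice"], "boosts": ["wire transfer"]},
}

def score_content(text: str) -> dict[str, int]:
    """Return a per-category score: 1 point per keyword hit, +2 per context boost."""
    lower = text.lower()
    scores: dict[str, int] = {}
    for category, signal in SIGNALS.items():
        hits = sum(1 for kw in signal["keywords"] if kw in lower)
        if hits:
            # Boost terms only matter when at least one keyword already matched.
            hits += sum(2 for b in signal["boosts"] if b in lower)
        scores[category] = hits
    return scores
```

Because the scoring is deterministic, the same input always yields the same result, which is what makes this engine usable in CI without an LLM.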
## Installation

This project requires Python 3.11+ and relies on `uv` for dependency management.

```shell
# Install dependencies
uv sync
```
## Configuration

You can use a `.env` file to manage your LLM credentials and endpoint. Copy the provided example to get started:

```shell
cp .env.example .env
```

The supported environment variables are:

- `OPENAI_BASE_URL`: The API endpoint (e.g. `http://localhost:8000/v1`).
- `OPENAI_API_KEY`: Your API key (use `not-needed` for local vLLM).
- `MODEL_NAME`: The model name to use for analysis (e.g. `Qwen/Qwen3.5-35B-A3B`).

If these variables are set in `.env`, they are used as defaults, although you can still override them using CLI flags like `--endpoint` and `--model`.
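Putting the three variables together, a `.env` for a local vLLM server might look like this (values taken from the examples above):

```dotenv
OPENAI_BASE_URL=http://localhost:8000/v1
OPENAI_API_KEY=not-needed
MODEL_NAME=Qwen/Qwen3.5-35B-A3B
```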
## Usage

The CLI is built with Typer and provides three main commands: `preview`, `simulate`, and `analyze`.

You can view the main help menu via:

```shell
uv run email-dlp --help
```
### 1. Preview

Preview the parsed email content, converted attachment text, and the system prompt that would be sent to the LLM.

```shell
uv run email-dlp preview \
  --input data \
  --output preview-output \
  --include-system-prompt
```
### 2. Simulate

Execute a local simulation of the DLP parsing and analysis pipeline. This skips the LLM call entirely and returns mock evaluation results, writing to directories like `output/<timestamp>`.

```shell
uv run email-dlp simulate \
  --input data \
  --output output/simulated \
  --summary
```
### 3. Analyze

Perform a real DLP evaluation using an LLM. By default, it targets a local OpenAI-compatible endpoint. It calculates risk scores, identifies specific violation types, and determines policy actions.

```shell
uv run email-dlp analyze \
  --input data \
  --output output/llm-preds \
  --endpoint "http://localhost:8000/v1" \
  --model "Qwen/Qwen3.5-35B-A3B" \
  --summary
```
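Conceptually, the analyze step assembles the extracted email text and the policy prompt into a standard chat-completion request. The sketch below shows one plausible shape for that payload; the prompt wording, field choices, and use of `response_format` are assumptions for illustration, not the tool's actual request:

```python
# Illustrative sketch of a chat-completion payload for an OpenAI-compatible
# endpoint. The prompt text and structure are assumptions, not the real payload.
def build_dlp_request(model: str, policy_prompt: str, email_text: str) -> dict:
    """Assemble a request asking the model for a structured DLP verdict."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": policy_prompt},  # the configured policy prompt
            {"role": "user", "content": email_text},       # parsed email + attachment text
        ],
        # Ask for a JSON object so the verdict can be parsed deterministically.
        "response_format": {"type": "json_object"},
    }

payload = build_dlp_request(
    model="Qwen/Qwen3.5-35B-A3B",
    policy_prompt="You are a DLP reviewer. Return JSON with risk_level and action.",
    email_text="Subject: payroll\n\nAttached: employee bank details.",
)
```

Keeping request construction separate from the network call is also what makes the simulate command possible: the same pipeline runs, only the final call is replaced with a mock.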
## Outputs

Analysis and simulation runs will create `.json` files in your specified output directory for every analyzed `.eml` file.

Additionally, a `batch_summary.json` will be generated, containing:

- Total number of emails processed
- Breakdown of actions taken (`BLOCK`, `ALERT`, `PASS`)
- Distribution of identified risk levels (`CRITICAL`, `HIGH`, `MEDIUM`, `LOW`)
- Array of high-level email summaries with hit properties
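Based on the fields listed above, a `batch_summary.json` might look roughly like the following (the exact key names and counts are illustrative, not the tool's actual schema):

```json
{
  "total_emails": 25,
  "actions": {"BLOCK": 3, "ALERT": 7, "PASS": 15},
  "risk_levels": {"CRITICAL": 1, "HIGH": 4, "MEDIUM": 6, "LOW": 14},
  "emails": [
    {"file": "example.eml", "action": "ALERT", "risk_level": "MEDIUM", "hits": ["PII"]}
  ]
}
```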
## Architecture Highlights

- Uses `markitdown`, `py7zr`, and `PyMuPDF` for deep file parsing, archive unpacking, and image extraction.
- Validation and schema enforcement are handled via `pydantic`.
- Rich console output (spinners, progress bars, and formatted tables) using `rich`.
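As a sketch of how `pydantic` can enforce the verdict schema, the model below validates the action and risk-level values mentioned earlier. The class and field names are hypothetical, not the project's actual models:

```python
# Hypothetical pydantic models enforcing the DLP verdict schema; names are
# illustrative, not the project's actual code.
from enum import Enum
from pydantic import BaseModel

class Action(str, Enum):
    BLOCK = "BLOCK"
    ALERT = "ALERT"
    PASS = "PASS"

class RiskLevel(str, Enum):
    CRITICAL = "CRITICAL"
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"

class EmailVerdict(BaseModel):
    """Validated DLP verdict for a single .eml file."""
    file: str
    action: Action
    risk_level: RiskLevel
    violations: list[str] = []

# An out-of-range action or risk level raises a ValidationError instead of
# silently propagating a malformed LLM response.
verdict = EmailVerdict(file="example.eml", action="ALERT", risk_level="MEDIUM")
```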