Files
EverOS/docs/locomo_benchmark.md
Elliot Chen 518b8eca85 chore: initialize EverOS 1.0.0
md-first memory extraction framework for AI agents.

Markdown is the single source of truth; SQLite holds state and LanceDB
provides the rebuildable vector + BM25 + scalar index. The codebase follows
a single-direction DDD layering (entrypoints -> service -> memory -> infra,
with component / core / config cross-cutting) enforced by import-linter.

Engineering surface:
- Coding conventions in .claude/rules/ (path-scoped) and workflows in
  .claude/skills/ (/commit, /new-branch, /pr).
- GitHub Actions CI runs make lint + test + integration; pre-commit mirrors
  the gates locally (ruff, hygiene hooks, gitlint commit-msg).
- Commit messages follow Conventional Commits, enforced by gitlint.
- make lint also enforces datetime two-zone discipline and OpenAPI drift.
2026-06-06 07:33:17 +08:00

127 lines
3.2 KiB
Markdown

# Running the LoCoMo Benchmark
This guide walks through reproducing EverOS's LoCoMo retrieval scores
locally using the `hybrid` and `agentic` search methods.
## Contents
- [Prerequisites](#prerequisites)
- [1. Prepare the dataset](#1-prepare-the-dataset)
- [2. Start the server](#2-start-the-server)
- [3. Run `hybrid`](#3-run-hybrid)
- [4. Run `agentic`](#4-run-agentic)
- [5. Where the results land](#5-where-the-results-land)
- [Notes](#notes)
---
## Prerequisites
- Python **3.12**, [uv](https://docs.astral.sh/uv/)
- A `.env` at the repo root with the LLM / embedding credentials EverOS
needs:
- `EVEROS_LLM__MODEL`, `EVEROS_LLM__API_KEY`, `EVEROS_LLM__BASE_URL`
- `EVEROS_EMBEDDING__*`
- `EVEROS_RERANK__*`
- The benchmark driver also reads `LLM_API_KEY` / `ANSWER_MODEL` /
`JUDGE_MODEL` for the answer + judge passes.
Install the project:
```bash
uv sync
```
## 1. Prepare the dataset
Place the LoCoMo file at `data/locomo10.json` (the dataset is
distributed by the LoCoMo authors, not this repo). Override the path
later with `--data-path` if you keep it elsewhere.
## 2. Start the server
```bash
EVEROS_MEMORY__ROOT=~/.everos \
uv run python -m everos.entrypoints.cli.main server start --port 8000
```
`EVEROS_MEMORY__ROOT` isolates one benchmark's corpus from another —
change it (or `rm -rf` it) whenever you want a clean run.
Leave the server running in one terminal; run the benchmark from
another.
## 3. Run `hybrid`
Single conversation:
```bash
bash tests/run_locomo_batch.sh \
--conv-indices 0 \
--methods hybrid \
--base-url http://localhost:8000 \
--top-k 10
```
All 10 conversations, 2-way parallel:
```bash
bash tests/run_locomo_batch.sh \
--conv-indices 0-9 \
--methods hybrid \
--base-url http://localhost:8000 \
--top-k 10 \
--concurrency 2
```
The wrapper picks up `EVEROS_MEMORY__ROOT` from the environment so the
cascade poll path matches the server's data root. If you set them
differently, pass `--corpus-path` explicitly.
## 4. Run `agentic`
Same wrapper, swap `--methods`:
```bash
bash tests/run_locomo_batch.sh \
--conv-indices 0-9 \
--methods agentic \
--base-url http://localhost:8000 \
--top-k 10 \
--concurrency 2
```
You can also benchmark multiple methods in one go — they share the
same ingested corpus:
```bash
bash tests/run_locomo_batch.sh \
--conv-indices 0-9 \
--methods hybrid,agentic \
--base-url http://localhost:8000 \
--top-k 10 \
--concurrency 2
```
## 5. Where the results land
Default output root is `benchmark_results/run_<timestamp>/`. Override
with `--output-root`:
```
<output_root>/
├── conv0.json … conv9.json # per-conv summary + per-question details
├── conv0.log … conv9.log # per-conv stdout (only in --concurrency >1 mode)
└── conv0_checkpoints/ … # incremental search/answer/eval JSON
```
An aggregate accuracy table prints at the end of the wrapper run.
## Notes
- **Re-running on the same corpus**: add `--skip-add` to skip ingest and
reuse what's already in `~/.everos`. Useful when comparing methods
side by side.
- **Judge variance**: `--judge-runs 3` runs the judge three times per
question and majority-votes; slower but reduces LLM-judge noise.