Running the LoCoMo Benchmark
This guide walks through reproducing EverOS's LoCoMo retrieval scores
locally using the hybrid and agentic search methods.
Contents
- Prerequisites
- 1. Prepare the dataset
- 2. Start the server
- 3. Run hybrid
- 4. Run agentic
- 5. Where the results land
- Notes
Prerequisites
- Python 3.12, uv
- A .env at the repo root with the LLM / embedding credentials EverOS needs:
  - EVEROS_LLM__MODEL, EVEROS_LLM__API_KEY, EVEROS_LLM__BASE_URL
  - EVEROS_EMBEDDING__*
  - EVEROS_RERANK__*
- The benchmark driver also reads LLM_API_KEY / ANSWER_MODEL / JUDGE_MODEL for the answer + judge passes.
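Before starting a long run it can help to confirm that the variables listed above are actually set. The following is a minimal sketch, not part of EverOS: `REQUIRED` and `missing_vars` are hypothetical names, the check treats every listed name as mandatory (an assumption), and it only inspects a mapping such as `os.environ`, so a `.env` file has to be loaded into the environment first.

```python
import os
from collections.abc import Mapping

# Names copied from the prerequisites list; entries ending in "*" are
# treated as prefix patterns (at least one matching variable must be set).
REQUIRED = [
    "EVEROS_LLM__MODEL",
    "EVEROS_LLM__API_KEY",
    "EVEROS_LLM__BASE_URL",
    "EVEROS_EMBEDDING__*",
    "EVEROS_RERANK__*",
    "LLM_API_KEY",
    "ANSWER_MODEL",
    "JUDGE_MODEL",
]


def missing_vars(env: Mapping[str, str]) -> list[str]:
    """Return the names (or prefix patterns) that have no non-empty value in env."""
    missing = []
    for name in REQUIRED:
        if name.endswith("*"):
            prefix = name[:-1]
            found = any(k.startswith(prefix) and v for k, v in env.items())
        else:
            found = bool(env.get(name))
        if not found:
            missing.append(name)
    return missing


# Example: check the current process environment.
# print(missing_vars(os.environ) or "all set")
```

An empty list means every name (and at least one variable per prefix) has a non-empty value.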
Install the project:
uv sync
1. Prepare the dataset
Place the LoCoMo file at data/locomo10.json (the dataset is
distributed by the LoCoMo authors, not this repo). Override the path
later with --data-path if you keep it elsewhere.
2. Start the server
EVEROS_MEMORY__ROOT=~/.everos \
uv run python -m everos.entrypoints.cli.main server start --port 8000
EVEROS_MEMORY__ROOT isolates one benchmark's corpus from another:
change it (or rm -rf the directory it points to) whenever you want a clean run.
Leave the server running in one terminal; run the benchmark from another.
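If the benchmark is launched from a script rather than by hand, it may be useful to wait until the server accepts connections before starting. The sketch below only checks that a TCP connection to the port succeeds; it assumes nothing about EverOS's HTTP routes, and `wait_for_port` is a hypothetical helper name.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0,
                  interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example, matching the --port 8000 used above:
# if not wait_for_port("localhost", 8000):
#     raise SystemExit("server did not come up on port 8000")
```

A successful connection shows only that the port is open, not that the server has finished initialising.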
3. Run hybrid
Single conversation:
bash tests/run_locomo_batch.sh \
--conv-indices 0 \
--methods hybrid \
--base-url http://localhost:8000 \
--top-k 10
All 10 conversations, 2-way parallel:
bash tests/run_locomo_batch.sh \
--conv-indices 0-9 \
--methods hybrid \
--base-url http://localhost:8000 \
--top-k 10 \
--concurrency 2
The wrapper picks up EVEROS_MEMORY__ROOT from the environment so the
cascade poll path matches the server's data root. If the server and the
wrapper's shell use different values, pass --corpus-path explicitly.
4. Run agentic
Same wrapper, swap --methods:
bash tests/run_locomo_batch.sh \
--conv-indices 0-9 \
--methods agentic \
--base-url http://localhost:8000 \
--top-k 10 \
--concurrency 2
To benchmark several methods in one run, pass them as a comma-separated list; they share the same ingested corpus:
bash tests/run_locomo_batch.sh \
--conv-indices 0-9 \
--methods hybrid,agentic \
--base-url http://localhost:8000 \
--top-k 10 \
--concurrency 2
5. Where the results land
Results are written under benchmark_results/run_<timestamp>/ by default;
override the location with --output-root. The output root is laid out as follows:
<output_root>/
├── conv0.json … conv9.json # per-conv summary + per-question details
├── conv0.log … conv9.log # per-conv stdout (only in --concurrency >1 mode)
└── conv0_checkpoints/ … # incremental search/answer/eval JSON
An aggregate accuracy table prints at the end of the wrapper run.
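To post-process a run, the files in the layout above can be grouped by conversation index. The sketch below relies only on the file names shown in the tree and makes no assumption about the JSON schema inside them; `list_conv_outputs` is a hypothetical helper.

```python
import re
from pathlib import Path

# Matches conv<N>.json, conv<N>.log and conv<N>_checkpoints.
_CONV = re.compile(r"^conv(\d+)(\.json|\.log|_checkpoints)$")
_KINDS = {".json": "summary", ".log": "log", "_checkpoints": "checkpoints"}


def list_conv_outputs(output_root: str) -> dict[int, dict[str, Path]]:
    """Map each conversation index to the output paths found for it."""
    found: dict[int, dict[str, Path]] = {}
    for entry in Path(output_root).iterdir():
        match = _CONV.match(entry.name)
        if match:
            index = int(match.group(1))
            found.setdefault(index, {})[_KINDS[match.group(2)]] = entry
    return dict(sorted(found.items()))
```

A conversation with no "log" entry is expected when the run did not use --concurrency greater than 1.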
Notes
- Re-running on the same corpus: add --skip-add to skip ingest and reuse what's already in ~/.everos. Useful when comparing methods side by side.
- Judge variance: --judge-runs 3 runs the judge three times per question and majority-votes; slower but reduces LLM-judge noise.
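The majority vote mentioned in the last note can be illustrated as follows. This is a generic sketch, not the benchmark driver's code: it assumes each judge run yields a boolean correct/incorrect verdict, and `majority_vote` is a hypothetical name.

```python
from collections import Counter


def majority_vote(verdicts: list[bool]) -> bool:
    """Return the verdict given by more than half of the judge runs."""
    if not verdicts:
        raise ValueError("need at least one judge verdict")
    winner, count = Counter(verdicts).most_common(1)[0]
    if count * 2 <= len(verdicts):
        # Only possible with an even number of runs.
        raise ValueError("no strict majority; use an odd number of runs")
    return winner


# Three judge runs on one question: two say correct, one says incorrect.
print(majority_vote([True, False, True]))  # True
```

With an odd number of runs, such as 3, a tie cannot occur, so a single noisy judgement is outvoted by the other two.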