EverOS/docs/locomo_benchmark.md
Elliot Chen 518b8eca85 chore: initialize EverOS 1.0.0
md-first memory extraction framework for AI agents.

Markdown is the single source of truth; SQLite holds state and LanceDB
provides the rebuildable vector + BM25 + scalar index. The codebase follows
a single-direction DDD layering (entrypoints -> service -> memory -> infra,
with component / core / config cross-cutting) enforced by import-linter.

Engineering surface:
- Coding conventions in .claude/rules/ (path-scoped) and workflows in
  .claude/skills/ (/commit, /new-branch, /pr).
- GitHub Actions CI runs make lint + test + integration; pre-commit mirrors
  the gates locally (ruff, hygiene hooks, gitlint commit-msg).
- Commit messages follow Conventional Commits, enforced by gitlint.
- make lint also enforces datetime two-zone discipline and OpenAPI drift.
2026-06-06 07:33:17 +08:00


Running the LoCoMo Benchmark

This guide walks through reproducing EverOS's LoCoMo retrieval scores locally using the hybrid and agentic search methods.


Prerequisites

  • Python 3.12, uv
  • A .env at the repo root with the LLM / embedding credentials EverOS needs:
    • EVEROS_LLM__MODEL, EVEROS_LLM__API_KEY, EVEROS_LLM__BASE_URL
    • EVEROS_EMBEDDING__*
    • EVEROS_RERANK__*
    • The benchmark driver also reads LLM_API_KEY / ANSWER_MODEL / JUDGE_MODEL for the answer + judge passes.
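Before starting the server it can help to confirm that the .env actually carries the settings listed above. The following is a minimal sketch, not part of EverOS: the helper names parse_env and missing_settings are hypothetical, the parser only understands plain KEY=VALUE lines (no export prefix, no multi-line values), and the EVEROS_EMBEDDING__* / EVEROS_RERANK__* groups are treated as satisfied when at least one key with that prefix is set, since the exact key names are not listed here.

```python
# Settings named in the prerequisites list above.
REQUIRED_KEYS = ["EVEROS_LLM__MODEL", "EVEROS_LLM__API_KEY", "EVEROS_LLM__BASE_URL"]
# Groups given only as a prefix; any one non-empty key counts (assumption).
REQUIRED_PREFIXES = ["EVEROS_EMBEDDING__", "EVEROS_RERANK__"]


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and one layer of quotes.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_settings(env: dict[str, str]) -> list[str]:
    """Return the required keys (or prefix groups) that have no value."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    for prefix in REQUIRED_PREFIXES:
        if not any(k.startswith(prefix) and v for k, v in env.items()):
            missing.append(prefix + "*")
    return missing
```

Calling missing_settings(parse_env(open(".env").read())) from the repo root returns an empty list when every listed setting is present. The answer and judge variables (LLM_API_KEY, ANSWER_MODEL, JUDGE_MODEL) are left out of the check because this guide does not say whether they have defaults.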

Install the project:

uv sync

1. Prepare the dataset

Place the LoCoMo file at data/locomo10.json (the dataset is distributed by the LoCoMo authors, not this repo). Override the path later with --data-path if you keep it elsewhere.
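A quick sanity check on the dataset file can save a failed run later. This sketch assumes only that the file is non-empty JSON; it does not validate the LoCoMo schema, and check_dataset is a hypothetical helper rather than something shipped with the repo.

```python
import json
from pathlib import Path


def check_dataset(path: str = "data/locomo10.json") -> int:
    """Return the number of top-level entries, or raise if the file is unusable."""
    file = Path(path)
    if not file.is_file():
        raise FileNotFoundError(f"LoCoMo file not found: {file}")
    with file.open(encoding="utf-8") as fh:
        data = json.load(fh)  # raises json.JSONDecodeError on malformed input
    if not isinstance(data, (list, dict)) or len(data) == 0:
        raise ValueError(f"{file} parsed but contains no entries")
    return len(data)
```

If you keep the file elsewhere, pass the same location you intend to give to --data-path.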

2. Start the server

EVEROS_MEMORY__ROOT=~/.everos \
uv run python -m everos.entrypoints.cli.main server start --port 8000

EVEROS_MEMORY__ROOT sets the server's data root, which keeps one benchmark's corpus separate from another's. Point it at a different directory (or rm -rf the current one) whenever you want a clean run.

Leave the server running in one terminal; run the benchmark from another.
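If you script the two terminals, the benchmark should not start until the server is accepting connections. The sketch below polls the TCP port used in the command above. It only confirms that something is listening on that port; it does not call any EverOS endpoint, and wait_for_port is a hypothetical helper.

```python
import socket
import time


def wait_for_port(host: str = "localhost", port: int = 8000, timeout: float = 30.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means a process is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)  # not up yet; retry shortly
    return False
```

A return value of False after the timeout usually means the server failed to start or is bound to a different port.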

3. Run hybrid

Single conversation:

bash tests/run_locomo_batch.sh \
  --conv-indices 0 \
  --methods hybrid \
  --base-url http://localhost:8000 \
  --top-k 10

All 10 conversations, 2-way parallel:

bash tests/run_locomo_batch.sh \
  --conv-indices 0-9 \
  --methods hybrid \
  --base-url http://localhost:8000 \
  --top-k 10 \
  --concurrency 2

The wrapper reads EVEROS_MEMORY__ROOT from its own environment so that the cascade poll path matches the server's data root. If the wrapper and the server run with different values, pass --corpus-path explicitly.

4. Run agentic

Same wrapper, swap --methods:

bash tests/run_locomo_batch.sh \
  --conv-indices 0-9 \
  --methods agentic \
  --base-url http://localhost:8000 \
  --top-k 10 \
  --concurrency 2

You can also benchmark multiple methods in one run; they share the same ingested corpus:

bash tests/run_locomo_batch.sh \
  --conv-indices 0-9 \
  --methods hybrid,agentic \
  --base-url http://localhost:8000 \
  --top-k 10 \
  --concurrency 2
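When driving several runs from a script, it can be convenient to assemble the wrapper invocation programmatically. This is a minimal sketch that uses only the flags shown in the commands above; build_benchmark_cmd is a hypothetical helper, and it passes --conv-indices through verbatim rather than interpreting it.

```python
import shlex
from typing import Optional


def build_benchmark_cmd(
    conv_indices: str,
    methods: list[str],
    base_url: str = "http://localhost:8000",
    top_k: int = 10,
    concurrency: Optional[int] = None,
) -> list[str]:
    """Build the argv list for tests/run_locomo_batch.sh."""
    cmd = [
        "bash", "tests/run_locomo_batch.sh",
        "--conv-indices", conv_indices,
        "--methods", ",".join(methods),  # e.g. "hybrid,agentic"
        "--base-url", base_url,
        "--top-k", str(top_k),
    ]
    if concurrency is not None:
        cmd += ["--concurrency", str(concurrency)]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_benchmark_cmd("0-9", ["hybrid", "agentic"], concurrency=2)))
```

Pass the returned list to subprocess.run(cmd, check=True) from the repo root to execute it.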

5. Where the results land

The default output root is benchmark_results/run_<timestamp>/; override it with --output-root. The layout is:

<output_root>/
├── conv0.json … conv9.json          # per-conv summary + per-question details
├── conv0.log  … conv9.log           # per-conv stdout (only in --concurrency >1 mode)
└── conv0_checkpoints/ …             # incremental search/answer/eval JSON

An aggregate accuracy table prints at the end of the wrapper run.
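To post-process a run yourself, start by locating the per-conversation summaries. The sketch below relies only on the conv<N>.json naming shown in the layout above and makes no assumption about the JSON schema inside each file; list_conv_results is a hypothetical helper.

```python
import re
from pathlib import Path


def list_conv_results(output_root: str) -> list[tuple[int, Path]]:
    """Return (conversation index, path) pairs for conv<N>.json, sorted by index."""
    pattern = re.compile(r"^conv(\d+)\.json$")
    found = []
    for path in Path(output_root).iterdir():
        match = pattern.match(path.name)
        # Skip logs, checkpoint directories, and anything else in the folder.
        if match and path.is_file():
            found.append((int(match.group(1)), path))
    # Sort numerically so that, for example, conv10 would follow conv9.
    return sorted(found, key=lambda pair: pair[0])
```

Each returned path can then be opened with json.load to inspect the per-question details.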

Notes

  • Re-running on the same corpus: add --skip-add to skip ingest and reuse what is already under the server's data root (~/.everos in the example above). Useful when comparing methods side by side.
  • Judge variance: --judge-runs 3 runs the judge three times per question and majority-votes; slower but reduces LLM-judge noise.
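The majority vote behind --judge-runs can be illustrated in a few lines. This is a sketch of the general idea, not the benchmark driver's code: it assumes each judge run yields a boolean verdict and, as a further assumption, counts ties as incorrect (ties cannot occur with an odd number of runs such as 3).

```python
from collections import Counter


def majority_vote(verdicts: list[bool]) -> bool:
    """Return True when more than half of the judge runs marked the answer correct."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    # Strict majority; an even split is treated as False (assumption).
    return counts[True] > len(verdicts) / 2
```

With three runs, a single dissenting verdict no longer flips the recorded outcome, which is where the noise reduction comes from.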