Elliot Chen 518b8eca85 chore: initialize EverOS 1.0.0

md-first memory extraction framework for AI agents.

Markdown is the single source of truth; SQLite holds state and LanceDB
provides the rebuildable vector + BM25 + scalar index. The codebase follows
a single-direction DDD layering (entrypoints -> service -> memory -> infra,
with component / core / config cross-cutting) enforced by import-linter.

Engineering surface:
- Coding conventions in .claude/rules/ (path-scoped) and workflows in
  .claude/skills/ (/commit, /new-branch, /pr).
- GitHub Actions CI runs make lint + test + integration; pre-commit mirrors
  the gates locally (ruff, hygiene hooks, gitlint commit-msg).
- Commit messages follow Conventional Commits, enforced by gitlint.
- make lint also enforces datetime two-zone discipline and OpenAPI drift.

2026-06-06 07:33:17 +08:00

Running the LoCoMo Benchmark

This guide walks through reproducing EverOS's LoCoMo retrieval scores locally using the hybrid and agentic search methods.

Prerequisites
1. Prepare the dataset
2. Start the server
3. Run hybrid
4. Run agentic
5. Where the results land
Notes

Prerequisites

- Python 3.12 and uv
- A .env at the repo root with the LLM, embedding, and rerank credentials EverOS needs:
  - EVEROS_LLM__MODEL, EVEROS_LLM__API_KEY, EVEROS_LLM__BASE_URL
  - EVEROS_EMBEDDING__*
  - EVEROS_RERANK__*
  - The benchmark driver also reads LLM_API_KEY / ANSWER_MODEL / JUDGE_MODEL for the answer + judge passes.
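
A quick pre-flight check can catch a missing credential before a long run. The sketch below is not part of EverOS: missing_env_keys is a hypothetical helper, and it assumes the .env file uses plain KEY=value lines. It only checks the keys you pass in, so the wildcard groups (EVEROS_EMBEDDING__*, EVEROS_RERANK__*) still need to be checked by hand.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of EverOS): print every required key that has
# no KEY=... line in the given env file. Assumes plain KEY=value syntax.
missing_env_keys() {
  local env_file="$1"; shift
  local key
  for key in "$@"; do
    # -q: no output, exit status only. An optional leading "export " is tolerated.
    grep -qE "^(export[[:space:]]+)?${key}=" "$env_file" || printf '%s\n' "$key"
  done
}
```

Calling missing_env_keys .env EVEROS_LLM__MODEL EVEROS_LLM__API_KEY EVEROS_LLM__BASE_URL prints nothing when all three keys are present, and one key per line otherwise.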

Install the project:

uv sync

1. Prepare the dataset

Place the LoCoMo file at data/locomo10.json (the dataset is distributed by the LoCoMo authors, not this repo). Override the path later with --data-path if you keep it elsewhere.

2. Start the server

EVEROS_MEMORY__ROOT=~/.everos \
uv run python -m everos.entrypoints.cli.main server start --port 8000

EVEROS_MEMORY__ROOT isolates one benchmark's corpus from another. Point it at a different directory (or delete the directory with rm -rf) whenever you want a clean run.

Leave the server running in one terminal; run the benchmark from another.
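
If you script the two terminals together, it helps to wait until the server port accepts connections before starting the benchmark. The following is a minimal sketch, not an EverOS feature: wait_for_port is a hypothetical helper that relies on bash's /dev/tcp redirection, and an open port only shows that something is listening, not that the server has finished starting up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll a TCP port until it accepts a connection or the
# timeout (in seconds, default 30) expires. Requires bash for /dev/tcp.
wait_for_port() {
  local host="$1" port="$2" timeout="${3:-30}"
  local deadline=$(( SECONDS + timeout ))
  while (( SECONDS < deadline )); do
    # The subshell opens the connection and closes it again on exit.
    if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
      return 0
    fi
    sleep 1
  done
  return 1
}
```

For example, wait_for_port localhost 8000 60 returns 0 as soon as port 8000 is reachable and 1 if it is still closed after 60 seconds.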

3. Run `hybrid`

Single conversation:

bash tests/run_locomo_batch.sh \
  --conv-indices 0 \
  --methods hybrid \
  --base-url http://localhost:8000 \
  --top-k 10

All 10 conversations, 2-way parallel:

bash tests/run_locomo_batch.sh \
  --conv-indices 0-9 \
  --methods hybrid \
  --base-url http://localhost:8000 \
  --top-k 10 \
  --concurrency 2

The wrapper picks up EVEROS_MEMORY__ROOT from the environment so the cascade poll path matches the server's data root. If the wrapper's environment holds a different value than the server's, pass --corpus-path explicitly.

4. Run `agentic`

Same wrapper, swap --methods:

bash tests/run_locomo_batch.sh \
  --conv-indices 0-9 \
  --methods agentic \
  --base-url http://localhost:8000 \
  --top-k 10 \
  --concurrency 2

You can also benchmark multiple methods in one run; they share the same ingested corpus:

bash tests/run_locomo_batch.sh \
  --conv-indices 0-9 \
  --methods hybrid,agentic \
  --base-url http://localhost:8000 \
  --top-k 10 \
  --concurrency 2

5. Where the results land

The default output root is benchmark_results/run_<timestamp>/ (override it with --output-root). It is laid out as follows:

<output_root>/
├── conv0.json … conv9.json          # per-conv summary + per-question details
├── conv0.log  … conv9.log           # per-conv stdout (only in --concurrency >1 mode)
└── conv0_checkpoints/ …             # incremental search/answer/eval JSON

An aggregate accuracy table prints at the end of the wrapper run.
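
After several runs it is easy to lose track of which run_<timestamp> directory is the newest. The sketch below picks it by modification time, since the exact timestamp format is not specified here. latest_run_dir is a hypothetical helper, not something the wrapper provides.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the most recently modified run_* directory under
# a results root (default: benchmark_results). Returns non-zero if none exist.
latest_run_dir() {
  local root="${1:-benchmark_results}"
  local newest="" dir
  for dir in "$root"/run_*/; do
    [ -d "$dir" ] || continue   # the glob matched nothing
    if [ -z "$newest" ] || [ "$dir" -nt "$newest" ]; then
      newest="$dir"
    fi
  done
  [ -n "$newest" ] && printf '%s\n' "${newest%/}"
}
```

With that in place, ls "$(latest_run_dir)"/conv*.json lists the per-conversation summaries of the latest run.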

Notes

- Re-running on the same corpus: add --skip-add to skip ingest and reuse what's already in the memory root (~/.everos in the example above). Useful when comparing methods side by side.
- Judge variance: --judge-runs 3 runs the judge three times per question and takes a majority vote; this is slower but reduces LLM-judge noise.
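
To make the majority vote concrete, here is a small illustration of the general idea. It is not the benchmark driver's implementation, and the labels CORRECT and WRONG are placeholders rather than actual judge output.

```shell
#!/usr/bin/env bash
# Illustration only: print the label that occurs most often among the
# arguments. With an odd number of binary verdicts there is never a tie.
majority_vote() {
  printf '%s\n' "$@" \
    | sort \
    | uniq -c \
    | sort -k1,1nr \
    | head -n 1 \
    | awk '{print $2}'
}
```

With three verdicts, majority_vote CORRECT WRONG CORRECT prints CORRECT: a single dissenting judge run no longer flips the result for that question.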
