chore: initialize EverOS 1.0.0

md-first memory extraction framework for AI agents. Markdown is the single source of truth; SQLite holds state and LanceDB provides the rebuildable vector + BM25 + scalar index.

The codebase follows a single-direction DDD layering (entrypoints -> service -> memory -> infra, with component / core / config cross-cutting) enforced by import-linter.

Engineering surface:
- Coding conventions in .claude/rules/ (path-scoped) and workflows in .claude/skills/ (/commit, /new-branch, /pr).
- GitHub Actions CI runs make lint + test + integration; pre-commit mirrors the gates locally (ruff, hygiene hooks, gitlint commit-msg).
- Commit messages follow Conventional Commits, enforced by gitlint.
- make lint also enforces datetime two-zone discipline and OpenAPI drift.
2026-06-05 22:35:51 +08:00
commit 518b8eca85
636 changed files with 160553 additions and 0 deletions
--- a/docs/cascade_runbook.md
+++ b/docs/cascade_runbook.md
@@ -0,0 +1,271 @@
+# Cascade Runbook
+
+The cascade daemon keeps LanceDB in sync with the markdown files under
+the memory root. Service / entry points only ever write markdown; the
+daemon is the **sole** writer of the LanceDB index. This runbook covers
+the recurring operational questions.
+
+## What runs where
+
+When `everos server start` boots, the FastAPI lifespan wires four
+providers in order:
+
+1. **Metrics** — Prometheus collector.
+2. **SQLite** — system DB + schema (`SQLModel.metadata.create_all`).
+3. **LanceDB** — async connection + schema verification + FTS indexes.
+4. **Cascade** — watcher + scanner + worker, all in-process tasks.
+
+The cascade subsystem itself is three independent loops:
+
+| Loop | Source signal | Effect |
+|---|---|---|
+| Watcher | `watchdog` filesystem events (sync thread) | `md_change_state.upsert` per registered kind |
+| Scanner | Periodic walk (`scan_interval_seconds`, default 30 s) | Same — catches changes the watcher missed |
+| Worker | `claim_pending_batch` polling (default 1 s when idle) | Handler dispatch → LanceDB upsert / delete |
+
+Every loop talks to the same `md_change_state` SQLite table. The
+worker's claim model (`pending → processing → done/failed`) keeps
+concurrent workers honest.
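
The claim step can be sketched with plain `sqlite3`. This is a minimal illustration under stated assumptions, not the EverOS implementation: the column names (`id`, `path`, `status`) and the function body are hypothetical; only the table name, the `claim_pending_batch` name, and the status values come from this runbook.

```python
import sqlite3


def claim_pending_batch(conn: sqlite3.Connection, limit: int = 50) -> list[tuple[int, str]]:
    """Flip up to `limit` pending rows to 'processing' and return the claimed rows."""
    claimed: list[tuple[int, str]] = []
    with conn:  # single transaction: commits on success, rolls back on error
        candidates = conn.execute(
            "SELECT id, path FROM md_change_state "
            "WHERE status = 'pending' ORDER BY id LIMIT ?",
            (limit,),
        ).fetchall()
        for row_id, path in candidates:
            # The status guard makes the flip a no-op if another worker won the row.
            cur = conn.execute(
                "UPDATE md_change_state SET status = 'processing' "
                "WHERE id = ? AND status = 'pending'",
                (row_id,),
            )
            if cur.rowcount == 1:
                claimed.append((row_id, path))
    return claimed


# Hypothetical three-row queue for illustration (assumed columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE md_change_state (id INTEGER PRIMARY KEY, path TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO md_change_state (path, status) VALUES (?, ?)",
    [("a.md", "pending"), ("b.md", "done"), ("c.md", "pending")],
)
conn.commit()

first = claim_pending_batch(conn)   # claims both pending rows
second = claim_pending_batch(conn)  # nothing left to claim
```

The guarded `UPDATE` is what matters here: a row already flipped to `processing` no longer matches `status = 'pending'`, so a second claimer skips it.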
+
+## Health: `everos cascade status`
+
+```
+queue:
+  pending:                   3
+  done:                      1247
+  failed (retryable=TRUE):   1     (eligible for `cascade fix --apply`)
+  failed (retryable=FALSE):  1     (fix md and re-save to recover)
+lsn:
+  max:           1252
+  last_processed: 1250
+  lag:            2
+```
+
+- `lag > 0` means the worker is behind. Steady state should hover near
+  zero; sustained lag points at a slow handler or a stuck retry.
+- `failed (retryable=FALSE)` is always user-actionable. Cascade will
+  never auto-clear these — they represent malformed md the user must
+  edit.
+
+## Recovering from failures: `everos cascade fix`
+
+`cascade fix` (no flag) lists every failed row. With `--apply`:
+
+1. `UPDATE md_change_state SET status='pending', retry_count=0
+   WHERE status='failed' AND retryable=TRUE` (the partial index
+   `idx_md_change_retryable` makes this O(retryable)).
+2. Drain the worker once so the retry runs synchronously.
+
+Retryable failures cover transient embedding / HTTP errors (5xx, 429,
+network resets) that persist after the inline retries (`MAX_RETRY=3`)
+are exhausted. The fix command resets the counter so a working backend
+gets a clean start.
+
+`retryable=FALSE` rows require the user to edit the md (typically a
+YAML frontmatter issue) and re-save; the watcher picks the change up
+naturally.
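
The requeue step of `--apply` can be reproduced against an in-memory database. This is a sketch under assumptions: the table columns and the partial index definition are guesses made for illustration (only the index name and the `UPDATE` statement appear in this runbook), and `retryable` is stored as `1` / `0` here.

```python
import sqlite3


def requeue_retryable(conn: sqlite3.Connection) -> int:
    """Reset retryable failures to 'pending' and return how many rows moved."""
    with conn:
        cur = conn.execute(
            "UPDATE md_change_state SET status = 'pending', retry_count = 0 "
            "WHERE status = 'failed' AND retryable = 1"
        )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
# Assumed schema; the real table likely carries more columns.
conn.execute(
    "CREATE TABLE md_change_state ("
    "path TEXT PRIMARY KEY, status TEXT, retryable INTEGER, retry_count INTEGER)"
)
# Assumed definition of the partial index: it only covers retryable failures,
# so its size tracks the number of such rows rather than the whole table.
conn.execute(
    "CREATE INDEX idx_md_change_retryable ON md_change_state (path) "
    "WHERE status = 'failed' AND retryable = 1"
)
conn.executemany(
    "INSERT INTO md_change_state VALUES (?, ?, ?, ?)",
    [("a.md", "failed", 1, 3), ("b.md", "failed", 0, 3), ("c.md", "done", 0, 0)],
)
conn.commit()

requeued = requeue_retryable(conn)
statuses = dict(conn.execute("SELECT path, status FROM md_change_state"))
```

The non-retryable row (`b.md` in this example) stays `failed`, which matches the rule that those rows only recover after the md is edited and re-saved.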
+
+## One-shot replay: `everos cascade sync [PATH]`
+
+Use this when the watcher missed an event (WSL mount, network share,
+external editor with no inotify) or when you want a deterministic
+flush before, say, a smoke test:
+
+```bash
+everos cascade sync                           # drain everything pending
+everos cascade sync users/u1/episodes/X.md    # re-enqueue + drain
+```
+
+The CLI builds the same `CascadeOrchestrator` as the daemon but only
+calls `sync_once` / `drain_once` — no watcher / scanner background
+task. So it's safe to run in parallel with a live `everos server`.
+
+## Recovery paths
+
+### LanceDB schema drift on startup
+
+`LanceDBLifespanProvider.startup` calls `verify_business_schemas`. If
+an on-disk table has columns the current Pydantic schema does not
+declare (or vice versa), the boot fails with:
+
+```
+LanceDB table 'episode' schema drift: missing=[...], extra=[...].
+The index is rebuildable from md — recover with
+`rm -rf ~/.everos/.index/lancedb` and restart.
+```
+
+This is the documented recovery: delete the index and restart the
+server. The scanner picks up every md file on its first sweep and
+the worker repopulates LanceDB. Markdown is the source of truth, so
+no data is lost.
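
The drift check amounts to a two-way set difference. The sketch below is illustrative only: the function name, the column names, and the reading of `missing` as "declared but absent on disk" and `extra` as "on disk but not declared" are assumptions, not the behaviour of `verify_business_schemas`.

```python
def schema_drift(on_disk: set[str], declared: set[str]) -> tuple[list[str], list[str]]:
    """Return (missing, extra) column names between a table and its schema."""
    missing = sorted(declared - on_disk)  # declared in code, absent from the table
    extra = sorted(on_disk - declared)    # present in the table, unknown to the code
    return missing, extra


# Hypothetical column sets for an 'episode' table.
on_disk = {"id", "text", "vector", "legacy_score"}
declared = {"id", "text", "vector", "parent_id"}

missing, extra = schema_drift(on_disk, declared)
if missing or extra:
    message = f"LanceDB table 'episode' schema drift: missing={missing}, extra={extra}."
```

Any non-empty difference in either direction counts as drift, which is why both adding and removing a column lead to the same recovery.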
+
+### inotify watch-limit exhaustion (Linux)
+
+The kernel limit has traditionally defaulted to 8 192 watches per user
+(newer kernels may scale it with available memory). On a sizeable
+memory root the watcher may silently miss events. Symptoms:
+
+- Scanner catches the file changes but the watcher never logs an
+  event for the same path.
+- The number of watches in use has reached the limit reported by
+  `cat /proc/sys/fs/inotify/max_user_watches`.
+
+Fix by bumping the kernel parameter:
+
+```bash
+echo fs.inotify.max_user_watches=524288 | sudo tee -a /etc/sysctl.conf
+sudo sysctl -p
+```
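
The limit file reports the ceiling, not current usage. One way to estimate usage is to count `inotify wd:` lines in `/proc/<pid>/fdinfo/*`. The helper names below are hypothetical and the sample text is made up for illustration; treat this as a Linux-only sketch.

```python
from pathlib import Path


def count_watches(fdinfo_text: str) -> int:
    """Count inotify watch entries in the text of one /proc/<pid>/fdinfo/<fd> file."""
    return sum(1 for line in fdinfo_text.splitlines() if line.startswith("inotify wd:"))


def watches_for_pid(pid: int) -> int:
    """Sum watches over every fd of one process (Linux only, needs read access)."""
    total = 0
    try:
        entries = list(Path(f"/proc/{pid}/fdinfo").iterdir())
    except OSError:
        return 0
    for entry in entries:
        try:
            total += count_watches(entry.read_text())
        except OSError:
            continue  # fd closed while we were reading
    return total


def watch_limit() -> int:
    """Read the per-user ceiling the kernel enforces."""
    return int(Path("/proc/sys/fs/inotify/max_user_watches").read_text())


# Made-up fdinfo content for an inotify fd holding two watches.
SAMPLE_FDINFO = (
    "pos:\t0\n"
    "flags:\t02004000\n"
    "mnt_id:\t15\n"
    "inotify wd:2 ino:1a2b sdev:800001 mask:3ce ignored_mask:0\n"
    "inotify wd:1 ino:1a2c sdev:800001 mask:3ce ignored_mask:0\n"
)
sample_count = count_watches(SAMPLE_FDINFO)
```

Because the limit is per user, compare `watch_limit()` against the sum over all of that user's processes, not only the EverOS daemon.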
+
+### WSL2 / network mounts
+
+Filesystem events do not propagate from the Windows host into WSL2
+(or across most SMB / NFS shares). The watcher will start without
+error and silently see nothing.
+
+Workarounds:
+
+- Rely on the scanner: at the default 30 s interval, a change can
+  wait up to one sweep before it is enqueued, but the index is
+  eventually consistent.
+- Drop the scan interval to ~5 s if the memory root is small.
+- Run `everos cascade sync` explicitly after batch edits.
+
+### Daemon process crash mid-batch
+
+`claim_pending_batch` flips rows to `processing` *atomically*. If the
+process dies before `mark_done` / `mark_failed`, those rows stay in
+`processing` until the next boot. **The orchestrator auto-recovers**
+on startup: `CascadeOrchestrator.start` calls
+`md_change_state_repo.recover_orphan_processing()` before launching
+the watcher / scanner / worker, which resets every `processing` row
+back to `pending`. Single-process cascade means no race — at boot
+time no other worker could legitimately own a `processing` row.
+
+No operator action required; the structured log line
+`cascade_recovered_orphan_processing` reports the count when it
+fires.
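
The recovery step itself is a single statement. The sketch below assumes a minimal table layout and is not the actual `recover_orphan_processing` implementation; only the method name and the status values are taken from this runbook.

```python
import sqlite3


def recover_orphan_processing(conn: sqlite3.Connection) -> int:
    """Reset rows left in 'processing' by a crashed run; return the count."""
    with conn:
        cur = conn.execute(
            "UPDATE md_change_state SET status = 'pending' WHERE status = 'processing'"
        )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
# Assumed minimal schema for illustration.
conn.execute("CREATE TABLE md_change_state (path TEXT PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO md_change_state VALUES (?, ?)",
    [("a.md", "processing"), ("b.md", "processing"), ("c.md", "done")],
)
conn.commit()

recovered = recover_orphan_processing(conn)  # the count the log line would report
pending = conn.execute(
    "SELECT COUNT(*) FROM md_change_state WHERE status = 'pending'"
).fetchone()[0]
```

This blanket reset is only safe because it runs before any worker starts; with a worker already running it would steal rows that are legitimately in flight.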
+
+### FD exhaustion (`os error 24` / EMFILE)
+
+Symptoms (any of these on a long-running daemon):
+
+- LanceDB query / index build fails with `lance error: ... Too many
+  open files (os error 24)`.
+- `lsof -p <pid> | wc -l` grows monotonically over hours / days.
+- Health log lines like `cascade_lancedb_optimize_failed` /
+  `cascade_lancedb_rebuild_failed` carrying `OSError: [Errno 24]`.
+
+Cause (verified against `lance crate 4.0`): the LanceDB *index* cache
+(`GlobalIndexCache`) holds one reader object per opened FTS / vector
+/ scalar index, and each reader pins the file descriptors of its
+`_indices/<uuid>/...` files. With a long-running daemon and steady-
+state cascade ingest, every `optimize()` call adds new readers; with
+LanceDB's own default (`index_cache_size_bytes=None`, unbounded), they
+**are never evicted** and the FDs leak monotonically.
+
+`drop_index` does **not** help — it is a manifest-only operation and
+leaves the on-disk UUID directories untouched. Even an explicit
+`optimize(cleanup_older_than=0)` `unlink()`-ing the files does not
+release FDs: POSIX keeps the inode alive as long as a process holds
+an open FD on it (the entries show as `(deleted)` in `lsof`). Only an
+LRU eviction inside the cache (or a connection close) actually closes
+the FDs.
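
The POSIX behaviour described above is easy to reproduce without LanceDB. The snippet uses only the standard library and a temporary file as a stand-in for an index file; it assumes a POSIX system.

```python
import os
import tempfile

# Stand-in for an index file that a cached reader keeps open.
fd, path = tempfile.mkstemp()
os.write(fd, b"index bytes")

# Simulate cleanup removing the file from disk.
os.unlink(path)
path_gone = not os.path.exists(path)

# The descriptor still works: the inode lives until the last FD is closed.
os.lseek(fd, 0, os.SEEK_SET)
still_readable = os.read(fd, 64)

# Only closing the descriptor releases it (what cache eviction achieves).
os.close(fd)
try:
    os.fstat(fd)
    closed = False
except OSError:
    closed = True
```

While the descriptor is open, `lsof` lists the path with a `(deleted)` suffix, which is the signature to look for on a leaking daemon.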
+
+Fix (already wired in `LanceDBSettings.index_cache_size_bytes` —
+default 16 MB, ~290 FD ceiling): see
+[Tuning knobs § LanceDB index cache](#lancedb-index-cache-index_cache_size_bytes)
+for the sizing table and the env-var override path.
+
+If you have already hit EMFILE in a running process, the cleanest
+recovery is a daemon restart — the open connection closes, every FD
+is released, and the next start comes up with the capped Session in
+place.
+
+## Tuning knobs
+
+### Cascade scheduler knobs
+
+All defaults live in `everos.memory.cascade.orchestrator.CascadeConfig`
+and `everos.memory.cascade.worker.CascadeWorker`:
+
+| Knob | Default | Effect |
+|---|---|---|
+| `scan_interval_seconds` | 30 | Scanner sweep cadence |
+| `worker_batch_size` | 50 | Rows claimed per worker cycle |
+| `worker_max_retry` | 3 | Inline retries before `mark_failed(retryable=TRUE)` |
+| `worker_poll_interval_seconds` | 1 | Idle wait between empty drain attempts |
+| `worker_retry_backoff_seconds` | 2 | Exponential backoff seed; doubles per attempt |
+
+Tuning surface is intentionally not in `Settings` yet — once we have
+wall-clock numbers from real workloads, the values that need
+operator override will surface there.
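
To make the retry timing concrete, the defaults above can be turned into a wait schedule. `CascadeKnobs` and `backoff_schedule` are hypothetical names for this sketch, and it assumes the first wait equals the seed with each later wait doubling; the real worker may count attempts differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CascadeKnobs:
    """Hypothetical container mirroring the defaults in the table above."""

    scan_interval_seconds: float = 30
    worker_batch_size: int = 50
    worker_max_retry: int = 3
    worker_poll_interval_seconds: float = 1
    worker_retry_backoff_seconds: float = 2


def backoff_schedule(knobs: CascadeKnobs) -> list[float]:
    """Waits before each inline retry: seed, then doubled per attempt (assumed)."""
    return [
        knobs.worker_retry_backoff_seconds * 2**attempt
        for attempt in range(knobs.worker_max_retry)
    ]


schedule = backoff_schedule(CascadeKnobs())
worst_case_wait = sum(schedule)  # total sleep before the row is marked failed
```

Under these assumptions a row that keeps failing spends 14 s in backoff (2 + 4 + 8) before it is marked `failed (retryable=TRUE)`.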
+
+### LanceDB index cache (`index_cache_size_bytes`)
+
+Lives in `LanceDBSettings`; overridable via the
+`EVEROS_LANCEDB__INDEX_CACHE_SIZE_BYTES` environment variable. This
+is the only knob that bounds the steady-state file-descriptor count
+of a long-running EverOS daemon — see
+[Recovery paths § FD exhaustion](#fd-exhaustion-os-error-24-emfile)
+for why nothing else (prune, rebuild, `drop_index`) helps.
+
+Measured cap → FD ceiling (30 add+optimize cycles + 100-query stress
+on the real `Episode` schema):
+
+| Cap | FD ceiling | Query latency (p50) | Safe under `ulimit -n` |
+|---|---|---|---|
+| `2 MB` | ~45 | ~5 ms | macOS default 256 (5× headroom) |
+| `4 MB` | ~52 | ~3 ms | macOS default 256 |
+| `8 MB` | ~140 | ~2.4 ms | macOS default 256 (1.8× headroom) |
+| **`16 MB`** (default) | **~290** | **~2.3 ms** | **Linux default 1024 (3.5× headroom); macOS needs `ulimit -n 1024`** |
+| `32 MB` | ~630 | ~1.4 ms | Linux default 1024 (1.6× headroom) |
+| `unbounded` | grows forever | ~1.3 ms | NEVER use in a daemon |
+
+EverOS's measured steady-state working set after a `rebuild_indexes`
+cycle is roughly **50-100 readers / 3-6 MB resident** (5 tables × ~7
+BM25 columns × ~10 `part_N` reader entries each), so the 16 MB default
+provides ~3× headroom for burst traffic and stale-but-not-yet-evicted
+readers.
+
+When to override:
+
+- **Tight `ulimit -n` environments** (containers; macOS dev boxes
+  that haven't bumped the default 256) → drop to `4 MB` or `8 MB`.
+  Per the table above, p50 query latency rises by under 1 ms and
+  correctness is unaffected.
+- **Larger working sets** (many more tables or much wider FTS
+  indexes than the default schema set) → bump to `32-64 MB`. Verify
+  your platform's `ulimit -n` covers the corresponding FD ceiling
+  with at least 2× headroom.
+- **Diagnostic-only**: set to a tiny value (e.g. `1 MB`) to
+  *force* LRU thrashing and reproduce cache-miss latency in tests.
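
The headroom rule can be checked mechanically. The sketch below hard-codes the approximate FD ceilings from the table above and reads the soft limit through the standard `resource` module (POSIX only); the function names are hypothetical and the numbers are only as good as that measurement.

```python
import resource  # POSIX only

# Cap in MB -> approximate FD ceiling, copied from the measured table.
MEASURED_FD_CEILING = {2: 45, 4: 52, 8: 140, 16: 290, 32: 630}


def fd_headroom(cap_mb: int, soft_limit: float) -> float:
    """How many times the measured FD ceiling fits into the soft limit."""
    return soft_limit / MEASURED_FD_CEILING[cap_mb]


def largest_safe_cap(soft_limit: float, min_headroom: float = 2.0):
    """Largest measured cap that keeps at least `min_headroom`, or None."""
    safe = [
        cap for cap in sorted(MEASURED_FD_CEILING)
        if fd_headroom(cap, soft_limit) >= min_headroom
    ]
    return safe[-1] if safe else None


def current_soft_limit() -> float:
    """Soft RLIMIT_NOFILE of this process (what `ulimit -n` prints)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return float("inf") if soft == resource.RLIM_INFINITY else float(soft)


macos_default = largest_safe_cap(256)    # 8 MB only leaves ~1.8x
linux_default = largest_safe_cap(1024)   # 32 MB only leaves ~1.6x
```

Calling `largest_safe_cap(current_soft_limit())` inside the target container gives a starting value for `EVEROS_LANCEDB__INDEX_CACHE_SIZE_BYTES`, expressed in MB rather than bytes.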
+
+Do **not** set `metadata_cache_size_bytes` — it is intentionally left
+at LanceDB's default (unbounded) because the metadata cache holds
+parsed manifests / fragment stats and has zero effect on FD count;
+capping it just thrashes parsing work without solving anything.
+
+## Concurrency
+
+The worker is async, not multi-process. Inside one drain cycle,
+`asyncio.gather(*[_process_one(row) for row in batch])` runs every
+claimed row concurrently — cascade is IO-bound (embedding HTTP calls
+dominate wall time) so single-process coroutine concurrency saturates
+the bottleneck. The `worker_batch_size` knob (default 50) caps
+in-flight rows.
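
A stripped-down version of one drain cycle looks like this. The handler body, the row values, and the use of `return_exceptions=True` to separate successes from failures are assumptions made for the sketch; only the `asyncio.gather` over `_process_one` comes from the description above.

```python
import asyncio


async def _process_one(row: str) -> str:
    """Stand-in handler: the sleep plays the role of an embedding HTTP call."""
    await asyncio.sleep(0.01)
    if row.endswith(".bad"):
        raise ValueError(f"malformed frontmatter in {row}")
    return row


async def drain_once(batch: list[str]) -> tuple[list[str], list[str]]:
    """Process every claimed row concurrently; split results into done / failed."""
    results = await asyncio.gather(
        *[_process_one(row) for row in batch],
        return_exceptions=True,  # one failing row must not cancel the others
    )
    done = [row for row, res in zip(batch, results) if not isinstance(res, BaseException)]
    failed = [row for row, res in zip(batch, results) if isinstance(res, BaseException)]
    return done, failed


done, failed = asyncio.run(drain_once(["a.md", "b.bad", "c.md"]))
```

Because every coroutine awaits its IO at the same time, the cycle takes roughly as long as the slowest row rather than the sum of all rows.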
+
+Multi-process workers are a scaling axis we'd reach for only if a
+single process becomes CPU-bound, which the current design does not
+anticipate. `claim_pending_batch` is already race-safe (the
+`WHERE status='pending'` filter ensures each row lands in exactly
+one batch even if multiple workers race), so adding processes later
+is a deployment-side change with no schema work.
+
+## What cascade does NOT do (yet)
+
+- **Schema migration**: LanceDB column changes require deleting the
+  index (`rm -rf ~/.everos/.index/lancedb`) and rebuilding it from md.
+- **Parent-id back-link**: Episode rows currently carry
+  `parent_id=None`; the writer doesn't preserve the source memcell id
+  in the entry inline. Tracked separately.
+- **Reference-file change detection (agent_skill)**: edits to
+  `references/*.md` siblings won't trigger a re-index — only changes
+  to `SKILL.md` itself fire the watcher. Workaround: run
+  `everos cascade sync agents/<a>/skills/skill_<n>/SKILL.md` after
+  editing references.