# Cascade Runbook The cascade daemon keeps LanceDB in sync with the markdown files under the memory root. Service / entry points only ever write markdown; the daemon is the **sole** writer of the LanceDB index. This runbook covers the recurring operational questions. ## What runs where When `everos server start` boots, the FastAPI lifespan wires four providers in order: 1. **Metrics** — Prometheus collector. 2. **SQLite** — system DB + schema (`SQLModel.metadata.create_all`). 3. **LanceDB** — async connection + schema verification + FTS indexes. 4. **Cascade** — watcher + scanner + worker, all in-process tasks. The cascade subsystem itself is three independent loops: | Loop | Source signal | Effect | |---|---|---| | Watcher | `watchdog` filesystem events (sync thread) | `md_change_state.upsert` per registered kind | | Scanner | Periodic walk (`scan_interval_seconds`, default 30 s) | Same — catches changes the watcher missed | | Worker | `claim_pending_batch` polling (default 1 s when idle) | Handler dispatch → LanceDB upsert / delete | Every loop talks to the same `md_change_state` sqlite table. The worker's claim mode (`pending → processing → done/failed`) keeps concurrent workers honest. ## Health: `everos cascade status` ``` queue: pending: 3 done: 1247 failed (retryable=TRUE): 1 (eligible for `cascade fix --apply`) failed (retryable=FALSE): 1 (fix md and re-save to recover) lsn: max: 1252 last_processed: 1250 lag: 2 ``` - `lag > 0` means the worker is behind. Steady state should hover near zero; sustained lag points at a slow handler or a stuck retry. - `failed (retryable=FALSE)` is always user-actionable. Cascade will never auto-clear these — they represent malformed md the user must edit. ## Recovering from failures: `everos cascade fix` `cascade fix` (no flag) lists every failed row. With `--apply`: 1. `UPDATE md_change_state SET status='pending', retry_count=0 WHERE status='failed' AND retryable=TRUE` (the partial index `idx_md_change_retryable` makes this O(retryable)). 2. Drain the worker once so the retry runs synchronously. Retryable failures cover transient embedding / HTTP errors (5xx, 429, network resets) after the inline `MAX_RETRY=3` was exhausted. The fix command resets the counter so a working backend gets a clean start. `retryable=FALSE` rows require the user to edit the md (typically a YAML frontmatter issue) and re-save; the watcher picks the change up naturally. ## One-shot replay: `everos cascade sync [PATH]` Use this when the watcher missed an event (WSL mount, network share, external editor with no inotify) or when you want a deterministic flush before, say, a smoke test: ```bash everos cascade sync # drain everything pending everos cascade sync users/u1/episodes/X.md # re-enqueue + drain ``` The CLI builds the same `CascadeOrchestrator` as the daemon but only calls `sync_once` / `drain_once` — no watcher / scanner background task. So it's safe to run in parallel with a live `everos server`. ## Recovery paths ### LanceDB schema drift on startup `LanceDBLifespanProvider.startup` calls `verify_business_schemas`. If an on-disk table has columns the current Pydantic schema does not declare (or vice versa), the boot fails with: ``` LanceDB table 'episode' schema drift: missing=[...], extra=[...]. The index is rebuildable from md — recover with `rm -rf ~/.everos/.index/lancedb` and restart. ``` This is the documented recovery: delete the index, restart the server, the scanner will pick up every md file on its first sweep and the worker repopulates LanceDB. Markdown is the source of truth, so no data is lost. ### inotify watch-limit exhaustion (Linux) Default kernel limit is 8 192 watches per user. On a sizeable memory root the watcher may silently miss events. Symptoms: - Scanner catches the file changes but the watcher never logs an event for the same path. - `cat /proc/sys/fs/inotify/max_user_watches` is at the limit. Fix by bumping the kernel parameter: ```bash echo fs.inotify.max_user_watches=524288 | sudo tee -a /etc/sysctl.conf sudo sysctl -p ``` ### WSL2 / network mounts Filesystem events do not propagate from the Windows host into WSL2 (or across most SMB / NFS shares). The watcher will start without error and silently see nothing. Workarounds: - Rely on the scanner — at default 30 s interval, throughput is bounded but eventually-consistent. - Drop the scan interval to ~5 s if the memory root is small. - Run `everos cascade sync` explicitly after batch edits. ### Daemon process crash mid-batch `claim_pending_batch` flips rows to `processing` *atomically*. If the process dies before `mark_done` / `mark_failed`, those rows stay in `processing` until the next boot. **The orchestrator auto-recovers** on startup: `CascadeOrchestrator.start` calls `md_change_state_repo.recover_orphan_processing()` before launching the watcher / scanner / worker, which resets every `processing` row back to `pending`. Single-process cascade means no race — at boot time no other worker could legitimately own a `processing` row. No operator action required; the structured log line `cascade_recovered_orphan_processing` reports the count when it fires. ### FD exhaustion (`os error 24` / EMFILE) Symptoms (any of these on a long-running daemon): - LanceDB query / index build fails with `lance error: ... Too many open files (os error 24)`. - `lsof -p | wc -l` grows monotonically over hours / days. - Health log lines like `cascade_lancedb_optimize_failed` / `cascade_lancedb_rebuild_failed` carrying `OSError: [Errno 24]`. Cause (verified against `lance crate 4.0`): the LanceDB *index* cache (`GlobalIndexCache`) holds one reader object per opened FTS / vector / scalar index, and each reader pins the file descriptors of its `_indices//...` files. With a long-running daemon and steady- state cascade ingest, every `optimize()` call adds new readers; with LanceDB's own default (`index_cache_size_bytes=None`, unbounded), they **are never evicted** and the FDs leak monotonically. `drop_index` does **not** help — it is a manifest-only operation and leaves the on-disk UUID directories untouched. Even an explicit `optimize(cleanup_older_than=0)` `unlink()`-ing the files does not release FDs: POSIX keeps the inode alive as long as a process holds an open FD on it (the entries show as `(deleted)` in `lsof`). Only an LRU eviction inside the cache (or a connection close) actually closes the FDs. Fix (already wired in `LanceDBSettings.index_cache_size_bytes` — default 16 MB, ~290 FD ceiling): see [Tuning knobs § LanceDB index cache](#lancedb-index-cache-index_cache_size_bytes) for the sizing table and the env-var override path. If you have already hit EMFILE in a running process, the cleanest recovery is a daemon restart — the open connection closes, every FD is released, and the next start comes up with the capped Session in place. ## Tuning knobs ### Cascade scheduler knobs All defaults live in `everos.memory.cascade.orchestrator.CascadeConfig` and `everos.memory.cascade.worker.CascadeWorker`: | Knob | Default | Effect | |---|---|---| | `scan_interval_seconds` | 30 | Scanner sweep cadence | | `worker_batch_size` | 50 | Rows claimed per worker cycle | | `worker_max_retry` | 3 | Inline retries before `mark_failed(retryable=TRUE)` | | `worker_poll_interval_seconds` | 1 | Idle wait between empty drain attempts | | `worker_retry_backoff_seconds` | 2 | Linear backoff seed; doubles per attempt | Tuning surface is intentionally not in `Settings` yet — once we have wall-clock numbers from real workloads, the values that need operator override will surface there. ### LanceDB index cache (`index_cache_size_bytes`) Lives in `LanceDBSettings`; overridable via the `EVEROS_LANCEDB__INDEX_CACHE_SIZE_BYTES` environment variable. This is the only knob that bounds the steady-state file-descriptor count of a long-running EverOS daemon — see [Recovery paths § FD exhaustion](#fd-exhaustion-os-error-24-emfile) for why nothing else (prune, rebuild, `drop_index`) helps. Measured cap → FD ceiling (30 add+optimize cycles + 100-query stress on the real `Episode` schema): | Cap | FD ceiling | Query latency (p50) | Safe under `ulimit -n` | |---|---|---|---| | `2 MB` | ~45 | ~5 ms | macOS default 256 (5× headroom) | | `4 MB` | ~52 | ~3 ms | macOS default 256 | | `8 MB` | ~140 | ~2.4 ms | macOS default 256 (1.8× headroom) | | **`16 MB`** (default) | **~290** | **~2.3 ms** | **Linux default 1024 (3.5× headroom); macOS needs `ulimit -n 1024`** | | `32 MB` | ~630 | ~1.4 ms | Linux default 1024 (1.6× headroom) | | `unbounded` | grows forever | ~1.3 ms | NEVER use in a daemon | EverOS's measured steady-state working set after a `rebuild_indexes` cycle is roughly **50-100 readers / 3-6 MB resident** (5 tables × ~7 BM25 columns × ~10 `part_N` reader entries each), so the 16 MB default provides ~3× headroom for burst traffic and stale-but-not-yet-evicted readers. When to override: - **Tight `ulimit -n` environments** (containers; macOS dev boxes that haven't bumped the default 256) → drop to `4 MB` or `8 MB`. Query latency increases by ~1-3 ms but correctness is unaffected. - **Larger working sets** (many more tables or much wider FTS indexes than the default schema set) → bump to `32-64 MB`. Verify your platform's `ulimit -n` covers the corresponding FD ceiling with at least 2× headroom. - **Diagnostic-only**: set to a tiny value (e.g. `1 MB`) to *force* LRU thrashing and reproduce cache-miss latency in tests. Do **not** set `metadata_cache_size_bytes` — it is intentionally left at LanceDB's default (unbounded) because the metadata cache holds parsed manifests / fragment stats and has zero effect on FD count; capping it just thrashes parsing work without solving anything. ## Concurrency The worker is async, not multi-process. Inside one drain cycle, `asyncio.gather(*[_process_one(row) for row in batch])` runs every claimed row concurrently — cascade is IO-bound (embedding HTTP calls dominate wall time) so single-process coroutine concurrency saturates the bottleneck. The `worker_batch_size` knob (default 50) caps in-flight rows. Multi-process workers are a scaling axis we'd reach for only if a single process becomes CPU-bound, which the current design does not anticipate. `claim_pending_batch` is already race-safe (the ``WHERE status='pending'`` filter ensures each row lands in exactly one batch even if multiple workers raced), so adding processes later is a deployment-side change with no schema work. ## What cascade does NOT do (yet) - **Schema migration**: LanceDB column changes require `rm -rf`. - **Parent-id back-link**: Episode rows currently carry `parent_id=None`; the writer doesn't preserve the source memcell id in the entry inline. Tracked separately. - **Reference-file change detection (agent_skill)**: edits to `references/*.md` siblings won't trigger a re-index — only changes to `SKILL.md` itself fire the watcher. Workaround: run `everos cascade sync agents//skills/skill_/SKILL.md` after editing references.