EverOS/docs/cascade_runbook.md
0xVox 9fc6ad20d2 fix(docs): repair dead xrefs in api.md, runbook, skill (#269)
Three internal documentation references pointed at non-existent targets:

- docs/api.md: MessageItem.content linked to #addmessage, which has no
  heading or anchor; corrected to #messageitem (the slug used by every
  other MessageItem cross-reference and matching the ### MessageItem
  heading).
- docs/cascade_runbook.md: the FD-exhaustion cross-ref used a single
  hyphen where the GitHub slug of "FD exhaustion (`os error 24` /
  EMFILE)" has a double hyphen (from the ` / ` separator); corrected to
  #fd-exhaustion-os-error-24--emfile.
- use-cases/claude-code-plugin/skills/memory-tools.md: the always-injected
  skill named two tools (search_memories, get_memory) that the MCP server
  never exposes; replaced with the real evermem_search tool and its
  params (query required, limit default 10 / max 20).

Markdown-only; no runtime behavior change.
2026-06-08 07:10:56 +08:00

# Cascade Runbook
The cascade daemon keeps LanceDB in sync with the markdown files under
the memory root. Service / entry points only ever write markdown; the
daemon is the **sole** writer of the LanceDB index. This runbook covers
the recurring operational questions.
## What runs where
When `everos server start` boots, the FastAPI lifespan wires four
providers in order:
1. **Metrics** — Prometheus collector.
2. **SQLite** — system DB + schema (`SQLModel.metadata.create_all`).
3. **LanceDB** — async connection + schema verification + FTS indexes.
4. **Cascade** — watcher + scanner + worker, all in-process tasks.
The cascade subsystem itself is three independent loops:
| Loop | Source signal | Effect |
|---|---|---|
| Watcher | `watchdog` filesystem events (sync thread) | `md_change_state.upsert` per registered kind |
| Scanner | Periodic walk (`scan_interval_seconds`, default 30 s) | Same — catches changes the watcher missed |
| Worker | `claim_pending_batch` polling (default 1 s when idle) | Handler dispatch → LanceDB upsert / delete |
Every loop reads and writes the same `md_change_state` SQLite table. The
worker's claim state machine (`pending → processing → done/failed`)
ensures that each row is owned by only one worker at a time.
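
The claim step can be sketched against an in-memory SQLite table. This is a minimal illustration, not the actual repository code: the three-column schema and the body of `claim_pending_batch` are assumptions made for the example. Only the table name, the status values, and the function name come from this runbook.

```python
import sqlite3


def claim_pending_batch(conn: sqlite3.Connection, limit: int) -> list:
    """Flip up to `limit` pending rows to 'processing' and return their paths.

    Hypothetical sketch: the real implementation may differ.
    """
    claimed = []
    with conn:  # single transaction: commit on success, rollback on error
        rows = conn.execute(
            "SELECT path FROM md_change_state "
            "WHERE status = 'pending' ORDER BY lsn LIMIT ?",
            (limit,),
        ).fetchall()
        for (path,) in rows:
            # The status filter is repeated so a row that changed state
            # in the meantime is not claimed twice.
            cur = conn.execute(
                "UPDATE md_change_state SET status = 'processing' "
                "WHERE path = ? AND status = 'pending'",
                (path,),
            )
            if cur.rowcount == 1:
                claimed.append(path)
    return claimed


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE md_change_state ("
    "path TEXT PRIMARY KEY, lsn INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO md_change_state VALUES (?, ?, 'pending')",
    [("a.md", 1), ("b.md", 2), ("c.md", 3)],
)
first = claim_pending_batch(conn, 2)
second = claim_pending_batch(conn, 2)
print(first, second)
```

A second call only sees rows that are still `pending`, so no row is returned twice.
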
## Health: `everos cascade status`
```
queue:
  pending: 3
  done: 1247
  failed (retryable=TRUE): 1 (eligible for `cascade fix --apply`)
  failed (retryable=FALSE): 1 (fix md and re-save to recover)
lsn:
  max: 1252
  last_processed: 1250
  lag: 2
```
- `lag > 0` means the worker is behind. Steady state should hover near
zero; sustained lag points at a slow handler or a stuck retry.
- `failed (retryable=FALSE)` is always user-actionable. Cascade will
never auto-clear these — they represent malformed md the user must
edit.
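
To tell a transient backlog from sustained lag, a monitoring script could sample the `lsn` section periodically. The sketch below assumes that `lag` is `max` minus `last_processed`, which is consistent with the sample output above. The helper names and the three-sample window are hypothetical.

```python
def lag(max_lsn: int, last_processed: int) -> int:
    """Lag as shown in the status output (assumed: max - last_processed)."""
    return max_lsn - last_processed


def is_sustained(lag_samples: list, window: int = 3) -> bool:
    """True when the last `window` samples were all above zero."""
    recent = lag_samples[-window:]
    return len(recent) == window and all(sample > 0 for sample in recent)


print(lag(1252, 1250))
print(is_sustained([0, 2, 3, 5]))
print(is_sustained([2, 0, 1]))
```
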
## Recovering from failures: `everos cascade fix`
`cascade fix` (no flag) lists every failed row. With `--apply`:
1. `UPDATE md_change_state SET status='pending', retry_count=0
WHERE status='failed' AND retryable=TRUE` (the partial index
`idx_md_change_retryable` makes this O(retryable)).
2. Drain the worker once so the retry runs synchronously.
Retryable failures cover transient embedding / HTTP errors (5xx, 429,
network resets) after the inline `MAX_RETRY=3` was exhausted. The
fix command resets the counter so a working backend gets a clean
start.
`retryable=FALSE` rows require the user to edit the md (typically a
YAML frontmatter issue) and re-save; the watcher picks the change up
naturally.
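
The effect of step 1 can be reproduced on a toy table. This is a sketch under stated assumptions: the column set and the indexed column are invented for the example, and `retryable` is stored as `1`/`0`. Only the table name, the index name, and the shape of the `UPDATE` come from this runbook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE md_change_state (
        path        TEXT PRIMARY KEY,
        status      TEXT NOT NULL,
        retryable   INTEGER,
        retry_count INTEGER NOT NULL DEFAULT 0
    );
    -- Partial index: only failed, retryable rows are indexed.
    CREATE INDEX idx_md_change_retryable ON md_change_state(path)
        WHERE status = 'failed' AND retryable = 1;
    """
)
conn.executemany(
    "INSERT INTO md_change_state VALUES (?, ?, ?, ?)",
    [
        ("a.md", "failed", 1, 3),   # transient error: eligible
        ("b.md", "failed", 0, 3),   # malformed md: must be edited by hand
        ("c.md", "done", None, 0),
    ],
)


def requeue_retryable(conn: sqlite3.Connection) -> int:
    """Reset retryable failures to pending; returns the number of rows reset."""
    with conn:
        cur = conn.execute(
            "UPDATE md_change_state SET status = 'pending', retry_count = 0 "
            "WHERE status = 'failed' AND retryable = 1"
        )
    return cur.rowcount


requeued = requeue_retryable(conn)
statuses = dict(conn.execute("SELECT path, status FROM md_change_state"))
print(requeued, statuses)
```

The non-retryable row is left untouched, which matches the rule that those rows need a manual edit.
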
## One-shot replay: `everos cascade sync [PATH]`
Use this when the watcher missed an event (WSL mount, network share,
external editor with no inotify) or when you want a deterministic
flush before, say, a smoke test:
```bash
everos cascade sync # drain everything pending
everos cascade sync users/u1/episodes/X.md # re-enqueue + drain
```
The CLI builds the same `CascadeOrchestrator` as the daemon but only
calls `sync_once` / `drain_once` — no watcher / scanner background
task. So it's safe to run in parallel with a live `everos server`.
## Recovery paths
### LanceDB schema drift on startup
`LanceDBLifespanProvider.startup` calls `verify_business_schemas`. If
an on-disk table has columns the current Pydantic schema does not
declare (or vice versa), the boot fails with:
```
LanceDB table 'episode' schema drift: missing=[...], extra=[...].
The index is rebuildable from md — recover with
`rm -rf ~/.everos/.index/lancedb` and restart.
```
This is the documented recovery: delete the index and restart the
server. The scanner picks up every md file on its first sweep and
the worker repopulates LanceDB. Markdown is the source of truth, so
no data is lost.
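
The drift check boils down to a set comparison between the on-disk columns and the declared ones. The sketch below is illustrative and is not the body of `verify_business_schemas`. It assumes `missing` means "declared but absent on disk" and `extra` means "on disk but not declared". The function name and column names are made up.

```python
def diff_columns(on_disk: set, declared: set) -> tuple:
    """Return (missing, extra) column names, each sorted for stable output."""
    missing = sorted(declared - on_disk)  # assumed meaning of `missing`
    extra = sorted(on_disk - declared)    # assumed meaning of `extra`
    return missing, extra


on_disk = {"id", "content", "legacy_score"}
declared = {"id", "content", "summary"}
missing, extra = diff_columns(on_disk, declared)
if missing or extra:
    print(f"schema drift: missing={missing}, extra={extra}")
```
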
### inotify watch-limit exhaustion (Linux)
The default kernel limit is 8192 watches per user. On a sizeable memory
root the watcher may silently miss events. Symptoms:
- Scanner catches the file changes but the watcher never logs an
event for the same path.
- The number of watches in use (roughly one per watched directory)
reaches the limit reported by `cat /proc/sys/fs/inotify/max_user_watches`.
Fix by bumping the kernel parameter:
```bash
echo fs.inotify.max_user_watches=524288 | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
```
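
Before raising the limit, it can help to estimate how many watches the memory root needs. The sketch below assumes a recursive watcher registers about one inotify watch per directory. The helper names are hypothetical, and `~/.everos` as the memory root is only inferred from the index path used elsewhere in this runbook.

```python
import os
from pathlib import Path
from typing import Optional


def count_directories(root: Path) -> int:
    """Count `root` plus every directory below it (one os.walk entry each)."""
    return sum(1 for _ in os.walk(root))


def read_watch_limit(
    proc_path: str = "/proc/sys/fs/inotify/max_user_watches",
) -> Optional[int]:
    """Return the per-user watch limit, or None where it is unavailable."""
    try:
        return int(Path(proc_path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    root = Path.home() / ".everos"  # assumed memory root; adjust as needed
    print("directories:", count_directories(root))
    print("watch limit:", read_watch_limit())
```

If the directory count is close to the limit, other processes run by the same user may already have pushed the total over it.
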
### WSL2 / network mounts
Filesystem events do not propagate from the Windows host into WSL2
(or across most SMB / NFS shares). The watcher will start without
error and silently see nothing.
Workarounds:
- Rely on the scanner. At the default 30 s interval, changes can take
up to one sweep to be noticed, but the index is eventually consistent.
- Drop the scan interval to ~5 s if the memory root is small.
- Run `everos cascade sync` explicitly after batch edits.
### Daemon process crash mid-batch
`claim_pending_batch` flips rows to `processing` *atomically*. If the
process dies before `mark_done` / `mark_failed`, those rows stay in
`processing` until the next boot. **The orchestrator auto-recovers**
on startup: `CascadeOrchestrator.start` calls
`md_change_state_repo.recover_orphan_processing()` before launching
the watcher / scanner / worker, which resets every `processing` row
back to `pending`. Single-process cascade means no race — at boot
time no other worker could legitimately own a `processing` row.
No operator action required; the structured log line
`cascade_recovered_orphan_processing` reports the count when it
fires.
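
The recovery step amounts to a single `UPDATE`. The sketch below is a hypothetical stand-in for `recover_orphan_processing()` on a two-column toy table, not the repository's code. It returns the number of rows reset, which is the count the log line would carry.

```python
import sqlite3


def recover_orphan_processing(conn: sqlite3.Connection) -> int:
    """Reset every 'processing' row to 'pending'; returns the row count."""
    with conn:
        cur = conn.execute(
            "UPDATE md_change_state SET status = 'pending' "
            "WHERE status = 'processing'"
        )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE md_change_state (path TEXT PRIMARY KEY, status TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO md_change_state VALUES (?, ?)",
    [("a.md", "processing"), ("b.md", "processing"), ("c.md", "done")],
)
recovered = recover_orphan_processing(conn)
print("recovered:", recovered)
```
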
### FD exhaustion (`os error 24` / EMFILE)
Symptoms (any of these on a long-running daemon):
- LanceDB query / index build fails with `lance error: ... Too many
open files (os error 24)`.
- `lsof -p <pid> | wc -l` grows monotonically over hours / days.
- Health log lines like `cascade_lancedb_optimize_failed` /
`cascade_lancedb_rebuild_failed` carrying `OSError: [Errno 24]`.
Cause (verified against `lance crate 4.0`): the LanceDB *index* cache
(`GlobalIndexCache`) holds one reader object per opened FTS / vector
/ scalar index, and each reader pins the file descriptors of its
`_indices/<uuid>/...` files. With a long-running daemon and
steady-state cascade ingest, every `optimize()` call adds new readers; with
LanceDB's own default (`index_cache_size_bytes=None`, unbounded), they
**are never evicted** and the FDs leak monotonically.
`drop_index` does **not** help — it is a manifest-only operation and
leaves the on-disk UUID directories untouched. Even an explicit
`optimize(cleanup_older_than=0)` `unlink()`-ing the files does not
release FDs: POSIX keeps the inode alive as long as a process holds
an open FD on it (the entries show as `(deleted)` in `lsof`). Only an
LRU eviction inside the cache (or a connection close) actually closes
the FDs.
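
The POSIX behaviour described above is easy to reproduce without LanceDB. The following standalone sketch (POSIX systems only) opens a temporary file, unlinks it, and shows that the descriptor remains usable until it is closed.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"index bytes")

os.unlink(path)  # removes the directory entry only
exists_after_unlink = os.path.exists(path)

# The inode is still alive because this process holds an open descriptor.
os.lseek(fd, 0, os.SEEK_SET)
data = os.read(fd, 64)

os.close(fd)  # only now can the kernel reclaim the inode
print(exists_after_unlink, data)
```

This is why deleting index files on disk does not lower the FD count of a running daemon.
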
Fix (already wired in `LanceDBSettings.index_cache_size_bytes` —
default 16 MB, ~290 FD ceiling): see
[Tuning knobs § LanceDB index cache](#lancedb-index-cache-index_cache_size_bytes)
for the sizing table and the env-var override path.
If you have already hit EMFILE in a running process, the cleanest
recovery is a daemon restart — the open connection closes, every FD
is released, and the next start comes up with the capped Session in
place.
## Tuning knobs
### Cascade scheduler knobs
All defaults live in `everos.memory.cascade.orchestrator.CascadeConfig`
and `everos.memory.cascade.worker.CascadeWorker`:
| Knob | Default | Effect |
|---|---|---|
| `scan_interval_seconds` | 30 | Scanner sweep cadence |
| `worker_batch_size` | 50 | Rows claimed per worker cycle |
| `worker_max_retry` | 3 | Inline retries before `mark_failed(retryable=TRUE)` |
| `worker_poll_interval_seconds` | 1 | Idle wait between empty drain attempts |
| `worker_retry_backoff_seconds` | 2 | Backoff seed; doubles per attempt (exponential) |
Tuning surface is intentionally not in `Settings` yet — once we have
wall-clock numbers from real workloads, the values that need
operator override will surface there.
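
As a rough guide to how long a row can spend in inline retries, the sketch below lists the waits implied by the defaults. It assumes the first wait equals the seed and that each later wait doubles. The exact schedule in `CascadeWorker` may differ, and `backoff_delays` is a hypothetical helper.

```python
def backoff_delays(seed_seconds: float = 2.0, max_retry: int = 3) -> list:
    """Waits before each retry, doubling from the seed (assumed schedule)."""
    return [seed_seconds * (2 ** attempt) for attempt in range(max_retry)]


delays = backoff_delays()
print(delays, sum(delays))
```
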
### LanceDB index cache (`index_cache_size_bytes`)
Lives in `LanceDBSettings`; overridable via the
`EVEROS_LANCEDB__INDEX_CACHE_SIZE_BYTES` environment variable. This
is the only knob that bounds the steady-state file-descriptor count
of a long-running EverOS daemon — see
[Recovery paths § FD exhaustion](#fd-exhaustion-os-error-24--emfile)
for why nothing else (prune, rebuild, `drop_index`) helps.
Measured cap → FD ceiling (30 add+optimize cycles + 100-query stress
on the real `Episode` schema):
| Cap | FD ceiling | Query latency (p50) | Safe under `ulimit -n` |
|---|---|---|---|
| `2 MB` | ~45 | ~5 ms | macOS default 256 (5× headroom) |
| `4 MB` | ~52 | ~3 ms | macOS default 256 |
| `8 MB` | ~140 | ~2.4 ms | macOS default 256 (1.8× headroom) |
| **`16 MB`** (default) | **~290** | **~2.3 ms** | **Linux default 1024 (3.5× headroom); macOS needs `ulimit -n 1024`** |
| `32 MB` | ~630 | ~1.4 ms | Linux default 1024 (1.6× headroom) |
| `unbounded` | grows forever | ~1.3 ms | NEVER use in a daemon |
EverOS's measured steady-state working set after a `rebuild_indexes`
cycle is roughly **50-100 readers / 3-6 MB resident** (5 tables × ~7
BM25 columns × ~10 `part_N` reader entries each), so the 16 MB default
provides ~3× headroom for burst traffic and stale-but-not-yet-evicted
readers.
When to override:
- **Tight `ulimit -n` environments** (containers; macOS dev boxes
that haven't bumped the default 256) → drop to `4 MB` or `8 MB`.
Query latency increases by ~1-3 ms but correctness is unaffected.
- **Larger working sets** (many more tables or much wider FTS
indexes than the default schema set) → bump to `32-64 MB`. Verify
your platform's `ulimit -n` covers the corresponding FD ceiling
with at least 2× headroom.
- **Diagnostic-only**: set to a tiny value (e.g. `1 MB`) to
*force* LRU thrashing and reproduce cache-miss latency in tests.
Do **not** set `metadata_cache_size_bytes` — it is intentionally left
at LanceDB's default (unbounded) because the metadata cache holds
parsed manifests / fragment stats and has zero effect on FD count;
capping it just thrashes parsing work without solving anything.
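
The sizing table can be turned into a quick check against a given `ulimit -n`. The sketch below hard-codes the measured FD ceilings from the table. The helper names are hypothetical, and the 2× minimum headroom follows the guidance for larger working sets above.

```python
from typing import Optional

# Cap in MB -> measured FD ceiling, copied from the sizing table.
FD_CEILING_BY_CAP_MB = {2: 45, 4: 52, 8: 140, 16: 290, 32: 630}


def fd_headroom(cap_mb: int, nofile_limit: int) -> float:
    """How many times the FD ceiling fits into the open-file limit."""
    return nofile_limit / FD_CEILING_BY_CAP_MB[cap_mb]


def largest_safe_cap(nofile_limit: int, min_headroom: float = 2.0) -> Optional[int]:
    """Largest measured cap (MB) that keeps at least `min_headroom`."""
    safe = [
        cap
        for cap, ceiling in FD_CEILING_BY_CAP_MB.items()
        if nofile_limit / ceiling >= min_headroom
    ]
    return max(safe) if safe else None


print(round(fd_headroom(16, 1024), 1))  # Linux default
print(largest_safe_cap(1024))
print(largest_safe_cap(256))            # macOS default
```

The chosen cap can then be applied through `EVEROS_LANCEDB__INDEX_CACHE_SIZE_BYTES`.
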
## Concurrency
The worker is async, not multi-process. Inside one drain cycle,
`asyncio.gather(*[_process_one(row) for row in batch])` runs every
claimed row concurrently — cascade is IO-bound (embedding HTTP calls
dominate wall time) so single-process coroutine concurrency saturates
the bottleneck. The `worker_batch_size` knob (default 50) caps
in-flight rows.
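
The shape of one drain cycle can be sketched with plain `asyncio`. Here `process_one` is a hypothetical stand-in for `_process_one`, with a short sleep in place of the embedding HTTP call. Error handling, which the real worker needs for `mark_failed`, is omitted.

```python
import asyncio


async def process_one(row: str) -> str:
    """Stand-in handler: pretend to embed the row, then report it done."""
    await asyncio.sleep(0.01)  # simulated IO wait
    return f"done:{row}"


async def drain_once(batch: list) -> list:
    """Run every claimed row concurrently; results keep the batch order."""
    return await asyncio.gather(*[process_one(row) for row in batch])


results = asyncio.run(drain_once(["a.md", "b.md", "c.md"]))
print(results)
```

Because the rows wait on IO concurrently, the cycle takes roughly as long as its slowest row rather than the sum of all rows.
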
Multi-process workers are a scaling axis we'd reach for only if a
single process becomes CPU-bound, which the current design does not
anticipate. `claim_pending_batch` is already race-safe (the
`WHERE status='pending'` filter ensures each row lands in exactly
one batch even if multiple workers raced), so adding processes later
is a deployment-side change with no schema work.
## What cascade does NOT do (yet)
- **Schema migration**: LanceDB column changes require `rm -rf`.
- **Parent-id back-link**: Episode rows currently carry
`parent_id=None`; the writer doesn't preserve the source memcell id
in the entry inline. Tracked separately.
- **Reference-file change detection (agent_skill)**: edits to
`references/*.md` siblings won't trigger a re-index — only changes
to `SKILL.md` itself fire the watcher. Workaround: run
`everos cascade sync agents/<a>/skills/skill_<n>/SKILL.md` after
editing references.