chore: initialize EverOS 1.0.0
md-first memory extraction framework for AI agents. Markdown is the single source of truth; SQLite holds state and LanceDB provides the rebuildable vector + BM25 + scalar index.

The codebase follows a single-direction DDD layering (entrypoints -> service -> memory -> infra, with component / core / config cross-cutting) enforced by import-linter.

Engineering surface:
- Coding conventions in .claude/rules/ (path-scoped) and workflows in .claude/skills/ (/commit, /new-branch, /pr).
- GitHub Actions CI runs make lint + test + integration; pre-commit mirrors the gates locally (ruff, hygiene hooks, gitlint commit-msg).
- Commit messages follow Conventional Commits, enforced by gitlint.
- make lint also enforces datetime two-zone discipline and OpenAPI drift.
docs/cascade_runbook.md
# Cascade Runbook

The cascade daemon keeps LanceDB in sync with the markdown files under
the memory root. Service / entry points only ever write markdown; the
daemon is the **sole** writer of the LanceDB index. This runbook covers
the recurring operational questions.

## What runs where

When `everos server start` boots, the FastAPI lifespan wires four
providers in order:

1. **Metrics**: Prometheus collector.
2. **SQLite**: system DB + schema (`SQLModel.metadata.create_all`).
3. **LanceDB**: async connection + schema verification + FTS indexes.
4. **Cascade**: watcher + scanner + worker, all in-process tasks.

The cascade subsystem itself is three independent loops:

| Loop | Source signal | Effect |
|---|---|---|
| Watcher | `watchdog` filesystem events (sync thread) | `md_change_state.upsert` per registered kind |
| Scanner | Periodic walk (`scan_interval_seconds`, default 30 s) | Same as the watcher; catches changes the watcher missed |
| Worker | `claim_pending_batch` polling (default 1 s when idle) | Handler dispatch → LanceDB upsert / delete |

Every loop talks to the same `md_change_state` SQLite table. The
worker's claim protocol (`pending → processing → done/failed`) ensures
that each row is owned by at most one worker at a time.

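The claim step can be sketched with the standard-library `sqlite3` module. This is a minimal illustration, not the project's implementation: the column names (`path`, `lsn`, `status`) and the function body are assumptions, and only the table name and the `pending → processing` transition come from the text above.

```python
import sqlite3


def claim_pending_batch(conn: sqlite3.Connection, limit: int) -> list:
    """Flip up to `limit` pending rows to processing; return the claimed paths."""
    # Hypothetical schema: (path, lsn, status). Pick the oldest pending rows.
    candidates = [
        row[0]
        for row in conn.execute(
            "SELECT path FROM md_change_state "
            "WHERE status = 'pending' ORDER BY lsn LIMIT ?",
            (limit,),
        )
    ]
    claimed = []
    with conn:  # commit all status flips together
        for path in candidates:
            cur = conn.execute(
                "UPDATE md_change_state SET status = 'processing' "
                "WHERE path = ? AND status = 'pending'",
                (path,),
            )
            # rowcount is 0 if another claimer already took this row.
            if cur.rowcount == 1:
                claimed.append(path)
    return claimed


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE md_change_state (path TEXT PRIMARY KEY, lsn INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO md_change_state VALUES (?, ?, 'pending')",
    [("a.md", 1), ("b.md", 2), ("c.md", 3)],
)
first = claim_pending_batch(conn, 2)
second = claim_pending_batch(conn, 2)
```

The `AND status = 'pending'` guard on the `UPDATE` is what keeps a row from landing in two batches: a second claimer's update matches zero rows.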
## Health: `everos cascade status`

```
queue:
pending: 3
done: 1247
failed (retryable=TRUE): 1 (eligible for `cascade fix --apply`)
failed (retryable=FALSE): 1 (fix md and re-save to recover)
lsn:
max: 1252
last_processed: 1250
lag: 2
```

- `lag > 0` means the worker is behind. Steady state should hover near
  zero; sustained lag points at a slow handler or a stuck retry.
- `failed (retryable=FALSE)` is always user-actionable. Cascade never
  auto-clears these rows because they represent malformed md that the
  user must edit.

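The lag figure is simply `max - last_processed`. A small sketch of how a monitoring script might tell a momentary blip from sustained lag follows; the `LsnStatus` type and `is_falling_behind` helper are hypothetical names for illustration, not part of EverOS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LsnStatus:
    """One sample of the `lsn:` block from `everos cascade status`."""

    max_lsn: int
    last_processed: int

    @property
    def lag(self) -> int:
        return self.max_lsn - self.last_processed


def is_falling_behind(samples: list, threshold: int = 0) -> bool:
    """True only when every sample in the window exceeds the threshold."""
    return bool(samples) and all(s.lag > threshold for s in samples)


sample = LsnStatus(max_lsn=1252, last_processed=1250)
window = [LsnStatus(1252, 1250), LsnStatus(1260, 1260), LsnStatus(1270, 1268)]
```

Here `sample.lag` is 2, matching the output above, and the window is not flagged because the middle sample caught up to zero.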
## Recovering from failures: `everos cascade fix`

`cascade fix` (no flag) lists every failed row. With `--apply`:

1. `UPDATE md_change_state SET status='pending', retry_count=0
   WHERE status='failed' AND retryable=TRUE` (the partial index
   `idx_md_change_retryable` keeps the cost proportional to the
   number of retryable rows).
2. Drain the worker once so the retry runs synchronously.

Retryable failures cover transient embedding / HTTP errors (5xx, 429,
network resets) once the inline `MAX_RETRY=3` is exhausted. The
fix command resets the counter so a working backend gets a clean
start.

`retryable=FALSE` rows require the user to edit the md (typically a
YAML frontmatter issue) and re-save; the watcher picks the change up
naturally.

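The `--apply` step can be reproduced against an in-memory SQLite database. This is a sketch under assumptions: the table layout and the exact index definition are guesses for illustration (only the index name and the `UPDATE` statement come from the text above), and `retryable` is stored as `1` / `0`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE md_change_state (
        path        TEXT PRIMARY KEY,
        status      TEXT NOT NULL,
        retryable   INTEGER NOT NULL DEFAULT 0,
        retry_count INTEGER NOT NULL DEFAULT 0
    );
    -- Partial index: only failed + retryable rows are indexed (assumed shape).
    CREATE INDEX idx_md_change_retryable
        ON md_change_state (path)
        WHERE status = 'failed' AND retryable = 1;
    """
)
conn.executemany(
    "INSERT INTO md_change_state VALUES (?, ?, ?, ?)",
    [
        ("a.md", "failed", 1, 3),  # transient error, retries exhausted
        ("b.md", "failed", 0, 0),  # malformed md, needs a user edit
        ("c.md", "done", 0, 0),
    ],
)


def requeue_retryable(conn: sqlite3.Connection) -> int:
    """Reset retryable failures to pending; return how many rows changed."""
    with conn:
        cur = conn.execute(
            "UPDATE md_change_state SET status = 'pending', retry_count = 0 "
            "WHERE status = 'failed' AND retryable = 1"
        )
    return cur.rowcount


requeued = requeue_retryable(conn)
statuses = dict(conn.execute("SELECT path, status FROM md_change_state"))
```

Only `a.md` moves back to `pending`; the non-retryable row stays `failed` until its markdown is fixed.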
## One-shot replay: `everos cascade sync [PATH]`

Use this when the watcher missed an event (WSL mount, network share,
external editor with no inotify) or when you want a deterministic
flush before, say, a smoke test:

```bash
everos cascade sync                          # drain everything pending
everos cascade sync users/u1/episodes/X.md   # re-enqueue + drain
```

The CLI builds the same `CascadeOrchestrator` as the daemon but only
calls `sync_once` / `drain_once`; it starts no watcher or scanner
background task, so it is safe to run in parallel with a live
`everos server`.

## Recovery paths

### LanceDB schema drift on startup

`LanceDBLifespanProvider.startup` calls `verify_business_schemas`. If
an on-disk table has columns the current Pydantic schema does not
declare (or vice versa), the boot fails with:

```
LanceDB table 'episode' schema drift: missing=[...], extra=[...].
The index is rebuildable from md — recover with
`rm -rf ~/.everos/.index/lancedb` and restart.
```

This is the documented recovery: delete the index and restart the
server. The scanner picks up every md file on its first sweep and the
worker repopulates LanceDB. Markdown is the source of truth, so no
data is lost.

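The drift check amounts to a two-way set difference between declared and on-disk column names. The sketch below assumes that `missing` means "declared in the schema but absent on disk" and `extra` means the reverse; the function name and the sample column names are hypothetical, and no LanceDB API is involved.

```python
def schema_drift(declared: set, on_disk: set) -> tuple:
    """Return (missing, extra) column names, each sorted for stable messages."""
    missing = sorted(declared - on_disk)  # assumed: declared but not on disk
    extra = sorted(on_disk - declared)    # assumed: on disk but not declared
    return missing, extra


# Illustrative column names only.
declared = {"id", "user_id", "summary", "vector"}
on_disk = {"id", "user_id", "summary", "legacy_score"}

missing, extra = schema_drift(declared, on_disk)
has_drift = bool(missing or extra)
```

Either list being non-empty is enough to refuse to boot, which is why both adding and removing a column lead to the same recovery path.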
### inotify watch-limit exhaustion (Linux)

The kernel default can be as low as 8192 watches per user, and inotify
needs one watch per directory. On a sizeable memory root the watcher
may silently miss events. Symptoms:

- Scanner catches the file changes but the watcher never logs an
  event for the same path.
- `cat /proc/sys/fs/inotify/max_user_watches` reports a limit that the
  number of watched directories has reached.

Fix by bumping the kernel parameter:

```bash
echo fs.inotify.max_user_watches=524288 | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
```

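To judge how close a memory root is to the limit, one option is to compare its directory count with the kernel setting. This is a rough sketch, assuming one inotify watch per directory under a recursive watch; the helper names are hypothetical, and other processes owned by the same user also consume watches.

```python
import os
from pathlib import Path
from typing import Optional

INOTIFY_LIMIT_FILE = Path("/proc/sys/fs/inotify/max_user_watches")


def count_directories(root: str) -> int:
    """Count `root` plus every nested directory beneath it."""
    # os.walk yields exactly one tuple per directory, including the root.
    return sum(1 for _ in os.walk(root))


def read_watch_limit() -> Optional[int]:
    """Return the per-user watch limit, or None when it cannot be read."""
    try:
        return int(INOTIFY_LIMIT_FILE.read_text().strip())
    except (OSError, ValueError):
        return None  # not Linux, or procfs unavailable


def watch_headroom(root: str) -> Optional[int]:
    """Watches left after covering `root`; negative means the limit is exceeded."""
    limit = read_watch_limit()
    return None if limit is None else limit - count_directories(root)
```

Calling `watch_headroom("/path/to/memory/root")` with a small or negative result suggests raising `fs.inotify.max_user_watches` before relying on the watcher.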
### WSL2 / network mounts

Filesystem events do not propagate from the Windows host into WSL2
(or across most SMB / NFS shares). The watcher will start without
error and silently see nothing.

Workarounds:

- Rely on the scanner: at the default 30 s interval, changes are
  picked up with a bounded delay and the index is eventually
  consistent.
- Drop the scan interval to ~5 s if the memory root is small.
- Run `everos cascade sync` explicitly after batch edits.

### Daemon process crash mid-batch

`claim_pending_batch` flips rows to `processing` *atomically*. If the
process dies before `mark_done` / `mark_failed`, those rows stay in
`processing` until the next boot. **The orchestrator auto-recovers**
on startup: `CascadeOrchestrator.start` calls
`md_change_state_repo.recover_orphan_processing()` before launching
the watcher / scanner / worker, which resets every `processing` row
back to `pending`. Because cascade is single-process, there is no
race: at boot time no other worker could legitimately own a
`processing` row.

No operator action is required; the structured log line
`cascade_recovered_orphan_processing` reports the count when it
fires.

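The boot-time recovery reduces to a single `UPDATE`. The sketch below is an assumption about what such a reset could look like on a simplified two-column table; it is not the repository method itself, and only the function name and the `processing → pending` reset come from the text above.

```python
import sqlite3


def recover_orphan_processing(conn: sqlite3.Connection) -> int:
    """Reset rows stranded in `processing` by a crash; return the count."""
    with conn:
        cur = conn.execute(
            "UPDATE md_change_state SET status = 'pending' "
            "WHERE status = 'processing'"
        )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
# Simplified, hypothetical table layout.
conn.execute(
    "CREATE TABLE md_change_state (path TEXT PRIMARY KEY, status TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO md_change_state VALUES (?, ?)",
    [("a.md", "processing"), ("b.md", "processing"), ("c.md", "done")],
)

recovered = recover_orphan_processing(conn)
still_processing = conn.execute(
    "SELECT COUNT(*) FROM md_change_state WHERE status = 'processing'"
).fetchone()[0]
```

Running it a second time returns 0, so the reset is safe to execute unconditionally on every start.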
### FD exhaustion (`os error 24` / EMFILE)

Symptoms (any of these on a long-running daemon):

- LanceDB query / index build fails with `lance error: ... Too many
  open files (os error 24)`.
- `lsof -p <pid> | wc -l` grows monotonically over hours / days.
- Health log lines like `cascade_lancedb_optimize_failed` /
  `cascade_lancedb_rebuild_failed` carrying `OSError: [Errno 24]`.

Cause (verified against `lance crate 4.0`): the LanceDB *index* cache
(`GlobalIndexCache`) holds one reader object per opened FTS / vector
/ scalar index, and each reader pins the file descriptors of its
`_indices/<uuid>/...` files. With a long-running daemon and
steady-state cascade ingest, every `optimize()` call adds new readers.
Under LanceDB's own default (`index_cache_size_bytes=None`,
unbounded), they **are never evicted** and the FDs leak monotonically.

`drop_index` does **not** help: it is a manifest-only operation and
leaves the on-disk UUID directories untouched. Even an explicit
`optimize(cleanup_older_than=0)` that `unlink()`s the files does not
release FDs: POSIX keeps the inode alive as long as a process holds
an open FD on it (the entries show as `(deleted)` in `lsof`). Only an
LRU eviction inside the cache (or a connection close) actually closes
the FDs.

Fix (already wired in `LanceDBSettings.index_cache_size_bytes`,
default 16 MB, ~290 FD ceiling): see
[Tuning knobs § LanceDB index cache](#lancedb-index-cache-index_cache_size_bytes)
for the sizing table and the env-var override path.

If you have already hit EMFILE in a running process, the cleanest
recovery is a daemon restart: the open connection closes, every FD
is released, and the next start comes up with the capped Session in
place.

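The POSIX behaviour described above (an unlinked file stays readable while a descriptor is open) is easy to confirm in isolation. This standalone demonstration uses only the standard library and does not touch LanceDB; it assumes a POSIX system such as Linux or macOS.

```python
import os
import tempfile

# Create a scratch file and keep its descriptor open.
fd, path = tempfile.mkstemp()
os.write(fd, b"index bytes")

# Remove the directory entry, as a cleanup pass would.
os.unlink(path)
exists_after_unlink = os.path.exists(path)

# The inode is still alive: the open descriptor can read the data back.
os.lseek(fd, 0, os.SEEK_SET)
data_after_unlink = os.read(fd, 64)

# Only closing the descriptor lets the kernel reclaim the inode and the FD slot.
os.close(fd)
```

The path is gone after `unlink`, yet the read still returns the original bytes, which is why deleting index files does not lower the descriptor count of a process that still holds readers on them.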
## Tuning knobs

### Cascade scheduler knobs

All defaults live in `everos.memory.cascade.orchestrator.CascadeConfig`
and `everos.memory.cascade.worker.CascadeWorker`:

| Knob | Default | Effect |
|---|---|---|
| `scan_interval_seconds` | 30 | Scanner sweep cadence |
| `worker_batch_size` | 50 | Rows claimed per worker cycle |
| `worker_max_retry` | 3 | Inline retries before `mark_failed(retryable=TRUE)` |
| `worker_poll_interval_seconds` | 1 | Idle wait between empty drain attempts |
| `worker_retry_backoff_seconds` | 2 | Backoff seed; doubles per attempt (exponential) |

The tuning surface is intentionally not in `Settings` yet. Once we
have wall-clock numbers from real workloads, the values that need
operator override will surface there.

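With a seed of 2 s that doubles per attempt and 3 inline retries, the waits can be tabulated directly. This sketch assumes the first retry waits exactly the seed; the function name is hypothetical and the real worker may add jitter or cap the delay.

```python
def backoff_schedule(seed_seconds: float = 2.0, max_retry: int = 3) -> list:
    """Delay before each retry attempt: seed, 2*seed, 4*seed, ..."""
    return [seed_seconds * (2 ** attempt) for attempt in range(max_retry)]


delays = backoff_schedule()
total_wait = sum(delays)
```

Under these assumptions the delays are 2, 4 and 8 seconds, so a row spends about 14 seconds in backoff before it is marked `failed` with `retryable=TRUE`.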
### LanceDB index cache (`index_cache_size_bytes`)

Lives in `LanceDBSettings`; overridable via the
`EVEROS_LANCEDB__INDEX_CACHE_SIZE_BYTES` environment variable. This
is the only knob that bounds the steady-state file-descriptor count
of a long-running EverOS daemon. See
[Recovery paths § FD exhaustion](#fd-exhaustion-os-error-24-emfile)
for why nothing else (prune, rebuild, `drop_index`) helps.

Measured cap → FD ceiling (30 add+optimize cycles + 100-query stress
on the real `Episode` schema):

| Cap | FD ceiling | Query latency (p50) | Safe under `ulimit -n` |
|---|---|---|---|
| `2 MB` | ~45 | ~5 ms | macOS default 256 (5× headroom) |
| `4 MB` | ~52 | ~3 ms | macOS default 256 |
| `8 MB` | ~140 | ~2.4 ms | macOS default 256 (1.8× headroom) |
| **`16 MB`** (default) | **~290** | **~2.3 ms** | **Linux default 1024 (3.5× headroom); macOS needs `ulimit -n 1024`** |
| `32 MB` | ~630 | ~1.4 ms | Linux default 1024 (1.6× headroom) |
| `unbounded` | grows forever | ~1.3 ms | NEVER use in a daemon |

EverOS's measured steady-state working set after a `rebuild_indexes`
cycle is roughly **50-100 readers / 3-6 MB resident** (5 tables × ~7
BM25 columns × ~10 `part_N` reader entries each), so the 16 MB default
provides ~3× headroom for burst traffic and stale-but-not-yet-evicted
readers.

When to override:

- **Tight `ulimit -n` environments** (containers; macOS dev boxes
  that haven't bumped the default 256) → drop to `4 MB` or `8 MB`.
  Query latency increases by ~1-3 ms but correctness is unaffected.
- **Larger working sets** (many more tables or much wider FTS
  indexes than the default schema set) → bump to `32-64 MB`. Verify
  your platform's `ulimit -n` covers the corresponding FD ceiling
  with at least 2× headroom.
- **Diagnostic-only**: set to a tiny value (e.g. `1 MB`) to
  *force* LRU thrashing and reproduce cache-miss latency in tests.

Do **not** set `metadata_cache_size_bytes`. It is intentionally left
at LanceDB's default (unbounded) because the metadata cache holds
parsed manifests / fragment stats and has zero effect on FD count;
capping it just thrashes parsing work without solving anything.

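The "at least 2× headroom" rule can be checked mechanically against the measured ceilings in the table above. The helper names below are hypothetical, and the lookup only covers the cap sizes that were measured; `resource.getrlimit` is the standard-library call for reading the current `ulimit -n` on Unix.

```python
import resource
from typing import Optional

# Measured FD ceilings from the table above, keyed by cache cap in MB.
FD_CEILING_BY_CAP_MB = {2: 45, 4: 52, 8: 140, 16: 290, 32: 630}


def headroom(cap_mb: int, soft_limit: int) -> float:
    """Ratio of the FD limit to the measured ceiling for this cap."""
    return soft_limit / FD_CEILING_BY_CAP_MB[cap_mb]


def largest_safe_cap(soft_limit: int, min_headroom: float = 2.0) -> Optional[int]:
    """Largest measured cap (MB) that keeps at least `min_headroom`, else None."""
    safe = [
        cap
        for cap in sorted(FD_CEILING_BY_CAP_MB)
        if headroom(cap, soft_limit) >= min_headroom
    ]
    return safe[-1] if safe else None


def current_soft_limit() -> int:
    """Soft RLIMIT_NOFILE of this process (what `ulimit -n` reports)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft
```

For example, `largest_safe_cap(1024)` picks 16 MB and `largest_safe_cap(256)` picks 4 MB, in line with the override guidance above; pass `current_soft_limit()` to evaluate the machine the daemon actually runs on.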
## Concurrency

The worker is async, not multi-process. Inside one drain cycle,
`asyncio.gather(*[_process_one(row) for row in batch])` runs every
claimed row concurrently. Cascade is IO-bound (embedding HTTP calls
dominate wall time), so single-process coroutine concurrency saturates
the bottleneck. The `worker_batch_size` knob (default 50) caps
in-flight rows.

Multi-process workers are a scaling axis we'd reach for only if a
single process becomes CPU-bound, which the current design does not
anticipate. `claim_pending_batch` is already race-safe (the
`WHERE status='pending'` filter ensures each row lands in exactly
one batch even if multiple workers race), so adding processes later
is a deployment-side change with no schema work.

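The drain cycle's concurrency pattern looks roughly like the sketch below. The handler body is a stand-in (a short sleep instead of an embedding HTTP call), and `return_exceptions=True` is an assumption added here so that one failing row does not abort the batch; the text above does not say whether the worker uses it.

```python
import asyncio


async def process_one(row: str) -> str:
    """Stand-in handler: the sleep simulates an IO-bound embedding call."""
    await asyncio.sleep(0.01)
    return f"done:{row}"


async def drain_once(batch: list) -> list:
    """Run every claimed row concurrently; results keep the batch order."""
    return await asyncio.gather(
        *(process_one(row) for row in batch),
        return_exceptions=True,  # assumption: collect failures per row
    )


results = asyncio.run(drain_once(["a.md", "b.md", "c.md"]))
```

Because the rows wait on IO at the same time, the batch takes about as long as its slowest row, not the sum of all rows, which is the reason a single process is sufficient while the work stays IO-bound.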
## What cascade does NOT do (yet)

- **Schema migration**: LanceDB column changes require deleting the
  index (`rm -rf ~/.everos/.index/lancedb`) and letting cascade
  rebuild it from md.
- **Parent-id back-link**: Episode rows currently carry
  `parent_id=None`; the writer doesn't preserve the source memcell id
  in the entry inline. Tracked separately.
- **Reference-file change detection (agent_skill)**: edits to
  `references/*.md` siblings won't trigger a re-index; only changes
  to `SKILL.md` itself fire the watcher. Workaround: run
  `everos cascade sync agents/<a>/skills/skill_<n>/SKILL.md` after
  editing references.