md-first memory extraction framework for AI agents. Markdown is the single source of truth; SQLite holds state and LanceDB provides the rebuildable vector + BM25 + scalar index. The codebase follows a single-direction DDD layering (entrypoints -> service -> memory -> infra, with component / core / config cross-cutting) enforced by import-linter. Engineering surface: - Coding conventions in .claude/rules/ (path-scoped) and workflows in .claude/skills/ (/commit, /new-branch, /pr). - GitHub Actions CI runs make lint + test + integration; pre-commit mirrors the gates locally (ruff, hygiene hooks, gitlint commit-msg). - Commit messages follow Conventional Commits, enforced by gitlint. - make lint also enforces datetime two-zone discipline and OpenAPI drift.
18 lines
550 B
Python
18 lines
550 B
Python
"""Factory for the cascade-time tokenizer.
|
|
|
|
Single implementation today (``JiebaTokenizer``). Lifting this into a
|
|
factory keeps callers (cascade handler) decoupled from the concrete
|
|
choice, so swapping to char-bigram / hf tokenizer later is a one-file
|
|
change — see ``17_lancedb_tables_design.md`` §2.4.1.
|
|
"""
|
|
|
|
from __future__ import annotations
|
|
|
|
from .jieba_provider import JiebaTokenizer
|
|
from .protocol import Tokenizer
|
|
|
|
|
|
def build_tokenizer() -> Tokenizer:
|
|
"""Build the default tokenizer (``JiebaTokenizer``)."""
|
|
return JiebaTokenizer()
|