classifiers.skill_catalog module

SQLite-backed index for Agent Skills (SKILL.md discovery and metadata).

classifiers.skill_catalog.stable_skill_id(skill_root, corpus_root)[source]

Derive a deterministic skill ID from its path relative to the corpus.

Produces a stable, content-independent identifier for a skill so the same skill keeps the same primary key across re-ingests (it depends only on location, not on the skill body). The relative posix path under corpus_root is hashed with SHA-256 and truncated to 32 hex chars; if the skill root is not actually under corpus_root the directory name is used as the fallback basis instead.

A pure path/hash helper with no I/O. Called by classifiers.ingest_skills (to key each upserted row) and exercised by tests/test_skill_catalog.py.

Parameters:

skill_root (Path) – Directory containing the skill (the dir that holds its SKILL.md).
corpus_root (Path) – Root of the skills corpus that skill_root lives under.

Return type:

str

Returns:

A 32-character lowercase hex string usable as the skill’s primary key.
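
The derivation described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual source: the name stable_skill_id_sketch, the UTF-8 encoding of the path, and the absence of any path normalisation are assumptions.

```python
import hashlib
from pathlib import Path


def stable_skill_id_sketch(skill_root: Path, corpus_root: Path) -> str:
    """Hypothetical stand-in: hash the corpus-relative posix path."""
    try:
        basis = skill_root.relative_to(corpus_root).as_posix()
    except ValueError:
        # skill_root is not under corpus_root: fall back to the directory name.
        basis = skill_root.name
    # SHA-256 hex digest, truncated to the first 32 hex characters.
    return hashlib.sha256(basis.encode("utf-8")).hexdigest()[:32]


corpus = Path("/corpus")
a = stable_skill_id_sketch(corpus / "repos" / "demo" / "my-skill", corpus)
b = stable_skill_id_sketch(corpus / "repos" / "demo" / "my-skill", corpus)
print(a == b, len(a))
```

Because only the location feeds the hash, editing the SKILL.md body leaves the ID unchanged, while moving the directory produces a new ID.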

classifiers.skill_catalog.canonical_skill_sort_key(skill_dir, corpus_root)[source]

Rank a skill directory so canonical sources win when deduping.

Returns a sort key that establishes source precedence among otherwise identical skills: git-cloned corpora under repos sort first (bucket 0), npx-installed copies next (bucket 1), everything else after (bucket 2), and anything outside corpus_root last (bucket 99). When several directories share the same body_hash, sorting by this key and keeping the first ensures the git-managed copy is treated as canonical rather than a transient npx install.

A pure path-classification helper with no I/O. Called by classifiers.ingest_skills as the key when sorting discovered skill directories before deduplication. No other in-repo callers were found.

Parameters:

skill_dir (Path) – Directory of the skill being ranked.
corpus_root (Path) – Root of the corpus used to compute the relative path and identify the repos / npx top-level bucket.

Return type:

tuple[int, str]

Returns:

A (bucket, relative_posix_path) tuple; lower buckets and lexicographically smaller paths sort first.
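
A sketch of the bucket ranking, under stated assumptions: the function name sort_key_sketch is hypothetical, the bucket is chosen from the first component of the relative path, and for directories outside corpus_root the second tuple element is assumed to be the full posix path (the source does not say what it is in that case).

```python
from pathlib import Path


def sort_key_sketch(skill_dir: Path, corpus_root: Path) -> tuple[int, str]:
    """Hypothetical stand-in: rank a skill directory by source precedence."""
    try:
        rel = skill_dir.relative_to(corpus_root)
    except ValueError:
        # Outside the corpus: sort last (assumed second element: full path).
        return (99, skill_dir.as_posix())
    top = rel.parts[0] if rel.parts else ""
    bucket = {"repos": 0, "npx": 1}.get(top, 2)
    return (bucket, rel.as_posix())


corpus = Path("/corpus")
dirs = [
    corpus / "npx" / "pdf-tools",
    corpus / "misc" / "pdf-tools",
    Path("/elsewhere/pdf-tools"),
    corpus / "repos" / "acme" / "pdf-tools",
]
ranked = sorted(dirs, key=lambda d: sort_key_sketch(d, corpus))
print([sort_key_sketch(d, corpus)[0] for d in ranked])
```

If all four directories shared a body_hash, keeping ranked[0] would retain the copy under repos.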

classifiers.skill_catalog.discover_skill_dirs(root, *, max_depth=8)[source]

Find every directory under root that directly holds a SKILL.md.

Entry point for skill discovery: it resolves root and runs a depth-bounded recursive walk (see the inner walk closure) that collects each directory containing a SKILL.md and prunes that branch, since skills are not nested. Directories named in _SKIP_DIR_NAMES and hidden dot-directories (except the allow-listed .agents) are ignored so VCS, cache, and dependency trees are not scanned.

Touches the filesystem (directory iteration only); unreadable directories are silently skipped. Called by classifiers.ingest_skills to enumerate the corpus and exercised by tests/test_skill_catalog.py.

Parameters:

root (Path) – Directory to scan recursively for skills.
max_depth (int) – Maximum recursion depth below root (default 8, the value of _MAX_SCAN_DEPTH); deeper directories are not visited.

Return type:

list[Path]

Returns:

The list of directories that directly contain a SKILL.md file.
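
The walk described above might look roughly like this. It is a sketch only: SKIP_DIR_NAMES here is a hypothetical stand-in for _SKIP_DIR_NAMES (whose real contents are not shown in this page), and the exact depth accounting is an assumption.

```python
import tempfile
from pathlib import Path

SKIP_DIR_NAMES = {"node_modules", "__pycache__"}  # hypothetical contents
ALLOWED_HIDDEN = {".agents"}  # the allow-listed dot-directory


def discover_sketch(root: Path, *, max_depth: int = 8) -> list[Path]:
    found: list[Path] = []

    def walk(directory: Path, depth: int) -> None:
        if depth > max_depth:
            return
        try:
            if (directory / "SKILL.md").is_file():
                found.append(directory)
                return  # skills are not nested, so prune this branch
            children = sorted(p for p in directory.iterdir() if p.is_dir())
        except OSError:
            return  # unreadable directory: skip silently
        for child in children:
            if child.name in SKIP_DIR_NAMES:
                continue
            if child.name.startswith(".") and child.name not in ALLOWED_HIDDEN:
                continue
            walk(child, depth + 1)

    walk(root.resolve(), 0)
    return found


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    for rel in ("repos/a-skill", ".agents/b-skill", ".git/hidden-skill",
                "node_modules/dep-skill"):
        (base / rel).mkdir(parents=True)
        (base / rel / "SKILL.md").write_text("# skill\n", encoding="utf-8")
    names = sorted(p.name for p in discover_sketch(base))
print(names)
```

In this throwaway tree only the skills under repos and .agents are reported; the copies under .git and node_modules are never visited.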

classifiers.skill_catalog.init_db(db_path)[source]

Create the SQLite skills table (and its parent dir) if absent.

Idempotent schema bootstrap for the skill catalog database. It ensures the parent directory exists, opens db_path, and runs a CREATE TABLE IF NOT EXISTS skills so subsequent upsert_skill() and load_* calls have a table to work against. Safe to call repeatedly.

Touches the filesystem and the SQLite database (creates directories, connects, commits DDL, then closes the connection); no other I/O. Called by classifiers.ingest_skills before populating the catalog and by tests/test_skill_catalog.py.

Parameters:

db_path (Path) – Path to the SQLite database file to initialise.

Return type:

None
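
An idempotent bootstrap of this kind can be sketched as below. The column names are taken from the row keys documented for upsert_skill(); the column types, the primary-key declaration, and the name init_db_sketch are assumptions rather than the module's actual schema.

```python
import sqlite3
import tempfile
from pathlib import Path

# Assumed schema: names follow the documented row keys, types are guesses.
SCHEMA = """
CREATE TABLE IF NOT EXISTS skills (
    skill_id      TEXT PRIMARY KEY,
    name          TEXT,
    description   TEXT,
    skill_md_path TEXT,
    skill_root    TEXT,
    corpus_root   TEXT,
    body_hash     TEXT,
    ingested_at   REAL
)
"""


def init_db_sketch(db_path: Path) -> None:
    db_path.parent.mkdir(parents=True, exist_ok=True)  # ensure parent dir
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(SCHEMA)
        conn.commit()
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    db = Path(tmp) / "nested" / "skills.db"
    init_db_sketch(db)
    init_db_sketch(db)  # second call changes nothing
    conn = sqlite3.connect(db)
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    conn.close()
print(tables)
```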

classifiers.skill_catalog.upsert_skill(db_path, row)[source]

Insert or replace a single skill row in the catalog database.

Upserts one skill keyed by skill_id using INSERT OR REPLACE so a re-ingested skill overwrites its prior row rather than duplicating it. The ingested_at timestamp defaults to time.time() when the caller does not supply one, recording when the row was last written.

Touches the SQLite database (connects, executes the upsert, commits, and closes); assumes the skills table already exists via init_db(). Called by classifiers.ingest_skills for each discovered skill. No other in-repo callers were found.

Parameters:

db_path (Path) – Path to the SQLite catalog database.
row (dict[str, Any]) – Mapping with skill_id, name, description, skill_md_path, skill_root, corpus_root, and body_hash keys, plus an optional ingested_at epoch float.

Return type:

None
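
The overwrite-on-reingest behaviour can be illustrated with a small sketch. The function name upsert_sketch, the table definition in the demo, and the use of named placeholders are assumptions; only INSERT OR REPLACE keyed on skill_id and the time.time() default come from the description above.

```python
import sqlite3
import tempfile
import time
from pathlib import Path
from typing import Any

COLUMNS = ("skill_id", "name", "description", "skill_md_path",
           "skill_root", "corpus_root", "body_hash", "ingested_at")
UPSERT_SQL = "INSERT OR REPLACE INTO skills ({}) VALUES ({})".format(
    ", ".join(COLUMNS), ", ".join(":" + c for c in COLUMNS))


def upsert_sketch(db_path: Path, row: dict[str, Any]) -> None:
    values = dict(row)
    values.setdefault("ingested_at", time.time())  # default write timestamp
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(UPSERT_SQL, values)
        conn.commit()
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    db = Path(tmp) / "skills.db"
    conn = sqlite3.connect(db)
    # Assumed table shape; the real one is created by init_db().
    conn.execute("CREATE TABLE skills (skill_id TEXT PRIMARY KEY, name TEXT, "
                 "description TEXT, skill_md_path TEXT, skill_root TEXT, "
                 "corpus_root TEXT, body_hash TEXT, ingested_at REAL)")
    conn.commit()
    conn.close()

    base = {"skill_id": "abc", "description": "d", "skill_md_path": "p",
            "skill_root": "r", "corpus_root": "c", "body_hash": "h"}
    upsert_sketch(db, {**base, "name": "first"})
    upsert_sketch(db, {**base, "name": "second"})  # same key: replaces the row

    conn = sqlite3.connect(db)
    stored = conn.execute("SELECT name, ingested_at FROM skills").fetchall()
    conn.close()
print(len(stored), stored[0][0])
```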

classifiers.skill_catalog.load_skill_by_id(db_path, skill_id)[source]

Load a single skill’s catalog row by its ID.

Point lookup against the skills table used to resolve a skill’s metadata (including its skill_md_path on disk) from the stable ID. A missing database file or a missing row both yield None rather than raising, so callers can treat “unknown skill” uniformly.

Touches the SQLite database for reading only (connects, runs a SELECT, then closes the connection). Called by the activate_skill tool to fetch the row before reading the skill body, and exercised by tests/test_skill_catalog.py.

Parameters:

db_path (Path) – Path to the SQLite catalog database.
skill_id (str) – Stable skill identifier (see stable_skill_id()).

Return type:

dict[str, Any] | None

Returns:

A dict of the skill’s columns (without ingested_at), or None if the database file or the row does not exist.
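
A sketch of the lookup and its two None cases follows. The name load_by_id_sketch, the explicit column list, and the use of sqlite3.Row are assumptions; the demo table mirrors the row keys documented for upsert_skill().

```python
import sqlite3
import tempfile
from pathlib import Path
from typing import Any, Optional

SELECT_SQL = ("SELECT skill_id, name, description, skill_md_path, skill_root, "
              "corpus_root, body_hash FROM skills WHERE skill_id = ?")


def load_by_id_sketch(db_path: Path, skill_id: str) -> Optional[dict[str, Any]]:
    if not db_path.exists():
        return None  # missing database file: treated as "unknown skill"
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # lets dict(record) use column names
    try:
        record = conn.execute(SELECT_SQL, (skill_id,)).fetchone()
    finally:
        conn.close()
    return dict(record) if record is not None else None


with tempfile.TemporaryDirectory() as tmp:
    db = Path(tmp) / "skills.db"
    before = load_by_id_sketch(db, "abc")  # file does not exist yet

    conn = sqlite3.connect(db)
    conn.execute("CREATE TABLE skills (skill_id TEXT PRIMARY KEY, name TEXT, "
                 "description TEXT, skill_md_path TEXT, skill_root TEXT, "
                 "corpus_root TEXT, body_hash TEXT, ingested_at REAL)")
    conn.execute("INSERT INTO skills VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
                 ("abc", "demo", "d", "/x/SKILL.md", "/x", "/", "h", 0.0))
    conn.commit()
    conn.close()

    hit = load_by_id_sketch(db, "abc")
    miss = load_by_id_sketch(db, "nope")
print(before, miss, sorted(hit))
```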

classifiers.skill_catalog.load_all_skills(db_path)[source]

Return every skill’s metadata row from the catalog database.

Full-table scan of skills used wherever the whole catalog is needed: counting ingested skills, building embeddings over all skills, and end-to-end verification. A missing database file yields an empty list rather than raising.

Touches the SQLite database (opens, runs a SELECT of all rows, closes). Called by classifiers.ingest_skills and classifiers.update_skill_embeddings (to embed skills), by scripts/skills_corpus_pipeline.py and scripts/verify_npx_skills_e2e.py, and by tests/test_skill_catalog.py.

Parameters:

db_path (Path) – Path to the SQLite catalog database.

Return type:

list[dict[str, Any]]

Returns:

A list of per-skill dicts (each without ingested_at); empty if the database file does not exist.

classifiers.skill_catalog.read_skill_body(skill_md_path)[source]

Read a SKILL.md and split its markdown body from the raw text.

Loads the file and strips the leading --- fenced YAML frontmatter, returning both the body alone (for presenting/activating the skill) and the untouched full text (for callers that still need the frontmatter). When no frontmatter is present the whole file is treated as the body.

Touches the filesystem (reads skill_md_path); no other I/O. Called by the activate_skill tool when surfacing a skill’s instructions. No other in-repo callers were found.

Parameters:

skill_md_path (Path) – Path to the SKILL.md file to read.

Return type:

tuple[str, str]

Returns:

The markdown body with frontmatter removed and whitespace-stripped, and the original unmodified file contents.
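
The frontmatter split can be sketched on a plain string, leaving the file read (for example Path.read_text) to the caller. The name split_frontmatter_sketch is hypothetical, and two behaviours here are assumptions: the delimiters are matched as whole lines consisting of ---, and an unterminated frontmatter block is treated as having no frontmatter.

```python
def split_frontmatter_sketch(raw: str) -> tuple[str, str]:
    """Hypothetical stand-in: return (body_without_frontmatter, raw_text)."""
    lines = raw.splitlines()
    if lines and lines[0].strip() == "---":
        # Look for the closing fence line after the opening one.
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                return "\n".join(lines[i + 1:]).strip(), raw
    # No (terminated) frontmatter: the whole file is the body.
    return raw.strip(), raw


text = "---\nname: demo\ndescription: A demo skill\n---\n\n# Demo\n\nDo the thing.\n"
body, full = split_frontmatter_sketch(text)
print(repr(body))
```

Returning the untouched text alongside the body lets a caller re-parse the frontmatter without reading the file a second time.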

classifiers.skill_catalog.skill_embedding_text(name, description)[source]

Build the text representation of a skill for semantic embedding.

Joins a skill’s name and description into the single string that gets embedded for vector search, so a query can be matched against skills by meaning. Keeping this in one helper ensures ingestion and lookup embed skills identically. The format is the tier-1 style: the name first, then the description, separated by a newline.

A pure string helper with no I/O. Called by classifiers.update_skill_embeddings when computing the embedding for each skill row. No other in-repo callers were found.

Parameters:

name (str) – The skill’s name.
description (str) – The skill’s description.

Return type:

str

Returns:

The "name\ndescription" string fed to the embedding model.
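
For illustration, the documented format amounts to something like the following. The name embedding_text_sketch is hypothetical, and whether the real helper trims whitespace or handles empty fields specially is not stated, so this sketch does neither.

```python
def embedding_text_sketch(name: str, description: str) -> str:
    """Hypothetical stand-in: name, newline, description."""
    return f"{name}\n{description}"


text = embedding_text_sketch("pdf-tools", "Extract text and tables from PDF files.")
print(text)
```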