classifiers.ingest_skills module

Scan corpus roots for SKILL.md files and populate the SQLite skills index.

CLI and helper for the skills-corpus ingest stage: it walks one or more corpus roots, discovers every SKILL.md under them, and upserts the parsed metadata into the on-disk SQLite skills index that the runtime skill catalog reads from. Body-hash dedupe, enabled by default and switchable off, keeps identical copies of the same skill from being indexed more than once.

Heavy lifting lives in classifiers.skill_catalog (discovery, parsing, stable id generation, and the database upserts); this module wraps it with the ingest_roots() driver and a main() argument-parsing entry point. The module is runnable as python -m classifiers.ingest_skills and is also invoked as a subprocess by the skills-corpus reconcile and pipeline scripts.
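As a rough illustration of the subprocess invocation mentioned above, the following sketch assembles an argument list for python -m classifiers.ingest_skills. The helper name build_ingest_command is hypothetical, and two details are assumptions rather than documented behaviour: that --roots accepts several space-separated paths and that --db takes a single path value.

```python
import sys
from pathlib import Path


def build_ingest_command(db_path, roots=(), *, dedupe=True):
    """Build an argv list for running the ingest module in a subprocess.

    Hypothetical helper, not part of classifiers.ingest_skills.
    """
    cmd = [sys.executable, "-m", "classifiers.ingest_skills", "--db", str(db_path)]
    if roots:
        # Assumption: --roots accepts several space-separated paths.
        cmd.append("--roots")
        cmd.extend(str(Path(r)) for r in roots)
    if not dedupe:
        # Flag name taken from the main() documentation.
        cmd.append("--no-dedupe-body-hash")
    return cmd
```

The resulting list could then be handed to subprocess.run(cmd, check=True). Leaving roots empty relies on the config fallback that main() is described as performing.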

classifiers.ingest_skills.ingest_roots(corpus_roots, db_path, *, dedupe_by_body_hash=True)[source]

Scan the given corpus roots and upsert every discovered skill into SQLite.

The core ingest driver: for each resolved corpus root it discovers skill directories, parses each SKILL.md, optionally drops body-hash duplicates, assigns a stable skill id, and writes one row per surviving skill into the SQLite index. This is what keeps the searchable skill catalog in sync with the on-disk corpus.

It ensures the schema exists via classifiers.skill_catalog.init_db(), enumerates directories with classifiers.skill_catalog.discover_skill_dirs() sorted by classifiers.skill_catalog.canonical_skill_sort_key(), parses files with classifiers.skill_catalog._parse_skill_md, derives ids via classifiers.skill_catalog.stable_skill_id(), and persists rows through classifiers.skill_catalog.upsert_skill().

Its side effects are the SQLite writes at db_path and INFO/WARNING logging; missing roots and unparseable or duplicate files are counted as skips, not errors.

Called by main(), by scripts/skills_corpus_pipeline.py, and by tests/test_skill_catalog.py; no other callers were found.

Parameters:
  • corpus_roots (list[Path]) – Directories to scan, each possibly containing nested skills.

  • db_path (Path) – Path to the SQLite skills index to create and write.

  • dedupe_by_body_hash (bool) – When True (default), the first skill seen for a given body hash wins and later identical bodies are skipped.

Returns:

(inserted_or_updated, skipped) counts across all roots.

Return type:

tuple[int, int]
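The flow documented for ingest_roots() can be approximated with the standard library alone. This is a minimal sketch, not the real implementation: sketch_ingest_roots, the three-column table, the SHA-256 body hash, the path-derived id, and the rglob-based discovery are all assumptions standing in for the classifiers.skill_catalog helpers, whose actual behaviour is not shown in this reference.

```python
import hashlib
import sqlite3
from pathlib import Path


def sketch_ingest_roots(corpus_roots, db_path, *, dedupe_by_body_hash=True):
    """Illustrative stand-in for ingest_roots(); returns (upserted, skipped)."""
    conn = sqlite3.connect(str(db_path))
    # Assumed schema; the real one is created by skill_catalog.init_db().
    conn.execute(
        "CREATE TABLE IF NOT EXISTS skills ("
        "skill_id TEXT PRIMARY KEY, path TEXT, body_hash TEXT)"
    )
    seen_hashes = set()
    upserted = skipped = 0
    for root in corpus_roots:
        root = Path(root).resolve()
        if not root.is_dir():
            # Assumption: a missing root adds one to the skip count.
            skipped += 1
            continue
        # Sorted traversal keeps "first seen wins" deterministic.
        for skill_md in sorted(root.rglob("SKILL.md")):
            body = skill_md.read_text(encoding="utf-8")
            body_hash = hashlib.sha256(body.encode("utf-8")).hexdigest()
            if dedupe_by_body_hash and body_hash in seen_hashes:
                skipped += 1  # later identical body is dropped
                continue
            seen_hashes.add(body_hash)
            # Assumed id scheme: hash of the directory path relative to the root.
            rel = skill_md.parent.relative_to(root).as_posix()
            skill_id = hashlib.sha256(rel.encode("utf-8")).hexdigest()[:16]
            # INSERT OR REPLACE acts as a simple upsert keyed on skill_id.
            conn.execute(
                "INSERT OR REPLACE INTO skills (skill_id, path, body_hash) "
                "VALUES (?, ?, ?)",
                (skill_id, str(skill_md), body_hash),
            )
            upserted += 1
    conn.commit()
    conn.close()
    return upserted, skipped
```

Because the set of seen hashes is shared across roots, the order in which roots and directories are visited decides which copy of a duplicated skill survives, which is presumably why the real driver sorts its discoveries with a canonical key.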

classifiers.ingest_skills.main()[source]

Run the SKILL.md ingestion as a standalone CLI entry point.

Parses command-line arguments (--roots, --db, --no-dedupe-body-hash), resolves the corpus roots to scan, ingests every discovered SKILL.md into the SQLite skills index, and logs a summary of how many rows were upserted, how many were skipped, and how many are now present in the index.

When no --roots are supplied it falls back to the configured corpus roots by importing config.Config and reading Config.load().skills_corpus_roots; if that yields nothing it logs an error and aborts. The actual scan and database writes are delegated to ingest_roots(), and the final total is computed via classifiers.skill_catalog.load_all_skills.

Called by the module’s __main__ guard (raise SystemExit(main())), so it is the process entry point when the module is run as python -m classifiers.ingest_skills, for example in the subprocess spawned by scripts/reconcile_skills_sqlite.py. No in-process Python callers invoke main() directly; the pipeline in scripts/skills_corpus_pipeline.py calls ingest_roots() instead.

Returns:

0 on successful ingestion, or 1 when no corpus roots could be resolved from arguments or config.

Return type:

int
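The argument handling and exit codes described for main() might look roughly like the sketch below. It is written under several assumptions: sketch_main is a hypothetical name, the argparse options (nargs="*" for --roots, a skills.db default for --db) are guesses at the real parser, and the config lookup and ingest driver are injected as plain parameters so the example runs without the project's config or classifiers packages. The final row-count step via load_all_skills is omitted.

```python
import argparse
import logging
from pathlib import Path

log = logging.getLogger("ingest_skills_sketch")


def sketch_main(argv, ingest, config_roots=()):
    """Illustrative stand-in for main(); returns a process exit code.

    ingest is any callable with the ingest_roots() signature;
    config_roots stands in for Config.load().skills_corpus_roots.
    """
    parser = argparse.ArgumentParser(description="Ingest SKILL.md files (sketch)")
    # Assumed option shapes; only the flag names come from the documentation.
    parser.add_argument("--roots", nargs="*", type=Path, default=[])
    parser.add_argument("--db", type=Path, default=Path("skills.db"))
    parser.add_argument("--no-dedupe-body-hash", action="store_true")
    args = parser.parse_args(argv)

    # Explicit --roots take precedence; otherwise fall back to configured roots.
    roots = list(args.roots) or [Path(r) for r in config_roots]
    if not roots:
        log.error("no corpus roots given via --roots or config")
        return 1

    upserted, skipped = ingest(
        roots, args.db, dedupe_by_body_hash=not args.no_dedupe_body_hash
    )
    log.info("upserted=%d skipped=%d", upserted, skipped)
    return 0
```

Returning an int instead of calling sys.exit() directly matches the documented raise SystemExit(main()) guard and keeps the function easy to exercise from tests.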