classifiers.build_tool_index module

Build the tool index used by the vector classifier.

Auto-discovers every registered tool via tool_loader, then calls an LLM to generate 50 diverse synthetic user queries per tool (reverse-HyDE). Results are saved to tool_index_data.json in this directory.

Usage:

python -m classifiers.build_tool_index [--tools-dir tools]

classifiers.build_tool_index.discover_invalid_query_index_tools(index_data, registered, *, expected_count=50)

Return tool names whose stored index entry cannot drive embeddings.

An entry is invalid when its key is missing, its value is not a dict, synthetic_queries is missing or not a list, the list has fewer than expected_count items, or any of the first expected_count entries is not a non-empty string (after strip).

Used by classifiers.refresh_tool_embeddings to redo query generation only where tool_index_data.json is incomplete or malformed (e.g. failed or partial LLM JSON).
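The validity rules above can be restated as a short sketch. This is not the module's implementation: the function name find_invalid_entries is hypothetical, and registered is assumed here to be an iterable of tool names, which the signature does not confirm.

```python
from typing import Any, Iterable


def find_invalid_entries(
    index_data: dict[str, Any],
    registered: Iterable[str],  # assumption: an iterable of tool names
    *,
    expected_count: int = 50,
) -> list[str]:
    """Hypothetical restatement of the documented validity rules."""
    invalid: list[str] = []
    for name in registered:
        entry = index_data.get(name)
        # A missing key yields None, which also fails the dict check.
        if not isinstance(entry, dict):
            invalid.append(name)
            continue
        queries = entry.get("synthetic_queries")
        # synthetic_queries must be a list with at least expected_count items.
        if not isinstance(queries, list) or len(queries) < expected_count:
            invalid.append(name)
            continue
        # Only the first expected_count entries are inspected.
        head = queries[:expected_count]
        if any(not isinstance(q, str) or not q.strip() for q in head):
            invalid.append(name)
    return invalid
```

Under these assumptions, a tool whose entry holds a whitespace-only query among its first expected_count items is reported alongside tools with no entry at all.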

Return type:

list[str]

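Parameters:
  • index_data – Parsed contents of tool_index_data.json, keyed by tool name.

  • registered – The registered tools to check against the index.

  • expected_count (int) – Minimum number of valid synthetic queries per tool. Defaults to 50.

async classifiers.build_tool_index.generate_synthetic_queries(client, base_url, api_key, tool_name, tool_description, count=50, model='gemini-3.1-flash-lite', *, openrouter_only=False)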

Produce count synthetic queries via Gemini, with model fallbacks.

When openrouter_only is True, skips Gemini and uses OpenRouter chat with google/{_OPENROUTER_ONLY_QUERY_MODEL} only (requires OPENROUTER_* or API_KEY).

Tries the primary model first, then gemini-2.5-flash and gemini-3-flash-preview on the same API key whenever generateContent returns a retriable HTTP status. If every Gemini model fails for the round, falls back to OpenRouter chat completions with google/<primary model name>. The OpenRouter key is resolved in order: OPENROUTER_QUERY_GEN_API_KEY, then OPENROUTER_API_KEY, then API_KEY. Embeddings are unaffected; this path covers query generation only.
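The model order and key precedence described above might be sketched as follows. Both helper names are hypothetical, and two behaviours are assumptions rather than documented facts: empty environment values are treated as unset, and a primary model that matches a fallback is not tried twice.

```python
import os
from typing import Mapping, Optional

# Fallback models named in the documentation, tried after the primary model.
_FALLBACK_MODELS = ("gemini-2.5-flash", "gemini-3-flash-preview")


def gemini_model_order(primary: str) -> list[str]:
    """Primary model first, then the documented fallbacks."""
    # Assumption: duplicates are dropped; dict.fromkeys keeps first-seen order.
    return list(dict.fromkeys((primary, *_FALLBACK_MODELS)))


def resolve_openrouter_key(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first OpenRouter key found, in documented precedence order."""
    for var in ("OPENROUTER_QUERY_GEN_API_KEY", "OPENROUTER_API_KEY", "API_KEY"):
        value = env.get(var, "").strip()
        if value:  # assumption: empty or whitespace-only values count as unset
            return value
    return None
```

Passing a plain dict as env keeps the lookup testable without touching the real process environment.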

On invalid JSON or too few queries, the same model is called again immediately (up to GEMINI_QUERY_GEN_JSON_PARSE_MAX_RETRIES_SAME_MODEL times per model) before moving on to the next model or outer round. Partial JSON may be recovered by extracting the first {...} block.
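One way to recover the first {...} block from a noisy response is shown below. It is a sketch only: the helper name extract_first_json_object is hypothetical, and the module may use a different extraction method (for example a regular expression). The shape of the recovered object is deliberately left unspecified.

```python
import json
from typing import Any, Optional


def extract_first_json_object(text: str) -> Optional[dict[str, Any]]:
    """Parse text as JSON, or fall back to the first {...} block it contains."""
    try:
        data = json.loads(text)
        if isinstance(data, dict):
            return data
    except json.JSONDecodeError:
        pass
    start = text.find("{")
    if start == -1:
        return None
    try:
        # raw_decode parses one JSON value starting at `start` and ignores
        # whatever follows it, so trailing chatter does not break parsing.
        data, _end = json.JSONDecoder().raw_decode(text, start)
    except json.JSONDecodeError:
        return None  # truncated or malformed; the caller would retry
    return data
```

A truncated object returns None, which corresponds to the case where the same model is called again.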

429 responses wait with exponential backoff and retry the same Gemini model (see GEMINI_QUERY_GEN_429_*); OpenRouter 429 uses OPENROUTER_QUERY_GEN_429_*.
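A minimal sketch of waiting with exponential backoff on 429 follows. The parameter names base_seconds, max_seconds and max_retries are hypothetical stand-ins for whatever the GEMINI_QUERY_GEN_429_* constants define, and send is an assumed callable that performs one request and returns its HTTP status.

```python
import asyncio
from typing import Awaitable, Callable


def backoff_delay(attempt: int, base_seconds: float = 1.0, max_seconds: float = 60.0) -> float:
    """Delay before retry number `attempt` (0-based): base * 2**attempt, capped."""
    return min(base_seconds * (2 ** attempt), max_seconds)


async def call_with_429_backoff(
    send: Callable[[], Awaitable[int]],  # hypothetical: one request, returns status
    max_retries: int = 5,
    base_seconds: float = 1.0,
    max_seconds: float = 60.0,
) -> int:
    """Retry the same request while it returns 429, sleeping between attempts."""
    status = await send()
    for attempt in range(max_retries):
        if status != 429:
            break
        await asyncio.sleep(backoff_delay(attempt, base_seconds, max_seconds))
        status = await send()
    return status
```

The sketch omits jitter for clarity; whether the module adds jitter is not stated in this documentation.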

Retries use exponential backoff. After _QUERY_GEN_PAID_AFTER_FAILURES consecutive full-round failures, gemini_embed_pool.get_paid_fallback_key() is used (if set). Env: GEMINI_QUERY_GEN_* (see module constants).

Raises:

RuntimeError – if valid queries cannot be produced after all attempts.

Return type:

list[str]

Parameters:
  • client (httpx.AsyncClient)

  • base_url (str | None)

  • api_key (str | None)

  • tool_name (str)

  • tool_description (str)

  • count (int)

  • model (str)

  • openrouter_only (bool)

async classifiers.build_tool_index.build_index(tools_dir='tools')

Discover every tool and generate its synthetic queries into the index file.

The top-level driver for the reverse-HyDE index build: it auto-discovers all registered tools, generates SYNTHETIC_QUERY_COUNT synthetic user queries per tool, and writes the merged result to tool_index_data.json. That JSON is the input later consumed by the embedding initializers to populate the vector classifier, so this is the first stage of the tool-routing pipeline.

It builds a tools.ToolRegistry and loads it via tool_loader.load_tools(), then reads any existing OUTPUT_FILE so that tools which already have enough queries are skipped, which makes the build resumable. Generation fans out with concurrency capped at three workers by an asyncio.Semaphore(3). Each worker calls generate_synthetic_queries() over a shared httpx.AsyncClient (Gemini with OpenRouter fallback), and the final dict is written back to OUTPUT_FILE on disk. Progress is logged throughout. Invoked only from this module’s __main__ guard via asyncio.run; no other callers were found.
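The resumable, semaphore-bounded fan-out described above can be sketched as follows. The function build_all and its generate callback are hypothetical stand-ins for the real driver and generate_synthetic_queries(); registry loading, the shared HTTP client, logging and the file write are left out.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def build_all(
    tools: dict[str, str],  # assumption: tool name -> description
    existing: dict[str, Any],  # previously saved index contents
    generate: Callable[[str, str], Awaitable[list[str]]],
    expected_count: int = 50,
    max_concurrency: int = 3,
) -> dict[str, Any]:
    """Generate queries only for tools that lack them, three at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)
    result = dict(existing)

    async def worker(name: str, description: str) -> None:
        # The semaphore admits at most max_concurrency workers at once.
        async with semaphore:
            queries = await generate(name, description)
            result[name] = {"synthetic_queries": queries}

    def has_enough(name: str) -> bool:
        entry = existing.get(name)
        queries = entry.get("synthetic_queries") if isinstance(entry, dict) else None
        return isinstance(queries, list) and len(queries) >= expected_count

    pending = [worker(n, d) for n, d in tools.items() if not has_enough(n)]
    await asyncio.gather(*pending)
    return result
```

Copying existing into result first means an interrupted run loses nothing already generated, provided the caller writes result back to disk afterwards.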

Parameters:

tools_dir (str) – Directory scanned for tool modules. Defaults to "tools".

Return type:

None

Returns:

None. Side effects are the synthetic-query writes to tool_index_data.json.