classifiers.redis_vector_index module
RediSearch KNN helpers for tool/skill/dangerous-command centroid embeddings.
Stores one Redis HASH per item under tool_emb:{name} / skill_emb:{id} /
dangerous_cmd_emb:{category_id} / benign_tech_emb:{category_id} and
queries via FT.SEARCH (see init_redis_indexes).
Legacy monolithic hashes (TOOL_EMBEDDINGS_HASH_KEY, etc.) remain supported
as a fallback until fully migrated.
- classifiers.redis_vector_index.embedding_to_blob(vec)[source]
Serialize an embedding vector to a FLOAT32 blob for RediSearch.
Converts a numpy array or float list into the little-endian FLOAT32 byte string that RediSearch expects for both stored embedding HASH fields and the $query_vec query parameter. The dimension is validated against VECTOR_DIM up front so a misshapen vector fails loudly here rather than producing a silently corrupt index entry or query.
This is a pure transform with no side effects. It is called by every store helper in this module (store_tool_embedding_hash(), store_skill_embedding_hash(), store_dangerous_cmd_embedding_hash(), store_benign_tech_embedding_hash()) and by every KNN search helper to encode the query vector, and is exercised directly by tests/test_vector_redisearch_knn.py.
- Parameters:
vec (ndarray | list[float]) – Embedding as a numpy array or a list of floats.
- Returns:
The vector encoded as a contiguous little-endian FLOAT32 blob.
- Return type:
- Raises:
ValueError – If the vector’s element count does not equal VECTOR_DIM.
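The encoding described above can be illustrated with a minimal standard-library sketch. It is not the module's implementation (which accepts numpy arrays); the function name embedding_to_blob_sketch and the value VECTOR_DIM = 4 are assumptions made only for this example.

```python
import struct
from typing import Sequence

# Hypothetical dimension for this sketch; the real module defines its own VECTOR_DIM.
VECTOR_DIM = 4


def embedding_to_blob_sketch(vec: Sequence[float]) -> bytes:
    """Pack a vector as little-endian FLOAT32 bytes, checking its length first."""
    if len(vec) != VECTOR_DIM:
        raise ValueError(f"expected {VECTOR_DIM} elements, got {len(vec)}")
    # "<" forces little-endian byte order; "f" is a 4-byte IEEE 754 float.
    return struct.pack(f"<{VECTOR_DIM}f", *vec)


blob = embedding_to_blob_sketch([0.0, 1.0, -1.0, 0.5])
print(len(blob))  # 4 floats * 4 bytes each
```

Validating the length before packing is what makes a misshapen vector fail at the call site instead of reaching Redis.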
- async classifiers.redis_vector_index.store_tool_embedding_hash(redis, tool_name, centroid, metadata)[source]
Upsert the per-tool RediSearch HASH holding a tool’s centroid embedding.
Writes (or overwrites) the tool_emb:{tool_name} HASH with the FLOAT32 embedding blob, the plain name field, and a JSON-encoded meta_json blob, which registers the tool as a document in the RediSearch tool index so it becomes reachable via knn_search_tools(). This is the inverse of delete_tool_embedding_hash() and the per-key successor to the legacy monolithic tool hash.
The embedding is encoded via embedding_to_blob() (validating its dimension) and the function issues a single HSET against Redis. It is called by the tool-embedding build and refresh scripts (classifiers/update_tool_embeddings.py, classifiers/refresh_tool_embeddings.py, classifiers/update_changed_tool_embeddings.py), by migrate_legacy_tool_hashes_to_redisearch() when porting legacy hashes, by the in-process rebuild path in classifiers/vector_classifier.py, and by tests/test_vector_redisearch_knn.py.
- Parameters:
- Return type:
- Returns:
None.
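The single-HSET upsert described above might look roughly like the following sketch. FakeRedis and store_tool_sketch are hypothetical names introduced here so the example runs without a server; the field names (embedding, name, meta_json) come from the description, and storing the tool name in the name field is an assumption.

```python
import asyncio
import json
import struct
from typing import Any, Dict, List


class FakeRedis:
    """Hypothetical in-memory stand-in for an async Redis client."""

    def __init__(self) -> None:
        self.hashes: Dict[str, Dict[str, Any]] = {}

    async def hset(self, name: str, mapping: Dict[str, Any]) -> int:
        # Simplification: a real server returns the number of newly added fields.
        self.hashes.setdefault(name, {}).update(mapping)
        return len(mapping)


async def store_tool_sketch(
    redis: Any, tool_name: str, centroid: List[float], metadata: Dict[str, Any]
) -> None:
    """Write one per-tool HASH with a single HSET call."""
    blob = struct.pack(f"<{len(centroid)}f", *centroid)  # little-endian FLOAT32
    await redis.hset(
        f"tool_emb:{tool_name}",
        mapping={
            "embedding": blob,
            "name": tool_name,  # assumption: the name field holds the tool name
            "meta_json": json.dumps(metadata),
        },
    )


fake = FakeRedis()
asyncio.run(store_tool_sketch(fake, "grep", [0.1, 0.2], {"name": "grep"}))
stored = fake.hashes["tool_emb:grep"]
print(sorted(stored))
```

Because the write is one HSET on one key, repeating the call with a new centroid simply overwrites the previous fields.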
- async classifiers.redis_vector_index.delete_tool_embedding_hash(redis, tool_name)[source]
Delete the per-tool RediSearch HASH for tool_name.
Removes the tool_emb:{tool_name} key (the inverse of store_tool_embedding_hash()), which also drops the document from the RediSearch tool index so it is no longer returned by KNN queries.
This issues a single DEL against tool_emb:{tool_name} and has no other side effects. It is called by classifiers/update_tool_embeddings.py when pruning orphaned tool entries during an embedding rebuild (alongside HDEL on the legacy monolithic hashes), and by tests/test_vector_redisearch_knn.py for fixture cleanup.
- async classifiers.redis_vector_index.store_skill_embedding_hash(redis, skill_id, vec, meta)[source]
Upsert the per-skill RediSearch HASH holding a skill’s embedding.
Writes (or overwrites) the skill_emb:{skill_id} HASH with the FLOAT32 embedding blob plus searchable skill_id, name, description, and JSON meta_json fields (the name and description falling back to the skill id and empty string when absent from meta). This registers the skill as a document in the RediSearch skill index so it is returned by knn_search_skills(), and is the inverse of delete_skill_embedding_hash().
The embedding is encoded via embedding_to_blob() and the function issues a single HSET against Redis. It is called by classifiers/update_skill_embeddings.py during a skill-embedding rebuild, by migrate_legacy_skill_hashes_to_redisearch() when porting legacy hashes, and by tests/test_vector_redisearch_knn.py.
- Parameters:
redis (Redis) – Async Redis client used to issue the write.
skill_id (str) – Skill identifier, used as the key suffix and skill_id field.
vec (ndarray) – Embedding vector for the skill.
meta (dict[str, Any]) – Skill metadata; its name and description entries populate the corresponding fields and the whole dict is serialized into meta_json.
- Return type:
- Returns:
None.
- async classifiers.redis_vector_index.delete_skill_embedding_hash(redis, skill_id)[source]
Delete the per-skill RediSearch HASH for skill_id.
Removes the skill_emb:{skill_id} key (the inverse of store_skill_embedding_hash()), dropping the document from the RediSearch skill index so it is excluded from subsequent KNN queries.
This issues a single DEL against skill_emb:{skill_id} with no other side effects. It is called by classifiers/update_skill_embeddings.py when pruning orphaned skill embeddings during a rebuild (alongside HDEL on the legacy monolithic hashes), and by tests/test_vector_redisearch_knn.py for fixture cleanup.
- async classifiers.redis_vector_index.store_dangerous_cmd_embedding_hash(redis, category_id, centroid, metadata)[source]
Upsert the per-category dangerous-command RediSearch HASH.
Writes (or overwrites) the dangerous_cmd_emb:{category_id} HASH with the FLOAT32 embedding blob, the category_id field, and a JSON meta_json blob, registering the category as a document in the dangerous-command index so the guard can match against it via knn_search_dangerous_cmds(). This is the inverse of delete_dangerous_cmd_embedding_hash().
The centroid is encoded via embedding_to_blob() and the function issues a single HSET against Redis. It is called by classifiers/update_dangerous_command_embeddings.py when (re)building the dangerous-command corpus, and by tests/test_vector_redisearch_knn.py.
- Parameters:
- Return type:
- Returns:
None.
- async classifiers.redis_vector_index.delete_dangerous_cmd_embedding_hash(redis, category_id)[source]
Delete the per-category dangerous-command RediSearch HASH.
Removes the dangerous_cmd_emb:{category_id} key (the inverse of store_dangerous_cmd_embedding_hash()), dropping the document from the dangerous-command index so it no longer participates in the guard’s KNN matching.
This issues a single DEL against dangerous_cmd_emb:{category_id} with no other side effects. It is called by classifiers/update_dangerous_command_embeddings.py when pruning orphaned categories during a corpus rebuild, and by tests/test_vector_redisearch_knn.py for fixture cleanup.
- async classifiers.redis_vector_index.store_benign_tech_embedding_hash(redis, category_id, centroid, metadata)[source]
Upsert the per-category benign-technical RediSearch HASH.
Writes (or overwrites) the benign_tech_emb:{category_id} HASH with the FLOAT32 embedding blob, the category_id field, and a JSON meta_json blob, registering the category as a document in the benign-technical index so the guard can match against it via knn_search_benign_tech() (the “looks dangerous but is actually benign” allow-list counterpart to the dangerous-command index). This is the inverse of delete_benign_tech_embedding_hash().
The centroid is encoded via embedding_to_blob() and the function issues a single HSET against Redis. It is called by classifiers/update_benign_technical_embeddings.py when (re)building the benign-technical corpus, and by tests/test_vector_redisearch_knn.py.
- Parameters:
- Return type:
- Returns:
None.
- async classifiers.redis_vector_index.delete_benign_tech_embedding_hash(redis, category_id)[source]
Delete the per-category benign-technical RediSearch HASH.
Removes the benign_tech_emb:{category_id} key (the inverse of store_benign_tech_embedding_hash()), dropping the document from the benign-technical index so it no longer participates in KNN matching.
This issues a single DEL against benign_tech_emb:{category_id} with no other side effects. It is called by classifiers/update_benign_technical_embeddings.py when pruning orphaned categories during a rebuild, and by tests/test_vector_redisearch_knn.py for fixture cleanup.
- async classifiers.redis_vector_index.knn_search_tools(redis, query_embedding, *, knn_k, ef_runtime=200)[source]
Find the tools whose centroid embeddings are nearest a query vector.
Runs an approximate FT.SEARCH KNN query over the RediSearch tool index for the knn_k nearest neighbours of query_embedding, converting each match’s cosine distance into a cosine similarity (1 - distance) and parsing the stored meta_json back into a dict. Results come out sorted by ascending distance, i.e. descending similarity. Any RediSearch error is swallowed (logged at debug) and yields an empty list so the classifier degrades gracefully when the index is missing or unhealthy.
Internally it encodes the query via embedding_to_blob(), builds the clause with _knn_clause(), and reads result fields through _doc_str(). It is called by tools/search_tools.py and by the tool vector classifier in classifiers/vector_classifier.py to route requests to candidate tools, and by tests/test_vector_redisearch_knn.py.
- Parameters:
- Returns:
Per-match dicts with name, score (cosine similarity), and metadata (parsed meta_json), ordered most-similar first; empty on any search failure.
- Return type:
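The post-processing step described above (distance to similarity, meta_json parsing, most-similar-first ordering) can be sketched without a Redis server. The helper name rows_to_tool_matches and the raw row key "distance" are hypothetical, and falling back to an empty dict on unparseable meta_json is an assumption rather than documented behaviour.

```python
import json
from typing import Any, Dict, List


def rows_to_tool_matches(rows: List[Dict[str, str]]) -> List[Dict[str, Any]]:
    """Turn raw KNN rows into result dicts ordered most-similar first."""
    matches = []
    for row in rows:
        distance = float(row["distance"])  # cosine distance reported by the search
        try:
            metadata = json.loads(row.get("meta_json") or "{}")
        except json.JSONDecodeError:
            metadata = {}  # assumption: bad JSON degrades to empty metadata
        matches.append(
            {"name": row["name"], "score": 1.0 - distance, "metadata": metadata}
        )
    # Ascending distance is the same ordering as descending similarity.
    matches.sort(key=lambda m: m["score"], reverse=True)
    return matches


raw_rows = [
    {"name": "sed", "distance": "0.5", "meta_json": '{"name": "sed"}'},
    {"name": "grep", "distance": "0.25", "meta_json": '{"name": "grep"}'},
]
tool_matches = rows_to_tool_matches(raw_rows)
print([m["name"] for m in tool_matches])
```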
- async classifiers.redis_vector_index.knn_search_skills(redis, query_embedding, *, knn_k, ef_runtime=200)[source]
Find the skills whose embeddings are nearest a query vector.
Runs an approximate FT.SEARCH KNN query over the RediSearch skill index for the knn_k nearest neighbours of query_embedding, converting each match’s cosine distance into a cosine similarity (1 - distance) and parsing the stored meta_json. Each result’s name and description come from the indexed fields, falling back to meta_json (then the skill id / empty string) when those fields are blank. Results are ordered descending by similarity, and any RediSearch error is swallowed (logged at debug) and returns an empty list.
Internally it encodes the query via embedding_to_blob(), builds the clause with _knn_clause(), and reads fields through _doc_str(). It is called by the skill vector classifier in classifiers/vector_classifier.py to surface candidate skills, and by tests/test_vector_redisearch_knn.py.
- Parameters:
- Returns:
Per-match dicts with skill_id, name, description, score (cosine similarity), and metadata, ordered most-similar first; empty on any search failure.
- Return type:
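The fallback chain for name and description can be written as a small pure function. This is only a sketch of the ordering described above (indexed field, then meta_json, then skill id or empty string); resolve_skill_display is a hypothetical name, not a helper from the module.

```python
from typing import Any, Dict, Tuple


def resolve_skill_display(
    skill_id: str, name_field: str, description_field: str, metadata: Dict[str, Any]
) -> Tuple[str, str]:
    """Pick name/description from indexed fields, then metadata, then defaults."""
    name = name_field or metadata.get("name") or skill_id
    description = description_field or metadata.get("description") or ""
    return name, description


from_fields = resolve_skill_display("s1", "Deploy", "Ship a build", {"name": "ignored"})
from_meta = resolve_skill_display("s1", "", "", {"name": "Deploy"})
from_defaults = resolve_skill_display("s1", "", "", {})
print(from_fields, from_meta, from_defaults)
```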
- async classifiers.redis_vector_index.knn_search_dangerous_cmds(redis, query_embedding, *, knn_k, ef_runtime=200)[source]
Find the dangerous-command categories nearest a query vector.
Runs an approximate FT.SEARCH KNN query over the dangerous-command index for the knn_k nearest neighbours of query_embedding, converting each match’s cosine distance into a cosine similarity (1 - distance) and parsing the stored meta_json. Results are ordered descending by similarity, and any RediSearch error is swallowed (logged at debug) and returns an empty list so a missing index fails open rather than crashing the guard.
Internally it encodes the query via embedding_to_blob(), builds the clause with _knn_clause(), and reads fields through _doc_str(). It is called by classifiers/dangerous_command_guard.py to decide whether a candidate command resembles a known dangerous category, and by tests/test_vector_redisearch_knn.py.
- Parameters:
- Returns:
Per-match dicts with category_id, score (cosine similarity), and metadata, ordered most-similar first; empty on any search failure.
- Return type:
- async classifiers.redis_vector_index.knn_search_benign_tech(redis, query_embedding, *, knn_k, ef_runtime=200)[source]
Find the benign-technical categories nearest a query vector.
Runs an approximate FT.SEARCH KNN query over the benign-technical index for the knn_k nearest neighbours of query_embedding, converting each match’s cosine distance into a cosine similarity (1 - distance) and parsing the stored meta_json. This is the allow-list counterpart to knn_search_dangerous_cmds(): a strong benign match lets the guard clear a command that merely looks dangerous. Results are ordered descending by similarity, and any RediSearch error is swallowed (logged at debug) and returns an empty list.
Internally it encodes the query via embedding_to_blob(), builds the clause with _knn_clause(), and reads fields through _doc_str(). It is called by classifiers/dangerous_command_guard.py (typically with knn_k=1 for the nearest benign category) and by tests/test_vector_redisearch_knn.py.
- Parameters:
- Returns:
Per-match dicts with category_id, score (cosine similarity), and metadata, ordered most-similar first; empty on any search failure.
- Return type:
- async classifiers.redis_vector_index.redisearch_index_doc_count(redis, index_name)[source]
Report how many documents a RediSearch index currently holds.
Calls FT.INFO on index_name and reads back the num_docs field, coercing whatever Redis returns (bytes or str keys, string counts) into an int. It is the cheap liveness/population check the classifiers and guard run before issuing a KNN query, so they can skip the search entirely when an index is empty or absent. Every failure path – a missing index, a non-dict reply, a missing or unparseable num_docs – returns -1 rather than raising.
The only side effect is the single FT.INFO round trip. It is called by the tool and skill vector classifiers in classifiers/vector_classifier.py, by classifiers/dangerous_command_guard.py (for both the dangerous-command and benign-technical indexes), by tools/search_tools.py, and by tests/test_vector_redisearch_knn.py.
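The coercion rules above can be illustrated on an already-fetched reply. The sketch below covers only the parsing step, assumes the reply has been turned into a dict by the client, and uses the hypothetical name parse_num_docs; the network call and the missing-index error path are left out.

```python
from typing import Any


def parse_num_docs(info: Any) -> int:
    """Extract num_docs from an FT.INFO-style reply, or -1 if that is not possible."""
    if not isinstance(info, dict):
        return -1
    # The key may arrive as str or bytes depending on client decoding settings.
    raw = info.get("num_docs", info.get(b"num_docs"))
    if raw is None:
        return -1
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8", errors="replace")
    try:
        return int(raw)
    except (TypeError, ValueError):
        return -1


print(parse_num_docs({"num_docs": "12"}))
print(parse_num_docs({b"num_docs": b"7"}))
print(parse_num_docs(["num_docs", 3]))
```

Returning -1 instead of raising lets a caller treat any value below 1 as "skip the KNN query".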
- async classifiers.redis_vector_index.scan_tool_names(redis)[source]
Enumerate every tool name that currently has a stored embedding HASH.
Iterates the keyspace with a cursor-based SCAN over the tool_emb:* pattern (in batches of 500), strips the TOOL_EMB_PREFIX off each matching key, and returns the deduplicated, sorted list of tool names. SCAN is used instead of KEYS so the walk stays non-blocking on a large keyspace. The vector classifier uses this list to know which tools exist so it can expand tool-name prefixes when routing.
The only side effect is the sequence of SCAN calls against Redis. It is called by the tool vector classifier in classifiers/vector_classifier.py to populate its cached tool-name list.
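The cursor loop described above follows the usual SCAN contract: keep calling with the returned cursor until it comes back as 0. The sketch below runs against a hypothetical in-memory client (FakeScanRedis) whose scan method mimics the (cursor, keys) return shape of an async Redis client; the prefix value is taken from the key layout in the module header.

```python
import asyncio
import fnmatch
from typing import Any, List, Optional, Set, Tuple

TOOL_EMB_PREFIX = "tool_emb:"  # per the tool_emb:{name} key layout


class FakeScanRedis:
    """Hypothetical stand-in that pages through a fixed key list."""

    def __init__(self, keys: List[bytes], page_size: int = 2) -> None:
        self._keys = keys
        self._page_size = page_size

    async def scan(
        self, cursor: int = 0, match: Optional[str] = None, count: Optional[int] = None
    ) -> Tuple[int, List[bytes]]:
        page = self._keys[cursor : cursor + self._page_size]
        next_cursor = cursor + self._page_size
        if next_cursor >= len(self._keys):
            next_cursor = 0  # 0 signals that the iteration is complete
        hits = [k for k in page if match is None or fnmatch.fnmatch(k.decode(), match)]
        return next_cursor, hits


async def scan_tool_names_sketch(redis: Any) -> List[str]:
    """Collect tool names from tool_emb:* keys, deduplicated and sorted."""
    names: Set[str] = set()
    cursor = 0
    while True:
        cursor, keys = await redis.scan(
            cursor=cursor, match=f"{TOOL_EMB_PREFIX}*", count=500
        )
        for key in keys:
            text = key.decode() if isinstance(key, bytes) else key
            if text.startswith(TOOL_EMB_PREFIX):
                names.add(text[len(TOOL_EMB_PREFIX):])
        if cursor == 0:
            break
    return sorted(names)


fake_keys = [b"tool_emb:grep", b"other:1", b"tool_emb:awk", b"tool_emb:grep", b"tool_emb:sed"]
tool_names = asyncio.run(scan_tool_names_sketch(FakeScanRedis(fake_keys)))
print(tool_names)
```

Collecting into a set matters because SCAN may return the same key more than once during a full iteration.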
- async classifiers.redis_vector_index.scan_dangerous_cmd_category_ids(redis)[source]
Enumerate every dangerous-command category id with a stored embedding.
Iterates the keyspace with a cursor-based SCAN over the dangerous_cmd_emb:* pattern (in batches of 500), strips the DANGEROUS_CMD_EMB_PREFIX off each matching key, and returns the deduplicated, sorted category ids. SCAN keeps the walk non-blocking on a large keyspace.
The only side effect is the sequence of SCAN calls against Redis. It is called by classifiers/update_dangerous_command_embeddings.py to learn which categories already exist so it can prune ones that have been removed from the source corpus.
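The pruning use mentioned above amounts to a set difference between the ids found in Redis and the ids in the current corpus. The following is a hypothetical sketch of that step only; find_orphans and the example category ids are invented for illustration and do not come from the update script.

```python
from typing import Iterable, List


def find_orphans(existing_ids: Iterable[str], corpus_ids: Iterable[str]) -> List[str]:
    """Return ids that are stored in Redis but no longer present in the corpus."""
    return sorted(set(existing_ids) - set(corpus_ids))


# Invented example ids: what a scan returned vs. what the corpus now contains.
existing = ["fs_wipe", "fork_bomb", "disk_overwrite"]
corpus = ["fs_wipe", "disk_overwrite", "priv_escalation"]

orphans = find_orphans(existing, corpus)
keys_to_delete = [f"dangerous_cmd_emb:{cid}" for cid in orphans]
print(keys_to_delete)
```

Each resulting key would then be handed to delete_dangerous_cmd_embedding_hash(), one category at a time.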
- async classifiers.redis_vector_index.scan_benign_tech_category_ids(redis)[source]
Enumerate every benign-technical category id with a stored embedding.
Iterates the keyspace with a cursor-based SCAN over the benign_tech_emb:* pattern (in batches of 500), strips the BENIGN_TECH_EMB_PREFIX off each matching key, and returns the deduplicated, sorted category ids. SCAN keeps the walk non-blocking on a large keyspace.
The only side effect is the sequence of SCAN calls against Redis. It is called by classifiers/update_benign_technical_embeddings.py to learn which categories already exist so it can prune ones that have been removed from the source corpus.
- async classifiers.redis_vector_index.migrate_legacy_tool_hashes_to_redisearch(redis, *, embeddings_key, metadata_key)[source]
Backfill per-tool RediSearch HASHs from the legacy monolithic tool hashes.
Reads the two legacy monolithic hashes – embeddings_key (field per tool -> JSON vector) and metadata_key (field per tool -> JSON metadata) – and, for each tool, decodes its vector into a FLOAT32 numpy array and writes a per-key tool_emb: HASH via store_tool_embedding_hash(), which is what migrates the data into the new RediSearch-indexed layout. Tools whose vector JSON fails to parse are skipped; metadata lookups tolerate both bytes and str field keys and fall back to {"name": name} when absent or unparseable.
Side effects are the two HGETALL reads plus one HSET per migrated tool, and an info-level log of the count. It is called by the one-shot migration script classifiers/migrate_embeddings_redisearch.py.
- Parameters:
- Returns:
The number of per-tool HASHs written (0 if the legacy embeddings hash is empty or missing).
- Return type:
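The decoding and fallback rules above can be sketched as a pure planning step over the two HGETALL results. plan_tool_migration and _lookup are hypothetical names; the real function writes each entry through store_tool_embedding_hash() and builds numpy arrays, whereas this sketch only returns plain lists. Treating non-dict metadata as unparseable is an assumption.

```python
import json
from typing import Any, Dict, List, Tuple


def _lookup(mapping: Dict[Any, Any], field: str) -> Any:
    """Find a field whether the hash keys came back as str or bytes."""
    if field in mapping:
        return mapping[field]
    return mapping.get(field.encode())


def plan_tool_migration(
    legacy_embeddings: Dict[Any, Any], legacy_metadata: Dict[Any, Any]
) -> List[Tuple[str, List[float], Dict[str, Any]]]:
    """Decode legacy entries into (name, vector, metadata) tuples to be written."""
    plan = []
    for raw_name, raw_vec in legacy_embeddings.items():
        name = raw_name.decode() if isinstance(raw_name, bytes) else raw_name
        try:
            vector = [float(x) for x in json.loads(raw_vec)]
        except (TypeError, ValueError):
            continue  # unparseable vector JSON: skip this tool
        fallback = {"name": name}
        raw_meta = _lookup(legacy_metadata, name)
        try:
            meta = json.loads(raw_meta) if raw_meta is not None else fallback
        except (TypeError, ValueError):
            meta = fallback
        if not isinstance(meta, dict):
            meta = fallback  # assumption: non-dict metadata counts as unparseable
        plan.append((name, vector, meta))
    return plan


legacy_vecs = {b"grep": b"[0.5, 1.0]", b"bad": b"not json", "awk": "[1, 2]"}
legacy_meta = {b"grep": b'{"name": "grep", "tags": ["text"]}', "awk": "{oops"}
migration_plan = plan_tool_migration(legacy_vecs, legacy_meta)
print(len(migration_plan))
```

The length of the plan corresponds to the count the real function returns, since one HASH is written per successfully decoded tool.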
- async classifiers.redis_vector_index.migrate_legacy_skill_hashes_to_redisearch(redis, *, embeddings_key, metadata_key)[source]
Backfill per-skill RediSearch HASHs from the legacy monolithic skill hashes.
Reads the two legacy monolithic hashes – embeddings_key (field per skill -> JSON vector) and metadata_key (field per skill -> JSON metadata) – and, for each skill, decodes its vector into a FLOAT32 numpy array and writes a per-key skill_emb: HASH via store_skill_embedding_hash(), migrating the data into the RediSearch-indexed layout. Skills whose vector JSON fails to parse are skipped; metadata lookups tolerate both bytes and str field keys and fall back to {"skill_id": sid} when absent or unparseable. This is the skill counterpart to migrate_legacy_tool_hashes_to_redisearch().
Side effects are the two HGETALL reads plus one HSET per migrated skill, and an info-level log of the count. It is called by the one-shot migration script classifiers/migrate_embeddings_redisearch.py.
- Parameters:
- Returns:
The number of per-skill HASHs written (0 if the legacy embeddings hash is empty or missing).
- Return type: