rag_system.file_rag_manager module

File-based RAG Manager.

Manages file indexing and retrieval using ChromaDB and OpenRouter embeddings. Returns entire files, not chunks, for complete context.

Features:
  • URL fetching support

  • PDF parsing via PyMuPDF

  • Chunked embeddings for better semantic matching

  • Full file retrieval on search

rag_system.file_rag_manager.extract_pdf_text(file_path)

Extract text from a PDF file using PyMuPDF.

Return type:

Optional[str]

Parameters:

file_path (str)

rag_system.file_rag_manager.compress_pdf(file_path, output_path=None, remove_images=True)

Compress a PDF using PyMuPDF’s ez_save().

Return type:

Tuple[str, int, int]

Parameters:
  • file_path (str)

  • output_path (str | None)

  • remove_images (bool)

rag_system.file_rag_manager.chunk_text(text, chunk_size=1500, overlap=200)

Split text into overlapping chunks on paragraph/sentence boundaries.
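
The chunking strategy described above can be illustrated with a minimal sketch. This is not the module's implementation: the helper name sketch_chunk_text is hypothetical, only paragraph boundaries (blank lines) are considered, and sentence-level splitting is replaced by a plain character split for oversized paragraphs.

```python
from typing import List


def sketch_chunk_text(text: str, chunk_size: int = 1500, overlap: int = 200) -> List[str]:
    """Hypothetical illustration: pack paragraphs into chunks of at most
    chunk_size characters, carrying the last `overlap` characters forward."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            # Seed the next chunk with the tail of the previous one.
            current = (current[-overlap:] + "\n\n" + para) if overlap else para
        else:
            current = para
        # Hard-split anything still too long (for example one huge paragraph).
        while len(current) > chunk_size:
            chunks.append(current[:chunk_size])
            current = current[chunk_size - overlap:]
    if current:
        chunks.append(current)
    return chunks
```

With a small chunk_size, consecutive chunks share their boundary text, which is the property the overlap parameter is meant to provide.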

Return type:

List[str]

Parameters:
  • text (str)

  • chunk_size (int)

  • overlap (int)

async rag_system.file_rag_manager.fetch_url_content(url, timeout=30.0)

Fetch content from url. Returns (bytes, content_type, filename).

Return type:

Tuple[Optional[bytes], Optional[str], Optional[str]]

Parameters:
  • url (str)

  • timeout (float)

class rag_system.file_rag_manager.FileRAGManager(store_name='default', store_path=None, api_key=None, embedding_model='google/gemini-embedding-001', max_file_size=15728640, gemini_only=True, document_task_type=None, query_task_type=None)

Bases: object

File-based RAG with ChromaDB storage and OpenRouter embeddings.

Parameters:
  • store_name (str)

  • store_path (str | None)

  • api_key (str | None)

  • embedding_model (str)

  • max_file_size (int)

  • gemini_only (bool)

  • document_task_type (str | None)

  • query_task_type (str | None)

__init__(store_name='default', store_path=None, api_key=None, embedding_model='google/gemini-embedding-001', max_file_size=15728640, gemini_only=True, document_task_type=None, query_task_type=None)

Initialize the instance.

Parameters:
  • store_name (str) – Name of the store (default 'default').

  • store_path (Optional[str]) – Path of the store; defaults to None.

  • api_key (Optional[str]) – API key; defaults to None.

  • embedding_model (str) – Embedding model identifier (default 'google/gemini-embedding-001').

  • max_file_size (int) – Maximum file size; the default 15728640 equals 15 × 1024 × 1024.

  • gemini_only (bool) – Use only the Gemini API for embeddings.

  • document_task_type (Optional[str]) – Optional Gemini task type for indexed text (e.g. RETRIEVAL_DOCUMENT).

  • query_task_type (Optional[str]) – Optional Gemini task type for search queries (e.g. RETRIEVAL_QUERY).

index_file(file_path, tags=None, use_chunking=True, chunk_size=1500, chunk_overlap=200, force=False)

Index a single file into the collection.

When force is True the content-hash dedup check is skipped so the file is always re-embedded (but the store is not cleared).
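
The interaction between the dedup check and force can be sketched as follows. The helper should_embed, the in-memory seen_hashes mapping, and the choice of SHA-256 are all assumptions made for illustration; the source only states that a content hash is used and that force skips the check.

```python
import hashlib
from typing import Dict


def should_embed(path: str, content: bytes, seen_hashes: Dict[str, str],
                 force: bool = False) -> bool:
    """Hypothetical helper: decide whether a file needs (re-)embedding."""
    digest = hashlib.sha256(content).hexdigest()
    unchanged = seen_hashes.get(path) == digest
    # Record the latest hash; entries for other files are left untouched,
    # mirroring "the store is not cleared".
    seen_hashes[path] = digest
    return force or not unchanged
```

An unchanged file is skipped on the second call unless force=True is passed, while a changed file is always embedded again.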

Return type:

Dict[str, Any]

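Parameters:
  • file_path (str)

  • tags (List[str] | None)

  • use_chunking (bool)

  • chunk_size (int)

  • chunk_overlap (int)

  • force (bool)

async index_url(url, tags=None, use_chunking=True, chunk_size=1500, chunk_overlap=200)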

Fetch the content at url and index it.

Parameters:
  • url (str) – URL string.

  • tags (Optional[List[str]]) – Optional tags for the indexed content.

  • use_chunking (bool) – Whether to chunk the content before embedding (default True).

  • chunk_size (int) – Chunk size (default 1500).

  • chunk_overlap (int) – Overlap between consecutive chunks (default 200).

Returns:

The result.

Return type:

Dict[str, Any]

index_directory(directory_path, recursive=True, tags=None, exclude_patterns=None, max_workers=6, force=False, allowed_extensions=None)

Index all supported files in directory_path.

When max_workers > 1, files are indexed concurrently using a thread pool. Each file’s embedding batches are already parallelised inside the embedding function, so even max_workers=1 benefits from concurrent API calls.

force bypasses the per-file content-hash dedup check without clearing the store, so already-indexed files get re-embedded.

When allowed_extensions is set, only files whose suffix appears in allowed_extensions are queued; each entry is first normalized to lowercase with a leading dot. None means no extension filter, so all types listed in SUPPORTED_EXTENSIONS are indexed.
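
The extension filter and the thread-pool fan-out can be sketched as below. The names SKETCH_SUPPORTED, normalize_extensions, select_files and index_all are hypothetical, the supported set is a stand-in for the module's SUPPORTED_EXTENSIONS, and intersecting the filter with the supported set is an assumption about how the two interact.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Optional, Set

# Hypothetical stand-in for the module's SUPPORTED_EXTENSIONS.
SKETCH_SUPPORTED: Set[str] = {".txt", ".md", ".pdf"}


def normalize_extensions(extensions: Iterable[str]) -> Set[str]:
    """Lowercase each suffix and give it exactly one leading dot."""
    return {"." + ext.strip().lstrip(".").lower() for ext in extensions}


def select_files(paths: Iterable[Path],
                 allowed_extensions: Optional[Iterable[str]] = None) -> List[Path]:
    """Keep supported files, optionally narrowed to allowed_extensions."""
    wanted = SKETCH_SUPPORTED
    if allowed_extensions is not None:
        wanted = wanted & normalize_extensions(allowed_extensions)
    return [p for p in paths if p.suffix.lower() in wanted]


def index_all(paths: List[Path], index_one: Callable[[Path], str],
              max_workers: int = 6) -> Dict[str, str]:
    """Run index_one over paths in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max(1, max_workers)) as pool:
        results = list(pool.map(index_one, paths))
    return {str(p): r for p, r in zip(paths, results)}
```

Passing ["pdf", ".MD"] and [".pdf", ".md"] selects the same files, which is the effect of the normalization step.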

Return type:

Dict[str, Any]

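Parameters:
  • directory_path

  • recursive (bool)

  • tags (List[str] | None)

  • exclude_patterns

  • max_workers (int)

  • force (bool)

  • allowed_extensions

search(query, n_results=5, tags=None, return_content=True, query_embedding=None, max_content_size=8000)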

Semantic search returning relevant chunks per file.

Instead of returning entire file contents, this collects the matching chunk texts that ChromaDB found and merges them (respecting max_content_size). Small files whose full text fits within one chunk are returned in full automatically.

Parameters:
  • query (str) – Natural-language search query.

  • n_results (int) – Maximum number of files to return.

  • tags (Optional[List[str]]) – Optional tag filter.

  • return_content (bool) – Include chunk text in results.

  • query_embedding (list[float] | None) – Pre-computed query embedding (skips ChromaDB’s internal embedding call).

  • max_content_size (int) – Maximum characters of merged chunk text to return per file (default 8000).

Return type:

List[Dict[str, Any]]
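
The per-file merging behaviour of search can be sketched under some assumptions. The function merge_chunks_per_file is hypothetical, the input is a plain list of (file_path, chunk_text) pairs ordered best match first rather than a real ChromaDB result, and the result keys "file_path" and "content" are invented for the example.

```python
from typing import Dict, List, Tuple


def merge_chunks_per_file(hits: List[Tuple[str, str]], n_results: int = 5,
                          max_content_size: int = 8000) -> List[Dict[str, str]]:
    """Group matching chunks by file and cap the merged text per file."""
    merged: Dict[str, str] = {}
    for file_path, chunk in hits:
        if file_path not in merged:
            if len(merged) >= n_results:
                continue  # already have enough distinct files
            merged[file_path] = ""
        current = merged[file_path]
        separator = "\n\n" if current else ""
        room = max_content_size - len(current) - len(separator)
        if room > 0 and chunk:
            # Truncate the chunk so the merged text never exceeds the cap.
            merged[file_path] = current + separator + chunk[:room]
    return [{"file_path": p, "content": c} for p, c in merged.items()]
```

Files are returned in the order of their first matching chunk, and a file with a single short chunk comes back whole.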

remove_file(file_path)

Remove the specified file from the store.

Parameters:

file_path (str) – Path of the file to remove.

Returns:

The result.

Return type:

Dict[str, Any]

remove_url(url)

Remove the specified URL from the store.

Parameters:

url (str) – URL string.

Returns:

The result.

Return type:

Dict[str, Any]

list_indexed_files(limit=100)

List indexed files.

Parameters:

limit (int) – Maximum number of items.

Returns:

The result.

Return type:

List[Dict[str, Any]]

list_store_files()

List store files.

Returns:

The result.

Return type:

List[Dict[str, Any]]

read_store_file(filename)

Read store file.

Parameters:

filename (str) – Name of the file to read.

Returns:

The result.

Return type:

Dict[str, Any]

close()

Release the ChromaDB client and its underlying SQLite resources.

Return type:

None

get_stats()

Return statistics for this store.

Returns:

The result.

Return type:

Dict[str, Any]

clear()

Clear the store.

Returns:

The result.

Return type:

Dict[str, Any]

rag_system.file_rag_manager.get_rag_store(store_name='default', api_key=None, max_file_size=None, gemini_only=True, document_task_type=None, query_task_type=None)

Get or create a RAG store by name (LRU-cached).

At most _STORE_REGISTRY_MAX_SIZE stores are kept open simultaneously. When a new store would exceed the limit the least recently used entry is closed and evicted.

Cache entries are keyed by store_name plus optional embedding task types so different embedding configurations do not share one client.
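
The caching behaviour described above can be sketched with an OrderedDict. The StoreRegistry class and its factory argument are hypothetical; the only details taken from the description are the size limit, the close-on-evict step, and the cache key built from store_name plus the two task types.

```python
from collections import OrderedDict
from typing import Any, Callable, Optional, Tuple

Key = Tuple[str, Optional[str], Optional[str]]


class StoreRegistry:
    """Hypothetical LRU registry that closes stores when evicting them."""

    def __init__(self, max_size: int, factory: Callable[..., Any]) -> None:
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self._max_size = max_size
        self._factory = factory
        self._stores: "OrderedDict[Key, Any]" = OrderedDict()

    def get(self, store_name: str, document_task_type: Optional[str] = None,
            query_task_type: Optional[str] = None) -> Any:
        key = (store_name, document_task_type, query_task_type)
        if key in self._stores:
            # Mark as most recently used.
            self._stores.move_to_end(key)
            return self._stores[key]
        while len(self._stores) >= self._max_size:
            # Evict the least recently used entry and release its resources.
            _, evicted = self._stores.popitem(last=False)
            evicted.close()
        store = self._factory(*key)
        self._stores[key] = store
        return store
```

Because the task types are part of the key, requesting the same store_name with a different query_task_type yields a separate instance.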

Return type:

FileRAGManager

Parameters:
  • store_name (str)

  • api_key (str | None)

  • max_file_size (int | None)

  • gemini_only (bool)

  • document_task_type (str | None)

  • query_task_type (str | None)

rag_system.file_rag_manager.get_stargazer_docs_store()

Return the shared RAG store for Sphinx / tool documentation.

Uses RETRIEVAL_DOCUMENT for indexed chunks and RETRIEVAL_QUERY for search queries (Gemini embedding task types).

Return type:

FileRAGManager

rag_system.file_rag_manager.list_rag_stores()

List all available RAG store directory names.

Return type:

List[str]

rag_system.file_rag_manager.list_rag_stores_with_stats()

List stores with file counts using only filesystem ops (no ChromaDB).

Counts physical files in each store’s files/ subdirectory as a lightweight proxy for the indexed entry count. This never opens a ChromaDB client and therefore uses zero additional RAM.
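
A filesystem-only count of this kind can be sketched with pathlib. The function count_store_files, its stores_root argument, and the result keys "name" and "file_count" are assumptions; only the files/ subdirectory layout comes from the description above.

```python
from pathlib import Path
from typing import Any, Dict, List


def count_store_files(stores_root: Path) -> List[Dict[str, Any]]:
    """Count regular files in each store's files/ subdirectory."""
    stats: List[Dict[str, Any]] = []
    for store_dir in sorted(p for p in stores_root.iterdir() if p.is_dir()):
        files_dir = store_dir / "files"
        count = 0
        if files_dir.is_dir():
            count = sum(1 for f in files_dir.iterdir() if f.is_file())
        stats.append({"name": store_dir.name, "file_count": count})
    return stats
```

No database client is opened at any point, so the cost is a directory listing per store.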

Return type:

List[Dict[str, Any]]

rag_system.file_rag_manager.delete_rag_store(store_name)

Delete a RAG store completely.

Return type:

Dict[str, Any]

Parameters:

store_name (str)