rag_system.file_rag_manager module
File-based RAG Manager.
Manages file indexing and retrieval using ChromaDB and OpenRouter embeddings. Returns entire files, not chunks, for complete context.
Features:
- URL fetching support
- PDF parsing via PyMuPDF
- Chunked embeddings for better semantic matching
- Full file retrieval on search
- rag_system.file_rag_manager.extract_pdf_text(file_path)[source]
Extract text from a PDF file using PyMuPDF.
- rag_system.file_rag_manager.compress_pdf(file_path, output_path=None, remove_images=True)[source]
Compress a PDF using PyMuPDF’s ez_save().
- rag_system.file_rag_manager.chunk_text(text, chunk_size=1500, overlap=200)[source]
Split text into overlapping chunks on paragraph/sentence boundaries.
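The chunking idea can be sketched as follows. This is a minimal illustration and not the module's implementation: the name sketch_chunk_text is hypothetical, it splits on blank-line paragraph boundaries only (no sentence splitting), and it keeps an oversized paragraph whole.

```python
def sketch_chunk_text(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    """Illustrative only: pack paragraphs into chunks of about chunk_size characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        # Keep growing the chunk while it fits; a lone oversized paragraph
        # is accepted as-is (assumption made to keep the sketch short).
        if len(candidate) <= chunk_size or not current:
            current = candidate
            continue
        chunks.append(current)
        # Seed the next chunk with the tail of the previous one (the overlap).
        tail = current[-overlap:] if overlap else ""
        current = f"{tail}\n\n{para}" if tail else para
    if current:
        chunks.append(current)
    return chunks
```

With chunk_size=90 and overlap=10, three 40-character paragraphs yield two chunks, and the second chunk starts with the last 10 characters of the first.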
- async rag_system.file_rag_manager.fetch_url_content(url, timeout=30.0)[source]
Fetch content from url. Returns (bytes, content_type, filename).
- class rag_system.file_rag_manager.FileRAGManager(store_name='default', store_path=None, api_key=None, embedding_model='google/gemini-embedding-001', max_file_size=15728640, gemini_only=True, document_task_type=None, query_task_type=None)[source]
Bases: object
File-based RAG with ChromaDB storage and OpenRouter embeddings.
- __init__(store_name='default', store_path=None, api_key=None, embedding_model='google/gemini-embedding-001', max_file_size=15728640, gemini_only=True, document_task_type=None, query_task_type=None)[source]
Initialize the instance.
- Parameters:
store_name (str) – The store name value.
embedding_model (str) – The embedding model value.
max_file_size (int) – The max file size value.
gemini_only (bool) – Use only the Gemini API for embeddings.
document_task_type (Optional[str]) – Optional Gemini task type for indexed text (e.g. RETRIEVAL_DOCUMENT).
query_task_type (Optional[str]) – Optional Gemini task type for search queries (e.g. RETRIEVAL_QUERY).
- index_file(file_path, tags=None, use_chunking=True, chunk_size=1500, chunk_overlap=200, force=False)[source]
Index a single file into the collection.
When force is True the content-hash dedup check is skipped so the file is always re-embedded (but the store is not cleared).
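The interaction between the content-hash dedup check and force can be sketched like this. Everything here is a hypothetical stand-in: the SketchIndex class, the use of SHA-256, and keying the hash by file path are assumptions made for illustration, not details confirmed by the module.

```python
import hashlib


class SketchIndex:
    """Hypothetical in-memory stand-in for a store; not the real FileRAGManager."""

    def __init__(self) -> None:
        self.hashes: dict[str, str] = {}  # file path -> content hash (assumed layout)
        self.embed_calls = 0              # counts how often we "re-embed"

    def index(self, path: str, content: bytes, force: bool = False) -> bool:
        digest = hashlib.sha256(content).hexdigest()
        # Dedup check: skip unchanged content unless force is set.
        if not force and self.hashes.get(path) == digest:
            return False
        self.embed_calls += 1  # placeholder for the actual embedding work
        self.hashes[path] = digest  # existing entries are kept, nothing is cleared
        return True
```

Indexing the same bytes twice returns False the second time; passing force=True re-embeds while leaving the other entries in place.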
- async index_url(url, tags=None, use_chunking=True, chunk_size=1500, chunk_overlap=200)[source]
Fetch the content at url and index it into the collection.
- index_directory(directory_path, recursive=True, tags=None, exclude_patterns=None, max_workers=6, force=False, allowed_extensions=None)[source]
Index all supported files in directory_path.
When max_workers > 1, files are indexed concurrently using a thread pool. Each file’s embedding batches are already parallelised inside the embedding function, so even max_workers=1 benefits from concurrent API calls.
force bypasses the per-file content-hash dedup check without clearing the store, so already-indexed files get re-embedded.
When allowed_extensions is set, only files whose suffix (after normalizing to a leading dot, lowercase) appears in that collection are queued; None means no extension filter (all supported types under SUPPORTED_EXTENSIONS).
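The extension normalization described for allowed_extensions might look like the sketch below. The helper names normalize_extensions and filter_paths are hypothetical, and the sketch treats None as "keep everything" rather than consulting SUPPORTED_EXTENSIONS.

```python
from pathlib import Path


def normalize_extensions(extensions):
    """Return a set of lowercase suffixes that each start with a single dot."""
    return {"." + ext.strip().lower().lstrip(".") for ext in extensions}


def filter_paths(paths, allowed_extensions=None):
    """Keep only paths whose suffix is allowed; None disables the filter."""
    if allowed_extensions is None:
        # Assumption: the real method would still restrict to supported types here.
        return list(paths)
    allowed = normalize_extensions(allowed_extensions)
    return [p for p in paths if Path(p).suffix.lower() in allowed]
```

Under these assumptions, "pdf", ".PDF", and ".pdf" all select the same files.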
- search(query, n_results=5, tags=None, return_content=True, query_embedding=None, max_content_size=8000)[source]
Semantic search returning relevant chunks per file.
Instead of returning entire file contents, this collects the matching chunk texts that ChromaDB found and merges them (respecting max_content_size). Small files whose full text fits within one chunk are returned in full automatically.
- Parameters:
query (str) – Natural-language search query.
n_results (int) – Maximum number of files to return.
return_content (bool) – Include chunk text in results.
query_embedding (list[float] | None) – Pre-computed query embedding (skips ChromaDB’s internal embedding call).
max_content_size (int) – Maximum characters of merged chunk text to return per file (default 8000).
- Return type:
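The per-file merging described for search can be illustrated with a small sketch. The function name merge_hits, the (file_path, chunk_text) input shape, and the result dictionaries are all hypothetical; the sketch only shows one way to group chunk hits by file while capping the merged text at max_content_size characters (separators between chunks are not counted toward the cap).

```python
def merge_hits(hits, n_results=5, max_content_size=8000):
    """Group best-first (file_path, chunk_text) hits by file with a per-file budget."""
    merged: dict[str, list[str]] = {}
    used: dict[str, int] = {}
    for path, chunk in hits:
        if path not in merged:
            if len(merged) >= n_results:
                continue  # enough distinct files already collected
            merged[path] = []
            used[path] = 0
        remaining = max_content_size - used[path]
        if remaining <= 0:
            continue  # this file's character budget is exhausted
        piece = chunk[:remaining]  # truncate the last chunk to fit the budget
        merged[path].append(piece)
        used[path] += len(piece)
    return [{"file": p, "content": "\n\n".join(parts)} for p, parts in merged.items()]
```

Files appear in the order of their best-ranked chunk, and each file's content never holds more than max_content_size characters of chunk text.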
- rag_system.file_rag_manager.get_rag_store(store_name='default', api_key=None, max_file_size=None, gemini_only=True, document_task_type=None, query_task_type=None)[source]
Get or create a RAG store by name (LRU-cached).
At most _STORE_REGISTRY_MAX_SIZE stores are kept open simultaneously. When a new store would exceed the limit, the least recently used entry is closed and evicted.
Cache entries are keyed by store_name plus optional embedding task types so different embedding configurations do not share one client.
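An LRU registry with close-on-evict, keyed by store name plus task types, could be sketched as below. SketchStore, SketchRegistry, and the max_size default are hypothetical; the real registry's size constant and store class are not shown in this reference.

```python
from collections import OrderedDict


class SketchStore:
    """Hypothetical store with a close() hook."""

    def __init__(self, name):
        self.name = name
        self.closed = False

    def close(self):
        self.closed = True


class SketchRegistry:
    """Hypothetical LRU cache of open stores."""

    def __init__(self, max_size=2):
        self.max_size = max_size
        self._stores = OrderedDict()  # oldest entry first

    def get(self, store_name, document_task_type=None, query_task_type=None):
        key = (store_name, document_task_type, query_task_type)
        if key in self._stores:
            self._stores.move_to_end(key)  # mark as most recently used
            return self._stores[key]
        if len(self._stores) >= self.max_size:
            _, evicted = self._stores.popitem(last=False)  # least recently used
            evicted.close()
        store = SketchStore(store_name)
        self._stores[key] = store
        return store
```

Because the task types are part of the key, requesting the same store name with a different task type yields a separate instance.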
- rag_system.file_rag_manager.get_stargazer_docs_store()[source]
Return the shared RAG store for Sphinx / tool documentation.
Uses RETRIEVAL_DOCUMENT for indexed chunks and RETRIEVAL_QUERY for search queries (Gemini embedding task types).
- Return type:
- rag_system.file_rag_manager.list_rag_stores()[source]
List all available RAG store directory names.
- rag_system.file_rag_manager.list_rag_stores_with_stats()[source]
List stores with file counts using only filesystem ops (no ChromaDB).
Counts physical files in each store’s files/ subdirectory as a lightweight proxy for the indexed entry count. This never opens a ChromaDB client and therefore uses zero additional RAM.
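A filesystem-only count of this kind can be sketched with pathlib. The function name count_store_files, the root argument, and the returned dictionary shape are assumptions for illustration; only the files/ subdirectory convention comes from the description above.

```python
from pathlib import Path


def count_store_files(root):
    """Map each store directory under root to the number of files in its files/ subdir."""
    stats = {}
    for store_dir in sorted(Path(root).iterdir()):
        if not store_dir.is_dir():
            continue  # ignore stray files next to the store directories
        files_dir = store_dir / "files"
        if files_dir.is_dir():
            count = sum(1 for p in files_dir.iterdir() if p.is_file())
        else:
            count = 0  # store exists but nothing has been indexed yet
        stats[store_dir.name] = count
    return stats
```

No database client is opened at any point, so the cost is limited to directory listings.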